WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Flickr results do not use "raw" (human readable) tags #4906

Open zackkrida opened 2 months ago

zackkrida commented 2 months ago

Description

Images ingested into Openverse from Flickr are using Flickr tags in a non-optimal way. Observe the following Openverse result's tags:

https://openverse.org/image/ea4dff9b-7337-47ab-9fac-c9c4bd7860a9

[Screenshot: the result's tags as displayed on Openverse]

As you can plainly see, many of the tags are multi-word phrases that are compressed into single words with spaces removed. For example:

When viewing the result on Flickr, the tags look correct:

[Image: the same photo's tags as displayed on Flickr]

So, what is going on?

Well, the search endpoint in Flickr, which we use in our Flickr DAG, returns the "cleaned" version of the tags. This is the version used in URLs and as identifiers on Flickr, as documented here:

https://www.flickr.com/services/api/misc.tags.html

When querying the single result for an image with Flickr's flickr.photos.getInfo endpoint, like so:

http https://api.flickr.com/services/rest method==flickr.photos.getInfo api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1 | jq '.photo.tags.tag[].raw'

You can see that the "raw" human-readable tags are available:

"great depression"
"national archives"
"recession"
"depression"
"cardboard house"
"cotton dress"
"poor"
"financial ruin"
"economic disaster"
"sharecroppers"
"the grapes of wrath"
"Tom Joad"
"the crisis"
"le crise"
"la crisis"
"coca-cola"
"1930"
"Farm Security Administration-Office of War Information Collection"
"FSA-OWI"
"Jack Whinery"
"homesteaders"
"Pie Town, New Mexico"
"Evan Lawrence Bench"

It is these tags we should be using in Openverse.

This presents a technical challenge to us in that these tags are only accessible via single results.

Here is the payload for a single tag, from the list of tags returned by flickr.photos.getInfo:

id  "2045382-2750282427-19380346"
author  "19762676@N00"
authorname  "austinevan"
raw "Pie Town, New Mexico"
_content    "pietownnewmexico"
machine_tag 0

Edit: I also just noticed that flickr.tags.getListPhoto might be a better endpoint to use, as it only returns tags:

http https://api.flickr.com/services/rest method==flickr.tags.getListPhoto api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1
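For reference, here is a minimal Python sketch of the same call using only the standard library. The helper names (`build_tags_url`, `extract_raw_tags`, `fetch_raw_tags`) are hypothetical and not taken from the Flickr DAG, and the JSON shape is assumed to mirror the `.photo.tags.tag[]` path used in the getInfo command above.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FLICKR_REST = "https://api.flickr.com/services/rest"


def build_tags_url(photo_id: str, api_key: str) -> str:
    """Build the request URL for flickr.tags.getListPhoto."""
    params = {
        "method": "flickr.tags.getListPhoto",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{FLICKR_REST}?{urlencode(params)}"


def extract_raw_tags(payload: dict) -> list[str]:
    """Return the human-readable ("raw") form of every tag in the payload.

    Assumes the response nests tags under photo -> tags -> tag.
    """
    tags = payload.get("photo", {}).get("tags", {}).get("tag", [])
    return [tag["raw"] for tag in tags]


def fetch_raw_tags(photo_id: str, api_key: str) -> list[str]:
    """Fetch raw tags for one photo; each call counts against the rate limit."""
    with urlopen(build_tags_url(photo_id, api_key), timeout=30) as resp:
        return extract_raw_tags(json.load(resp))
```

Keeping the parsing in `extract_raw_tags` separate from the network call makes it possible to test against a saved payload without spending API quota.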

sarayourfriend commented 2 months ago

> This presents a technical challenge to us in that these tags are only accessible via single results.

Sounds like it might be relevant to the #4452 work, where we are specifically building the ability to pull from the single results endpoint in Flickr. After that, we will be able to backfill for existing works...

Otherwise, would pulling the tags list per result at ingestion time be the way to go for newly ingested works?

zackkrida commented 2 months ago

@sarayourfriend it is most certainly relevant! I think it's very likely that whatever solution is adopted in #4452 would be the most appropriate way to solve this problem. So much of the thinking there is applicable here, including the importance of preserving the original tags, in some fashion.

Otherwise, if we did want to fix this particular issue at ingestion time, we would need to decide whether the number of API calls to Flickr would be acceptable. Here are some quick stats from the last run of the Flickr provider DAG, keeping in mind our 3600 requests per hour limit from Flickr:

We'd then need to make 105,794 individual requests which would take ~30hrs, assuming I understand Flickr's rate limiting correctly. If we naively assumed that daily, we give 1hr to the Flickr dag, and 23 hrs to retrieving tags, we could retrieve tags for 82,800 (3600 api calls * 23 hrs) records a day.
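The arithmetic can be checked with a few lines of Python. This is only a sketch under the same assumptions as the paragraph above: one request per record, a flat 3600 requests per hour, and the same volume of new records every day. The variable names are illustrative.

```python
RATE_LIMIT_PER_HOUR = 3600   # Flickr limit quoted above
REQUESTS_NEEDED = 105_794    # one request per record, figure quoted above
HOURS_FOR_TAGS = 23          # 24 h minus 1 h reserved for the Flickr DAG

# Time to fetch tags for every record at the full rate limit.
hours_needed = REQUESTS_NEEDED / RATE_LIMIT_PER_HOUR

# Records whose tags can be fetched in the remaining 23 hours of a day.
daily_capacity = RATE_LIMIT_PER_HOUR * HOURS_FOR_TAGS

# Records left over each day if the daily volume stays the same.
daily_shortfall = REQUESTS_NEEDED - daily_capacity

print(f"hours needed:    {hours_needed:.1f}")
print(f"daily capacity:  {daily_capacity:,}")
print(f"daily shortfall: {daily_shortfall:,}")
```

Under these assumptions the daily shortfall comes to 22,994 records, so the backlog would grow every day rather than clear.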

It's probably worth connecting with Flickr to confirm the rate limiting; I can't recall if we have any unique permissions or anything like that.

zackkrida commented 2 months ago

Oh and of course, I'm very curious to hear from @WordPress/openverse-catalog here.

sarayourfriend commented 2 months ago

> We'd then need to make 105,794 individual requests which would take ~30hrs, assuming I understand Flickr's rate limiting correctly. If we naively assumed that daily, we give 1hr to the Flickr dag, and 23 hrs to retrieving tags, we could retrieve tags for 82,800 (3600 api calls * 23 hrs) records a day.

In other words, to clarify, we would have a compounding deficit of roughly 23k works per day (105,794 needed minus 82,800 retrievable) that we still need to pull tags for. Said another way, we would perpetually be about 6.4 * N hours of API quota behind on Flickr, where N is the number of days since we started ingesting raw tags.

It would be really great to know if there's some way Flickr could enable access to the raw tags in the bulk endpoints. Doesn't seem like it's tenable to make an individual request per image, neither for a backfill using the tools from #4452 nor during initial ingestion of new works. It would prevent any other Flickr operations from happening (like targeted reingestion), because we'd be eating up our api quota at all times on pulling tags.


For posterity, there is also flickr.tags.getListPhoto for getting just the list of tags for a photo (rather than the image's full info, which may be excessive).

Example from the image you shared:

<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
  <photo id="2750282427">
    <tags>
      <tag id="2045382-2750282427-56404" author="19762676@N00" authorname="austinevan" raw="great depression" machine_tag="0">greatdepression</tag>
      <tag id="2045382-2750282427-130348" author="19762676@N00" authorname="austinevan" raw="national archives" machine_tag="0">nationalarchives</tag>
      <tag id="2045382-2750282427-378739" author="19762676@N00" authorname="austinevan" raw="recession" machine_tag="0">recession</tag>
      <tag id="2045382-2750282427-16400" author="19762676@N00" authorname="austinevan" raw="depression" machine_tag="0">depression</tag>
      <tag id="2045382-2750282427-953073" author="19762676@N00" authorname="austinevan" raw="cardboard house" machine_tag="0">cardboardhouse</tag>
      <tag id="2045382-2750282427-3045287" author="19762676@N00" authorname="austinevan" raw="cotton dress" machine_tag="0">cottondress</tag>
      <tag id="2045382-2750282427-6925" author="19762676@N00" authorname="austinevan" raw="poor" machine_tag="0">poor</tag>
      <tag id="2045382-2750282427-7870191" author="19762676@N00" authorname="austinevan" raw="financial ruin" machine_tag="0">financialruin</tag>
      <tag id="2045382-2750282427-10644475" author="19762676@N00" authorname="austinevan" raw="economic disaster" machine_tag="0">economicdisaster</tag>
      <tag id="2045382-2750282427-3141545" author="19762676@N00" authorname="austinevan" raw="sharecroppers" machine_tag="0">sharecroppers</tag>
      <tag id="2045382-2750282427-1169161" author="19762676@N00" authorname="austinevan" raw="the grapes of wrath" machine_tag="0">thegrapesofwrath</tag>
      <tag id="2045382-2750282427-3332918" author="19762676@N00" authorname="austinevan" raw="Tom Joad" machine_tag="0">tomjoad</tag>
      <tag id="2045382-2750282427-6365467" author="19762676@N00" authorname="austinevan" raw="the crisis" machine_tag="0">thecrisis</tag>
      <tag id="2045382-2750282427-36940796" author="19762676@N00" authorname="austinevan" raw="le crise" machine_tag="0">lecrise</tag>
      <tag id="2045382-2750282427-24814089" author="19762676@N00" authorname="austinevan" raw="la crisis" machine_tag="0">lacrisis</tag>
      <tag id="2045382-2750282427-23464" author="19762676@N00" authorname="austinevan" raw="coca-cola" machine_tag="0">cocacola</tag>
      <tag id="2045382-2750282427-123582" author="19762676@N00" authorname="austinevan" raw="1930" machine_tag="0">1930</tag>
      <tag id="2045382-2750282427-55789592" author="19762676@N00" authorname="austinevan" raw="Farm Security Administration-Office of War Information Collection" machine_tag="0">farmsecurityadministrationofficeofwarinformationcollection</tag>
      <tag id="2045382-2750282427-1778336" author="19762676@N00" authorname="austinevan" raw="FSA-OWI" machine_tag="0">fsaowi</tag>
      <tag id="2045382-2750282427-14992932" author="19762676@N00" authorname="austinevan" raw="Jack Whinery" machine_tag="0">jackwhinery</tag>
      <tag id="2045382-2750282427-4174958" author="19762676@N00" authorname="austinevan" raw="homesteaders" machine_tag="0">homesteaders</tag>
      <tag id="2045382-2750282427-19380346" author="19762676@N00" authorname="austinevan" raw="Pie Town, New Mexico" machine_tag="0">pietownnewmexico</tag>
      <tag id="2045382-2750282427-132732812" author="19762676@N00" authorname="austinevan" raw="Evan Lawrence Bench" machine_tag="0">evanlawrencebench</tag>
    </tags>
  </photo>
</rsp>
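A response like this can be turned into a cleaned-to-raw mapping with the standard library's `xml.etree.ElementTree`. The function name `cleaned_to_raw` and the abbreviated sample below are illustrative only; the sketch assumes the element and attribute names shown in the response above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response above (three of the tags).
SAMPLE = """
<rsp stat="ok">
  <photo id="2750282427">
    <tags>
      <tag id="2045382-2750282427-56404" raw="great depression" machine_tag="0">greatdepression</tag>
      <tag id="2045382-2750282427-23464" raw="coca-cola" machine_tag="0">cocacola</tag>
      <tag id="2045382-2750282427-19380346" raw="Pie Town, New Mexico" machine_tag="0">pietownnewmexico</tag>
    </tags>
  </photo>
</rsp>
"""


def cleaned_to_raw(xml_text: str) -> dict[str, str]:
    """Map each cleaned tag (element text) to its raw form (the raw attribute)."""
    root = ET.fromstring(xml_text.strip())
    if root.get("stat") != "ok":
        raise ValueError(f"Flickr returned stat={root.get('stat')!r}")
    return {tag.text: tag.get("raw") for tag in root.iter("tag")}


mapping = cleaned_to_raw(SAMPLE)
print(mapping["pietownnewmexico"])
```

The cleaned value is the form we currently store, so a mapping like this is what a backfill would need in order to swap in the raw tag for an existing record.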

Could we maintain our own reverse index of tags? TL;DR: Nope, not without major drawbacks.

I was thinking about how reliable it would be to maintain our own reverse index from Flickr's processed tags to raw tags. If it were reliable, we would only need to request tags for a photo when one of its tags was not already in our index. Of course, that relies on the reverse index being reliable, and the fact that "the grapes of wrath" and a mistyped "thegrapes of wrath" would apparently both turn into "thegrapesofwrath" calls that reliability into question.

Looking beyond English, I'm sure there are many examples of entirely different tags in different languages that normalise to the same Flickr-processed tag. Even in English, "a moral behaviour" and "amoral behaviour" are (to a large extent) opposites! There are probably ways of deciding the language of a work's tags based on other indications, but the example photo has tags in English, French, and Spanish.

The trade-offs would be huge. But if it's the only way we could do it without getting a generous grant from Flickr to be able to pull tags more rapidly, maybe we'd need to accept those trade-offs for the benefits we would see "most of the time". The worst case scenario is potentially very bad though, and maybe even worse than the current situation.
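The collision problem is easy to reproduce. The sketch below uses `approx_clean`, a hypothetical stand-in for Flickr's normalisation (lowercase, keep only letters and digits). It matches the examples in this thread, but Flickr's real rules may differ.

```python
from collections import defaultdict


def approx_clean(raw: str) -> str:
    """Approximate Flickr's cleaned tag: lowercase, drop non-alphanumerics.

    This is an assumption based on the examples in this thread,
    not Flickr's documented algorithm.
    """
    return "".join(ch for ch in raw.lower() if ch.isalnum())


def find_collisions(raw_tags: list[str]) -> dict[str, list[str]]:
    """Return cleaned tags that more than one distinct raw tag maps to."""
    index: dict[str, set[str]] = defaultdict(set)
    for raw in raw_tags:
        index[approx_clean(raw)].add(raw)
    return {
        cleaned: sorted(raws)
        for cleaned, raws in index.items()
        if len(raws) > 1
    }


collisions = find_collisions([
    "the grapes of wrath",
    "thegrapes of wrath",
    "a moral behaviour",
    "amoral behaviour",
    "poor",
])
for cleaned, raws in collisions.items():
    print(cleaned, "<-", raws)
```

Any cleaned tag that appears in `collisions` cannot be reversed unambiguously, which is the reliability problem described above.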

@zackkrida maybe a good opportunity to reach out to our Flickr contacts via CC? Maybe there are ways other than regular API calls we could get access to tags, which wouldn't play into our rate limit.

stacimc commented 2 months ago

Some quick thoughts:

...Actually, if we go down this route I think we could delete the Flickr reingestion DAG and instead only do backfills like #4452 🤔 As in, rather than trying to run ingestion for past ingestion dates (which we know is not effective for backfilling because of the issues with Flickr's API), we run reingestion on sets of image ids from our own catalog.