Open zackkrida opened 2 months ago
This presents a technical challenge to us in that these tags are only accessible via single results.
Sounds like it might be relevant to the #4452 work, where we are specifically building the ability to pull from the single results endpoint in Flickr. After that, we will be able to backfill for existing works...
Otherwise, would pulling the tags list per result at ingestion time be the way to go for newly ingested works?
@sarayourfriend it is most certainly relevant! I think it's very likely that whatever solution is adopted in #4452 would be the most appropriate way to solve this problem. So much of the thinking there is applicable here, including the importance of preserving the original tags, in some fashion.
Otherwise, if we did want to fix this particular issue at ingestion time, we would need to decide whether the number of API calls to Flickr would be acceptable. Here are some quick stats from the last run of the Flickr provider DAG, keeping in mind our 3600 requests per hour limit from Flickr:
We'd then need to make 105,794 individual requests, which would take ~30 hrs, assuming I understand Flickr's rate limiting correctly. If we naively assume that each day we give 1 hr to the Flickr DAG and 23 hrs to retrieving tags, we could retrieve tags for 82,800 records a day (3600 API calls * 23 hrs).
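To make the arithmetic above easy to re-run with different numbers, here is a rough sketch. The constants are taken from this comment (3600 requests per hour, 105,794 records, 23 hours a day for tags); the function names are made up for illustration and assume one API call per record.

```python
# Back-of-the-envelope quota math, assuming one API request per record.
RATE_LIMIT_PER_HOUR = 3600      # Flickr limit quoted in this thread
RECORDS_PER_RUN = 105_794       # records from the last DAG run, per this comment
TAG_HOURS_PER_DAY = 23          # 24 hrs minus the 1 hr naively given to the DAG


def hours_needed(records, rate=RATE_LIMIT_PER_HOUR):
    """Hours required to fetch tags for `records` works at `rate` calls/hr."""
    return records / rate


def daily_capacity(hours=TAG_HOURS_PER_DAY, rate=RATE_LIMIT_PER_HOUR):
    """Number of per-record tag requests we can afford per day."""
    return hours * rate


def daily_shortfall(records=RECORDS_PER_RUN):
    """Records left without tags after one day at full quota usage."""
    return max(0, records - daily_capacity())


if __name__ == "__main__":
    print(f"hours needed: {hours_needed(RECORDS_PER_RUN):.1f}")
    print(f"daily capacity: {daily_capacity()}")
    print(f"daily shortfall: {daily_shortfall()}")
```

If Flickr confirms a different limit, only `RATE_LIMIT_PER_HOUR` needs to change.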
It's probably worth connecting with Flickr to confirm the rate limiting; I can't recall if we have any unique permissions or anything like that.
Oh and of course, I'm very curious to hear from @WordPress/openverse-catalog here.
We'd then need to make 105,794 individual requests, which would take ~30 hrs, assuming I understand Flickr's rate limiting correctly. If we naively assume that each day we give 1 hr to the Flickr DAG and 23 hrs to retrieving tags, we could retrieve tags for 82,800 records a day (3600 API calls * 23 hrs).
In other words, to clarify, we would have a compounding deficit of 24k works to pull tags for each day. Said another way, we would be 1.5 * N hours behind on Flickr, perpetually, where N is the number of days since we started ingesting raw tags.
It would be really great to know if there's some way Flickr could enable access to the raw tags in the bulk endpoints. It doesn't seem tenable to make an individual request per image, either for a backfill using the tools from #4452 or during initial ingestion of new works. It would prevent any other Flickr operations from happening (like targeted reingestion), because we'd be eating up our API quota at all times on pulling tags.
For posterity, there is also flickr.tags.getListPhoto for getting just the list of tags for a photo (rather than the image's full info, which may be excessive).
Example from the image you shared:
<?xml version="1.0" encoding="utf-8" ?>
<rsp stat="ok">
<photo id="2750282427">
<tags>
<tag id="2045382-2750282427-56404" author="19762676@N00" authorname="austinevan" raw="great depression" machine_tag="0">greatdepression</tag>
<tag id="2045382-2750282427-130348" author="19762676@N00" authorname="austinevan" raw="national archives" machine_tag="0">nationalarchives</tag>
<tag id="2045382-2750282427-378739" author="19762676@N00" authorname="austinevan" raw="recession" machine_tag="0">recession</tag>
<tag id="2045382-2750282427-16400" author="19762676@N00" authorname="austinevan" raw="depression" machine_tag="0">depression</tag>
<tag id="2045382-2750282427-953073" author="19762676@N00" authorname="austinevan" raw="cardboard house" machine_tag="0">cardboardhouse</tag>
<tag id="2045382-2750282427-3045287" author="19762676@N00" authorname="austinevan" raw="cotton dress" machine_tag="0">cottondress</tag>
<tag id="2045382-2750282427-6925" author="19762676@N00" authorname="austinevan" raw="poor" machine_tag="0">poor</tag>
<tag id="2045382-2750282427-7870191" author="19762676@N00" authorname="austinevan" raw="financial ruin" machine_tag="0">financialruin</tag>
<tag id="2045382-2750282427-10644475" author="19762676@N00" authorname="austinevan" raw="economic disaster" machine_tag="0">economicdisaster</tag>
<tag id="2045382-2750282427-3141545" author="19762676@N00" authorname="austinevan" raw="sharecroppers" machine_tag="0">sharecroppers</tag>
<tag id="2045382-2750282427-1169161" author="19762676@N00" authorname="austinevan" raw="the grapes of wrath" machine_tag="0">thegrapesofwrath</tag>
<tag id="2045382-2750282427-3332918" author="19762676@N00" authorname="austinevan" raw="Tom Joad" machine_tag="0">tomjoad</tag>
<tag id="2045382-2750282427-6365467" author="19762676@N00" authorname="austinevan" raw="the crisis" machine_tag="0">thecrisis</tag>
<tag id="2045382-2750282427-36940796" author="19762676@N00" authorname="austinevan" raw="le crise" machine_tag="0">lecrise</tag>
<tag id="2045382-2750282427-24814089" author="19762676@N00" authorname="austinevan" raw="la crisis" machine_tag="0">lacrisis</tag>
<tag id="2045382-2750282427-23464" author="19762676@N00" authorname="austinevan" raw="coca-cola" machine_tag="0">cocacola</tag>
<tag id="2045382-2750282427-123582" author="19762676@N00" authorname="austinevan" raw="1930" machine_tag="0">1930</tag>
<tag id="2045382-2750282427-55789592" author="19762676@N00" authorname="austinevan" raw="Farm Security Administration-Office of War Information Collection" machine_tag="0">farmsecurityadministrationofficeofwarinformationcollection</tag>
<tag id="2045382-2750282427-1778336" author="19762676@N00" authorname="austinevan" raw="FSA-OWI" machine_tag="0">fsaowi</tag>
<tag id="2045382-2750282427-14992932" author="19762676@N00" authorname="austinevan" raw="Jack Whinery" machine_tag="0">jackwhinery</tag>
<tag id="2045382-2750282427-4174958" author="19762676@N00" authorname="austinevan" raw="homesteaders" machine_tag="0">homesteaders</tag>
<tag id="2045382-2750282427-19380346" author="19762676@N00" authorname="austinevan" raw="Pie Town, New Mexico" machine_tag="0">pietownnewmexico</tag>
<tag id="2045382-2750282427-132732812" author="19762676@N00" authorname="austinevan" raw="Evan Lawrence Bench" machine_tag="0">evanlawrencebench</tag>
</tags>
</photo>
</rsp>
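For anyone who wants to pull the raw tags out of a response like the one above, here is a minimal sketch using only the standard library. The function name and the trimmed sample are illustrative; it assumes the response has the same `rsp`/`photo`/`tags`/`tag` layout shown in the example, with the human-readable tag in the `raw` attribute and the cleaned tag as the element text.

```python
import xml.etree.ElementTree as ET


def extract_tags(xml_text):
    """Return (raw, cleaned) pairs from a flickr.tags.getListPhoto XML response."""
    root = ET.fromstring(xml_text)
    if root.get("stat") != "ok":
        raise ValueError(f"Flickr returned stat={root.get('stat')!r}")
    # `raw` holds the tag as the user typed it; the text node is the cleaned form.
    return [(tag.get("raw"), tag.text) for tag in root.iter("tag")]


# Trimmed copy of the example response (two tags only, declaration omitted).
SAMPLE = """
<rsp stat="ok">
<photo id="2750282427">
<tags>
<tag id="2045382-2750282427-56404" author="19762676@N00" authorname="austinevan" raw="great depression" machine_tag="0">greatdepression</tag>
<tag id="2045382-2750282427-19380346" author="19762676@N00" authorname="austinevan" raw="Pie Town, New Mexico" machine_tag="0">pietownnewmexico</tag>
</tags>
</photo>
</rsp>
"""

if __name__ == "__main__":
    for raw, cleaned in extract_tags(SAMPLE):
        print(f"{cleaned!r} <- {raw!r}")
```

Keeping both values in the pair would let us store the raw tag for display while still preserving the original cleaned tag.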
@zackkrida maybe a good opportunity to reach out to our Flickr contacts via CC? Maybe there are ways other than regular API calls we could get access to tags, which wouldn't play into our rate limit.
Some quick thoughts:
...Actually, if we go down this route I think we could delete the Flickr reingestion DAG and instead only do backfills like #4452 🤔 As in, rather than trying to run ingestion for past ingestion dates (which we know is not effective for backfilling because of the issues with Flickr's API), we run reingestion on sets of image ids from our own catalog.
Description
Images ingested into Openverse from Flickr are using Flickr tags in a non-optimal way. Observe the following Openverse result's tags:
https://openverse.org/image/ea4dff9b-7337-47ab-9fac-c9c4bd7860a9
As you can plainly see, many of the tags are multi-word phrases that are compressed into single words with spaces removed. For example:
When viewing the result on Flickr, the tags look correct:
So, what is going on?
Well, the search endpoint in Flickr, which we use in our Flickr DAG, returns the "cleaned" version of the tags. These are the versions used in URLs and as identifiers on Flickr, as documented here:
https://www.flickr.com/services/api/misc.tags.html
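To illustrate why the cleaned tags can't simply be converted back, here is a rough approximation of the cleaning step. This is not Flickr's documented algorithm; it is an assumption (lowercase, keep only letters and digits) that happens to reproduce the raw/cleaned pairs quoted in this issue, and `approximate_clean` is a made-up name.

```python
def approximate_clean(raw_tag):
    """Approximate Flickr's cleaned tag: lowercase and drop non-alphanumerics.

    Assumption for illustration only; Flickr's real rules may differ.
    """
    return "".join(ch for ch in raw_tag.lower() if ch.isalnum())


if __name__ == "__main__":
    for raw in ["great depression", "Pie Town, New Mexico", "coca-cola", "FSA-OWI"]:
        print(f"{raw!r} -> {approximate_clean(raw)!r}")

    # The mapping is lossy: distinct raw tags collapse to the same cleaned tag,
    # so the original spacing and punctuation cannot be recovered from it.
    print(approximate_clean("cardboard house") == approximate_clean("cardboardhouse"))
```

Because spaces, punctuation, and casing are discarded, the raw tag has to come from Flickr itself rather than being reconstructed on our side.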
When querying the single result for an image with Flickr's getImage endpoint, like so:
You can see that the "raw" human-readable tags are available:
"great depression" "national archives" "recession" "depression" "cardboard house" "cotton dress" "poor" "financial ruin" "economic disaster" "sharecroppers" "the grapes of wrath" "Tom Joad" "the crisis" "le crise" "la crisis" "coca-cola" "1930" "Farm Security Administration-Office of War Information Collection" "FSA-OWI" "Jack Whinery" "homesteaders" "Pie Town, New Mexico" "Evan Lawrence Bench"
It is these tags we should be using in Openverse.
This presents a technical challenge to us in that these tags are only accessible via single results.
Here is the payload for a single tag, from the list of tags returned by getImage:
Edit: I also just noticed that tags.getListPhoto might be a better endpoint to use, as it only returns tags:
http https://api.flickr.com/services/rest method==flickr.tags.getListPhoto api_key=={redacted} photo_id==2750282427 format==json nojsoncallback==1
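A rough Python equivalent of the request above, in case it is useful for experimenting from a script. The query parameters are the ones from the command; the function names are hypothetical, and the JSON layout (`photo` → `tags` → `tag`, each with a `raw` key) is an assumption that the JSON response mirrors the XML structure, so it is worth checking against a real response.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://api.flickr.com/services/rest"


def build_tags_url(photo_id, api_key):
    """Build the flickr.tags.getListPhoto URL with the same params as above."""
    params = {
        "method": "flickr.tags.getListPhoto",
        "api_key": api_key,
        "photo_id": photo_id,
        "format": "json",
        "nojsoncallback": 1,
    }
    return f"{API_URL}?{urlencode(params)}"


def raw_tags_from_response(payload):
    """Pull the raw tags out of a decoded response (assumed layout, see above)."""
    tags = payload.get("photo", {}).get("tags", {}).get("tag", [])
    return [tag["raw"] for tag in tags if "raw" in tag]


def fetch_raw_tags(photo_id, api_key):
    """Perform the request; each call counts against the hourly API quota."""
    with urlopen(build_tags_url(photo_id, api_key), timeout=30) as resp:
        return raw_tags_from_response(json.load(resp))
```

For example, `fetch_raw_tags("2750282427", api_key)` would make one request for the image discussed in this issue, given a valid key.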