WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Refactor Metropolitan Museum to use ProviderDataIngester #1519

Closed stacimc closed 2 years ago

stacimc commented 2 years ago

Suggested Improvement

Refactor the Metropolitan Museum provider script to use the new ProviderDataIngester base class.

Benefit

More details in WordPress/openverse-catalog#229

Implementation

rwidom commented 2 years ago

I started working on this, and found a couple of things from their repo documentation.

The CSV is small enough that it would be easy to process with pandas. But I don't know how often it gets updated, and it doesn't have image URLs, etc. I'd be happy to reach out to someone at the museum, but I wonder if it would be more effective coming from staff. Tagging @AetherUnbound, @stacimc and @sarayourfriend for your thoughts.

AetherUnbound commented 2 years ago

Thanks for looking into this! What were you thinking of asking the museum?

AetherUnbound commented 2 years ago

Making a note here that we'll want to be sure we incorporate the image titles in this refactor, see WordPress/openverse#1487

rwidom commented 2 years ago

> Thanks for looking into this! What were you thinking of asking the museum?

How often they update that extract (to evaluate the CSV route), and whether they have any more specific guidelines for rate limits that we could be using (if the API is really the better option).

> Making a note here that we'll want to be sure we incorporate the image titles in this refactor, see WordPress/openverse#1487

Yes, will do! And I think we should at least consider adding the artist name and some other things to tags. I think it's currently under creator, but I don't know whether creator is currently part of Elasticsearch. Is it? So this only brings back results from Flickr, even though the artist is in some of our existing test cases, and definitely at the Met.

I have a first draft using the API workflow that I could publish, but it needs some more testing, which I'm kind of blocked on. :/

AetherUnbound commented 2 years ago

Looking at this once more - based on the fact that the images are explicitly not included in the dataset, the foreign_identifiers we use for the Met don't exist in the dataset, and the last update to the dataset was 5 months ago (whereas our daily runs still frequently pull in data), I think our best bet is to stick with the API for now. They ask that users kindly make no more than 80 requests per second, which I believe we're already following in production. That is to say, we'll leave the data source as-is for now and move forward with your data ingester class refactor!
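
For reference, here is a minimal sketch of how a client could stay under an 80 requests per second guideline by spacing out calls. It is not the catalog's actual implementation; `RequestThrottle` and its parameters are hypothetical names used only for illustration, and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RequestThrottle:
    """Hypothetical helper: enforce a minimum interval between requests."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        # 80 requests per second means at least 1/80 s between requests.
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request may be sent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            # Re-read the clock so that an overshooting sleep is respected.
            now = self._clock()
        self._next_allowed = now + self.min_interval
        return delay
```

A caller would create one `RequestThrottle(80)` and call `wait()` immediately before each API request. Spacing requests evenly is stricter than the guideline requires (it rules out short bursts), which seems like a reasonable trade-off for a polite daily ingestion job.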