Audit Provider scripts and associated DAGs

zackkrida commented 2 years ago

Description

Some DAGs are broken. This can happen for a few reasons:

A provider releases a new version of their API
A scraping-based provider changes the HTML structure of their site

In any case, we should identify and (re)move currently-broken DAGs, perhaps by moving them to archive/ or somewhere else that indicates they do not currently work but should be revised.

AetherUnbound commented 2 years ago

We could add a DAG tag onto them that shows they're borked, but I like the idea of removing them from the active DAG list for now until we can get them corrected.

zackkrida commented 2 years ago

We should mkdir dags/borked and ignore it in Airflow lol
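A rough sketch of how that could work, assuming Airflow's default regexp-style `.airflowignore`, where each line is a pattern searched against file paths under the DAGs folder. The `borked/` directory name comes from the comment above; the helper functions and file names are hypothetical and only mimic the matching, they are not Airflow's own code.

```python
import re

# Contents a dags/.airflowignore file might have (one regexp per line).
AIRFLOWIGNORE = """
# quarantined provider DAGs
borked/
"""


def load_patterns(text):
    """Compile each non-empty, non-comment line as a regular expression."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line))
    return patterns


def is_ignored(relative_path, patterns):
    """Return True if any pattern is found anywhere in the path."""
    return any(p.search(relative_path) for p in patterns)


patterns = load_patterns(AIRFLOWIGNORE)
```

With this pattern, a hypothetical `borked/raw_pixel_workflow.py` would be skipped, while a DAG file left at the top level of `dags/` would still be parsed.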

AetherUnbound commented 2 years ago

Fine by me 😄

AetherUnbound commented 2 years ago

This is going to be related to the work in WordPress/openverse#1661

AetherUnbound commented 2 years ago

Looks like the Wikimedia Commons DAG isn't set to the right schedule 🤔 I may be interpreting things wrong, but I've created a ticket for it: WordPress/openverse#1643

AetherUnbound commented 2 years ago

Raw Pixel's API is giving us a 404:

[2022-02-24, 02:11:16 UTC] {dag_factory.py:130} INFO - Running provider function
[2022-02-24, 02:11:16 UTC] {raw_pixel.py:181} INFO - Begin: RawPixel API requests
[2022-02-24, 02:11:16 UTC] {raw_pixel.py:25} INFO - Processing request: https://api.rawpixel.com/api/v1/search
[2022-02-24, 02:11:17 UTC] {requester.py:49} WARNING - Unable to request URL: https://api.rawpixel.com/api/v1/search?freecc0=1&html=0&page=1.  Status code: 404
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:32} WARNING - Unable to request URL: https://api.rawpixel.com/api/v1/search. Status code: 404
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:187} INFO - Total images: 0
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:191} INFO - Terminated!

Perhaps we need an API key? When I visit api.rawpixel.com I get Access Denied

AetherUnbound commented 2 years ago

Looks like we may need an API key for Walters as well:

[2022-02-24, 02:17:06 UTC] {requester.py:47} ERROR - Authorization failed for URL: https://api.thewalters.org/v1/objects?accept=json&pageSize=100&orderBy=classification&classification=Miniatures&Page=1
[2022-02-24, 02:17:06 UTC] {requester.py:83} WARNING - Bad response_json:  None
[2022-02-24, 02:17:06 UTC] {requester.py:84} WARNING - Retrying:
_get_response_json(
    https://api.thewalters.org/v1/objects,
    {'accept': 'json', 'pageSize': 100, 'orderBy': 'classification', 'classification': 'Miniatures', 'apikey': None, 'Page': 1},
    retries=-1)
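The retry above is sent with `'apikey': None`, so the request can never succeed. A minimal sketch of failing fast instead, assuming the key is read from an environment variable: the variable name, `MissingApiKeyError`, and both helpers are hypothetical; only the query parameter names are copied from the log.

```python
import os


class MissingApiKeyError(RuntimeError):
    """Raised when a provider's API key is not configured."""


def require_api_key(env_var, environ=None):
    """Return the key stored in env_var, or fail fast with a clear message."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise MissingApiKeyError(f"{env_var} is not set; skipping API requests")
    return key


def build_query_params(api_key, page=1):
    """Build the query parameters seen in the log, with a real key filled in."""
    return {
        "accept": "json",
        "pageSize": 100,
        "orderBy": "classification",
        "classification": "Miniatures",
        "apikey": api_key,
        "Page": page,
    }
```

Calling `require_api_key("API_KEY_WALTERS")` (a made-up variable name) before the first request would surface a missing key as a single clear error rather than a series of authorization failures and retries.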

stacimc commented 2 years ago

Looks like an API key is needed for Brooklyn Museum as well:

[2022-02-25, 00:31:31 UTC] {brooklyn_museum.py:37} INFO - Begin: Brooklyn museum provider script
[2022-02-25, 00:31:31 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0.  Status code: 500
[2022-02-25, 00:31:32 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0.  Status code: 500
[2022-02-25, 00:31:33 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0.  Status code: 500

When I hit that endpoint I get a 500 with Access denied. Your API key doesn't have sufficient privileges for this API operation.

stacimc commented 2 years ago

For each of the following, the pull_data task was successful but load_data failed due to duplicate key errors:

Wikimedia Commons
Cleveland Museum
Metropolitan Museum
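The thread doesn't say what causes the duplicate key errors. One common cause is the same record appearing more than once in a pulled batch, and a sketch of guarding against that case could look like the following. The `provider` and `foreign_identifier` field names are assumptions for illustration, not confirmed column names.

```python
def dedupe_records(records, key_fields=("provider", "foreign_identifier")):
    """Collapse records sharing a key, keeping the last occurrence's data.

    Each key keeps the position where it was first seen.
    """
    unique = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        unique[key] = record  # later duplicates overwrite earlier ones
    return list(unique.values())
```

If the errors come from rows that already exist in the target table rather than from duplicates within a batch, this would not help and the load step itself would need to handle the conflict.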

stacimc commented 2 years ago

Museum Victoria pulled 0 images, getting 403s:

[2022-02-25, 18:38:14 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0.  Status code: 403
[2022-02-25, 18:38:19 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0.  Status code: 403
[2022-02-25, 18:38:24 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0.  Status code: 403

403 would imply API key? But I also checked the API docs and noticed that it looks like the correct parameter is hasimages rather than has_image: https://collections.museumsvictoria.com.au/developers
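A small sketch of the request URL with the parameter renamed as suggested above. Only the `has_image` to `hasimages` change comes from this comment; the other parameters are copied from the failing requests, `build_search_url` is a hypothetical helper, and whether the rename alone resolves the 403 is untested.

```python
from urllib.parse import urlencode

ENDPOINT = "https://collections.museumsvictoria.com.au/api/search"


def build_search_url(page, licence="cc by-nc-nd", per_page=100):
    """Build the search URL using hasimages instead of has_image."""
    params = {
        "hasimages": "yes",  # was has_image in the failing requests
        "perpage": per_page,
        "imagelicence": licence,
        "page": page,
    }
    return f"{ENDPOINT}?{urlencode(params)}"
```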

stacimc commented 2 years ago

API key needed for NYPL:

[2022-02-25, 19:09:00 UTC] {dag_factory.py:130} INFO - Running provider function
[2022-02-25, 19:09:00 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
[2022-02-25, 19:09:01 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
[2022-02-25, 19:09:02 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500

stacimc commented 2 years ago

Missing API key for the Smithsonian:

_get_response_json(
    https://api.si.edu/openaccess/api/v1.0/search,
    {'api_key': None, 'rows': 1000, 'q': 'online_media_type:Images AND media_usage:CC0 AND hash:00*', 'start': 0},
    retries=2)

stacimc commented 2 years ago

The DAGs have been run! Let's start with the good news:

These provider DAGs were completely successful 🎉 🥳 :

Jamendo
WordPress Photo Directory
Stocksnap
Staten Museum
Science Museum

We also successfully got the audio data from Wikimedia Commons, and a good chunk of data from Freesound (more on that later).

And now for the bad news. Outstanding issues:

These DAGs failed due to missing API keys:

Smithsonian
NYPL
Brooklyn Museum
Walters
Raw Pixel

These DAGs successfully pulled data, but failed in the loading step due to duplicate key errors:

Wikimedia Commons (when loading image data)
Cleveland Museum
Metropolitan Museum

And these DAGs had some more strange errors:

Freesound: the pull_data task successfully pulled thousands of records, but then failed during _get_audio_file_size. Tellingly, this function in the provider script contains the warning: "Freesound can be a bit finicky, so we want to retry it a few times" 😄
Museum Victoria looks like it may be using an outdated API
PhyloPic's logs appear to show that it's hitting the correct API using the correct date, but it reports that it found no content when it looks like content exists
Finnish Museum stalled on the INFO - Obtaining Images of building 0/Museovirasto/ step for 24 hours and hit the timeout.
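For flaky calls like Freesound's `_get_audio_file_size`, a generic retry with exponential backoff is one option. This is a sketch only: the `retry` helper and its parameters are hypothetical and are not the retry logic already in the provider script.

```python
import time


def retry(func, attempts=3, base_delay=1.0, exceptions=(Exception,), sleep=time.sleep):
    """Call func until it succeeds or attempts run out.

    The delay doubles after each failure; the final failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A cap on `attempts` also avoids the kind of open-ended stall seen with the Finnish Museum run, although a per-request timeout would still be needed for requests that hang rather than fail.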

I'm going to create issues for the above and record them here so this can be a tracking issue.

stacimc commented 2 years ago

Update: I'll actually create a new tracking issue as I believe that work should be part of the v1.2.0 milestone. I'll link it back here once it's created.

WordPress / openverse

Audit Provider scripts and associated DAGs #1687
