zackkrida closed this 2 years ago
We could add a DAG tag onto them that shows they're borked, but I like the idea of removing them from the active DAG list for now until we can get them corrected.
We should mkdir dags/borked and ignore it in Airflow lol
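A minimal sketch of that idea, assuming a local dags/ folder and Airflow's .airflowignore file, which lists patterns for paths the DAG parser should skip. The quarantine_dag helper and the borked directory name are hypothetical and not part of the existing codebase:

```python
import shutil
from pathlib import Path


def quarantine_dag(dag_file: Path, dags_folder: Path, bucket: str = "borked") -> Path:
    """Move a broken DAG file into <dags_folder>/<bucket>/ and have Airflow skip it.

    Hypothetical helper: the folder layout and names are assumptions.
    """
    target_dir = dags_folder / bucket
    target_dir.mkdir(parents=True, exist_ok=True)

    # Each line of .airflowignore is a pattern; matching paths are not parsed.
    ignore_file = dags_folder / ".airflowignore"
    pattern = f"{bucket}/"
    existing = ignore_file.read_text().splitlines() if ignore_file.exists() else []
    if pattern not in existing:
        ignore_file.write_text("\n".join(existing + [pattern]) + "\n")

    # shutil.move returns the destination path as a string.
    return Path(shutil.move(str(dag_file), str(target_dir / dag_file.name)))
```

Moving the file back out of the folder re-enables the DAG once it is fixed, so nothing is deleted in the meantime.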
Fine by me :smile:
This is going to be related to the work in WordPress/openverse#1661
Looks like Wikimedia Commons' DAG isn't set to the right schedule :thinking: I may be interpreting things wrong, but I've created a ticket for it: WordPress/openverse#1643
Raw Pixel's API is giving us a 404:
[2022-02-24, 02:11:16 UTC] {dag_factory.py:130} INFO - Running provider function
[2022-02-24, 02:11:16 UTC] {raw_pixel.py:181} INFO - Begin: RawPixel API requests
[2022-02-24, 02:11:16 UTC] {raw_pixel.py:25} INFO - Processing request: https://api.rawpixel.com/api/v1/search
[2022-02-24, 02:11:17 UTC] {requester.py:49} WARNING - Unable to request URL: https://api.rawpixel.com/api/v1/search?freecc0=1&html=0&page=1. Status code: 404
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:32} WARNING - Unable to request URL: https://api.rawpixel.com/api/v1/search. Status code: 404
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:187} INFO - Total images: 0
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:191} INFO - Terminated!
Perhaps we need an API key? When I visit api.rawpixel.com I get Access Denied
Looks like we may need an API key for Walters as well:
[2022-02-24, 02:17:06 UTC] {requester.py:47} ERROR - Authorization failed for URL: https://api.thewalters.org/v1/objects?accept=json&pageSize=100&orderBy=classification&classification=Miniatures&Page=1
[2022-02-24, 02:17:06 UTC] {requester.py:83} WARNING - Bad response_json: None
[2022-02-24, 02:17:06 UTC] {requester.py:84} WARNING - Retrying:
_get_response_json(
https://api.thewalters.org/v1/objects,
{'accept': 'json', 'pageSize': 100, 'orderBy': 'classification', 'classification': 'Miniatures', 'apikey': None, 'Page': 1},
retries=-1)
Looks like an API key is needed for the Brooklyn Museum as well:
[2022-02-25, 00:31:31 UTC] {brooklyn_museum.py:37} INFO - Begin: Brooklyn museum provider script
[2022-02-25, 00:31:31 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0. Status code: 500
[2022-02-25, 00:31:32 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0. Status code: 500
[2022-02-25, 00:31:33 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0. Status code: 500
When I hit that endpoint I get a 500 with the message "Access denied. Your API key doesn't have sufficient privileges for this API operation."
For each of the following, the pull_data task was successful but load_data failed due to duplicate key errors:
Museum Victoria pulled 0 images, getting 403s:
[2022-02-25, 18:38:14 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0. Status code: 403
[2022-02-25, 18:38:19 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0. Status code: 403
[2022-02-25, 18:38:24 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0. Status code: 403
403 would imply an API key is needed? But I also checked the API docs and noticed that it looks like the correct parameter is hasimages rather than has_image: https://collections.museumsvictoria.com.au/developers
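To make the parameter difference concrete, here is a small sketch that rebuilds the failing query with hasimages in place of has_image. The build_search_url function is hypothetical, the other parameters are copied from the logged URL, and whether this change alone clears the 403 is an open question:

```python
from urllib.parse import urlencode

BASE_URL = "https://collections.museumsvictoria.com.au/api/search"


def build_search_url(page: int, licence: str = "cc by-nc-nd") -> str:
    """Build the search URL using the parameter name the API docs appear to use."""
    params = {
        # Assumption from reading the docs: "hasimages", not "has_image".
        "hasimages": "yes",
        "perpage": 100,
        "imagelicence": licence,
        "page": page,
    }
    # urlencode encodes spaces as "+", matching the logged URLs.
    return f"{BASE_URL}?{urlencode(params)}"


print(build_search_url(0))
```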
API key needed for NYPL:
[2022-02-25, 19:09:00 UTC] {dag_factory.py:130} INFO - Running provider function
[2022-02-25, 19:09:00 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
[2022-02-25, 19:09:01 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
[2022-02-25, 19:09:02 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
Missing API key for the Smithsonian:
_get_response_json(
https://api.si.edu/openaccess/api/v1.0/search,
{'api_key': None, 'rows': 1000, 'q': 'online_media_type:Images AND media_usage:CC0 AND hash:00*', 'start': 0},
retries=2)
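The 'api_key': None in the request parameters suggests the key was never configured, and the script only finds out after the retries run down. A minimal sketch of a fail-fast check, assuming keys are read from environment variables (the require_api_key helper, the exception class, and any variable names passed to it are hypothetical):

```python
import os


class MissingAPIKeyError(RuntimeError):
    """Raised when a provider script is started without its API key."""


def require_api_key(env_var: str, provider: str) -> str:
    """Return the key stored in env_var, or fail with a clear message."""
    key = os.environ.get(env_var)
    if not key:
        # Treat an unset variable and an empty string the same way.
        raise MissingAPIKeyError(
            f"{provider}: environment variable {env_var} is not set"
        )
    return key
```

Calling this at the top of a provider script would turn a run of retried authorization failures into a single, explicit task failure.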
The DAGs have been run! Let's start with the good news:
These provider DAGs were completely successful 🎉 🥳 :
We also successfully got the audio data from Wikimedia Commons, and a good chunk of data from Freesound (more on that later).
And now for the bad news. Outstanding issues:
These DAGs failed due to missing API keys:
These DAGs successfully pulled data, but failed in the loading step due to duplicate key errors:
And these DAGs had some more strange errors:
Freesound: the pull_data task successfully pulled thousands of records, but then failed during _get_audio_file_size. Tellingly, this function in the provider script contains the warning: "Freesound can be a bit finicky, so we want to retry it a few times" 😄
Another DAG sat on the INFO - Obtaining Images of building 0/Museovirasto/ step for 24 hours and ran out the timeout.
I'm going to create issues for the above and record them here so this can be a tracking issue.
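For the Freesound file-size failure, one option is a bounded retry that gives up on a single record instead of failing the whole task. This is only a sketch under that assumption; the retry helper below is hypothetical and is not the existing _get_audio_file_size logic:

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func up to `attempts` times, doubling the wait between tries.

    Returns None if every attempt raises, so the caller can skip the record.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:  # broad on purpose for a sketch; narrow this in real code
            if attempt < attempts:
                sleep(delay * 2 ** (attempt - 1))
    return None
```

A caller would wrap the size lookup in a zero-argument function or lambda and treat a None result as "size unknown" for that record.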
Update: I'll actually create a new tracking issue as I believe that work should be part of the v1.2.0 milestone. I'll link it back here once it's created.
Description
Some DAGs are broken. This can happen for a few reasons:
In any case, we should identify and (re)move currently-broken DAGs, perhaps by moving them to archive/ or somewhere else that indicates that they do not currently work but should be revised.