WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Audit Provider scripts and associated DAGs #1687

Closed by zackkrida 2 years ago

zackkrida commented 2 years ago

Description

Some DAGs are broken. This can happen for a few reasons:

In any case, we should identify and (re)move currently broken DAGs, perhaps by moving them to archive/ or somewhere else that indicates that they do not currently work but should be revised.

AetherUnbound commented 2 years ago

We could add a DAG tag onto them that shows they're borked, but I like the idea of removing them from the active DAG list for now until we can get them corrected.

zackkrida commented 2 years ago

We should mkdir dags/borked and ignore it in Airflow lol
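As a rough sketch of how that could work: Airflow skips paths under the DAGs folder that match a pattern listed in a `.airflowignore` file, and with the default regexp syntax each line is applied as a regular-expression search against the path. The snippet below only imitates that matching in plain Python to check that a single `borked/` pattern would hide the moved files. It does not call Airflow, and the file names are hypothetical.

```python
import re

# Hypothetical contents of dags/.airflowignore: one regex per line.
AIRFLOWIGNORE = """
# DAGs that are known to be broken
borked/
"""


def load_patterns(text):
    """Compile every non-empty, non-comment line as a regular expression."""
    lines = (line.strip() for line in text.splitlines())
    return [re.compile(line) for line in lines if line and not line.startswith("#")]


def is_ignored(relative_path, patterns):
    """Return True if any pattern is found anywhere in the path."""
    return any(pattern.search(relative_path) for pattern in patterns)


patterns = load_patterns(AIRFLOWIGNORE)

# Hypothetical file layout, relative to the DAGs folder.
candidates = ["borked/raw_pixel_workflow.py", "flickr_workflow.py"]
active = [path for path in candidates if not is_ignored(path, patterns)]
print(active)
```

Because the match is a search rather than a full match, a short pattern like `borked/` is enough to cover everything inside that directory.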

AetherUnbound commented 2 years ago

Fine by me :smile:

AetherUnbound commented 2 years ago

This is going to be related to the work in WordPress/openverse#1661

AetherUnbound commented 2 years ago

Looks like the Wikimedia Commons DAG isn't set to the right schedule :thinking: I may be interpreting things wrong, but I've created a ticket for it: WordPress/openverse#1643

AetherUnbound commented 2 years ago

Raw Pixel's API is giving us a 404:

[2022-02-24, 02:11:16 UTC] {dag_factory.py:130} INFO - Running provider function
[2022-02-24, 02:11:16 UTC] {raw_pixel.py:181} INFO - Begin: RawPixel API requests
[2022-02-24, 02:11:16 UTC] {raw_pixel.py:25} INFO - Processing request: https://api.rawpixel.com/api/v1/search
[2022-02-24, 02:11:17 UTC] {requester.py:49} WARNING - Unable to request URL: https://api.rawpixel.com/api/v1/search?freecc0=1&html=0&page=1.  Status code: 404
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:32} WARNING - Unable to request URL: https://api.rawpixel.com/api/v1/search. Status code: 404
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:187} INFO - Total images: 0
[2022-02-24, 02:11:17 UTC] {raw_pixel.py:191} INFO - Terminated!

Perhaps we need an API key? When I visit api.rawpixel.com I get "Access Denied".

AetherUnbound commented 2 years ago

It looks like we may need an API key for Walters as well:

[2022-02-24, 02:17:06 UTC] {requester.py:47} ERROR - Authorization failed for URL: https://api.thewalters.org/v1/objects?accept=json&pageSize=100&orderBy=classification&classification=Miniatures&Page=1
[2022-02-24, 02:17:06 UTC] {requester.py:83} WARNING - Bad response_json:  None
[2022-02-24, 02:17:06 UTC] {requester.py:84} WARNING - Retrying:
_get_response_json(
    https://api.thewalters.org/v1/objects,
    {'accept': 'json', 'pageSize': 100, 'orderBy': 'classification', 'classification': 'Miniatures', 'apikey': None, 'Page': 1},
    retries=-1)

stacimc commented 2 years ago

Looks like an API key is needed for the Brooklyn Museum as well:

[2022-02-25, 00:31:31 UTC] {brooklyn_museum.py:37} INFO - Begin: Brooklyn museum provider script
[2022-02-25, 00:31:31 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0.  Status code: 500
[2022-02-25, 00:31:32 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0.  Status code: 500
[2022-02-25, 00:31:33 UTC] {requester.py:49} WARNING - Unable to request URL: https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1&limit=35&offset=0.  Status code: 500

When I hit that endpoint I get a 500 with the message "Access denied. Your API key doesn't have sufficient privileges for this API operation."
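In the Walters retry log above, the missing key only shows up as `'apikey': None` buried in the request parameters. One possible way to make that failure louder is to check for the key before any request is made. This is a minimal sketch, not the provider scripts' actual code: `require_api_key`, `MissingApiKeyError`, and the `API_KEY_WALTERS` variable name are all hypothetical.

```python
import os


class MissingApiKeyError(RuntimeError):
    """Raised when a provider's API key is not configured."""


def require_api_key(env_var, environ=None):
    """Return the key stored under env_var, or raise before any request is made."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise MissingApiKeyError(f"{env_var} is not set; cannot query this provider")
    return key


def build_query_params(api_key, page=1):
    """Query parameters in the shape seen in the Walters retry log."""
    return {"accept": "json", "pageSize": 100, "apikey": api_key, "Page": page}


# Example using an explicit mapping instead of the real environment.
fake_env = {"API_KEY_WALTERS": "example-key"}  # hypothetical variable name
params = build_query_params(require_api_key("API_KEY_WALTERS", fake_env))
print(params["apikey"])
```

With a check like this, an unset key would fail the task immediately with a clear message instead of producing a run of authorization warnings and retries.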

stacimc commented 2 years ago

For each of the following, the pull_data task was successful but load_data failed due to duplicate key errors:
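To illustrate the kind of failure meant here, the sketch below reproduces a duplicate key error and shows one common way around it, an upsert. It uses an in-memory SQLite database as a stand-in for the real catalog database, and the table layout and key columns are assumptions made for the example, not the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE image ("
    " provider TEXT, foreign_identifier TEXT, url TEXT,"
    " PRIMARY KEY (provider, foreign_identifier))"  # assumed unique key
)

# Two pulled records that share the same key (hypothetical data).
rows = [
    ("example_provider", "abc", "https://example.com/1.jpg"),
    ("example_provider", "abc", "https://example.com/1-updated.jpg"),
]

# A plain INSERT fails on the second row with a duplicate key error.
plain_insert_failed = False
try:
    conn.executemany("INSERT INTO image VALUES (?, ?, ?)", rows)
except sqlite3.IntegrityError as err:
    conn.rollback()  # discard the partially inserted batch
    plain_insert_failed = True
    print("plain insert failed:", err)

# An upsert resolves the conflict by updating the existing row instead.
conn.executemany(
    "INSERT INTO image VALUES (?, ?, ?)"
    " ON CONFLICT (provider, foreign_identifier) DO UPDATE SET url = excluded.url",
    rows,
)
stored = conn.execute("SELECT url FROM image").fetchall()
print(stored)
```

Whether the real fix belongs in the load step or in deduplicating the pulled data beforehand depends on where the duplicates come from, which this thread does not establish.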

stacimc commented 2 years ago

Museum Victoria pulled 0 images, getting 403s:

[2022-02-25, 18:38:14 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0.  Status code: 403
[2022-02-25, 18:38:19 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0.  Status code: 403
[2022-02-25, 18:38:24 UTC] {requester.py:49} WARNING - Unable to request URL: https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100&imagelicence=cc+by-nc-nd&page=0.  Status code: 403

A 403 would imply we need an API key? But I also checked the API docs and noticed that the correct parameter looks like hasimages rather than has_image: https://collections.museumsvictoria.com.au/developers
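Since the same `requester.py` warning format shows up in all of these provider logs, a small script could tally failures by host and status code to make an audit like this quicker. This is only a sketch: the regular expression is inferred from the log lines pasted in this thread, and `tally_failures` is a hypothetical helper, not part of the codebase.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Sample warnings copied from the logs in this thread.
LOG_LINES = [
    "[2022-02-24, 02:11:17 UTC] {requester.py:49} WARNING - Unable to request URL: "
    "https://api.rawpixel.com/api/v1/search?freecc0=1&html=0&page=1.  Status code: 404",
    "[2022-02-25, 00:31:31 UTC] {requester.py:49} WARNING - Unable to request URL: "
    "https://www.brooklynmuseum.org/api/v2/object/?has_images=1&rights_type_permissive=1"
    "&limit=35&offset=0.  Status code: 500",
    "[2022-02-25, 18:38:14 UTC] {requester.py:49} WARNING - Unable to request URL: "
    "https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100"
    "&imagelicence=cc+by-nc-nd&page=0.  Status code: 403",
    "[2022-02-25, 18:38:19 UTC] {requester.py:49} WARNING - Unable to request URL: "
    "https://collections.museumsvictoria.com.au/api/search?has_image=yes&perpage=100"
    "&imagelicence=cc+by-nc-nd&page=0.  Status code: 403",
    "[2022-02-24, 02:11:17 UTC] {raw_pixel.py:187} INFO - Total images: 0",
]

# The URL is followed by a period, whitespace, and the status code.
STATUS_RE = re.compile(
    r"Unable to request URL: (?P<url>\S+?)\.\s+Status code: (?P<code>\d{3})"
)


def tally_failures(lines):
    """Count failed requests per (hostname, status code) pair."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            host = urlsplit(match["url"]).hostname
            counts[(host, int(match["code"]))] += 1
    return counts


failures = tally_failures(LOG_LINES)
for (host, code), count in sorted(failures.items()):
    print(f"{host}: {code} x{count}")
```

A status code alone does not settle the cause. As noted for Museum Victoria, a 403 could point to a missing key or to a wrong query parameter, so the tally is a starting point for triage rather than a diagnosis.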

stacimc commented 2 years ago

API key needed for NYPL:

[2022-02-25, 19:09:00 UTC] {dag_factory.py:130} INFO - Running provider function
[2022-02-25, 19:09:00 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
[2022-02-25, 19:09:01 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500
[2022-02-25, 19:09:02 UTC] {requester.py:47} ERROR - Authorization failed for URL: http://api.repo.nypl.org/api/v1/items/search?q=CC_0&field=use_rtxt_s&page=1&per_page=500

stacimc commented 2 years ago

Missing API key for the Smithsonian:

_get_response_json(
    https://api.si.edu/openaccess/api/v1.0/search,
    {'api_key': None, 'rows': 1000, 'q': 'online_media_type:Images AND media_usage:CC0 AND hash:00*', 'start': 0},
    retries=2)

stacimc commented 2 years ago

The DAGs have been run! Let's start with the good news:

These provider DAGs were completely successful 🎉 🥳 :

We also successfully got the audio data from Wikimedia Commons, and a good chunk of data from Freesound (more on that later).

And now for the bad news. Outstanding issues:

These DAGs failed due to missing API keys:

These DAGs successfully pulled data, but failed in the loading step due to duplicate key errors:

And these DAGs had some more strange errors:

I'm going to create issues for the above and record them here so this can be a tracking issue.

stacimc commented 2 years ago

Update: I'll actually create a new tracking issue as I believe that work should be part of the v1.2.0 milestone. I'll link it back here once it's created.