Closed obulat closed 1 year ago
Heh, what a great test set to have! 😄
Apologies for the late review -- this looks excellent!
I have been doing Flickr testing locally and happened to have ~1.55 million records locally 😅 I set the
meta_data
to null for all of them and the DAG ran successfully in just over six minutes :)
Thank you for your testing, it's more than what I could hope for, @stacimc 😆 I would not have guessed that the most popular license would be by-nd-nc/2.0/jp
I ran the DAG in prod, and this is the log output:
[2023-03-16, 03:45:26 UTC] {slack.py:321} INFO -
`add_license_url` DAG run completed.
Update statistics:
https://creativecommons.org/licenses/by-nc-sa/3.0/: 6318 rows
https://creativecommons.org/licenses/by-nd/3.0/: 201 rows
https://creativecommons.org/licenses/by-sa/3.0/: 1545 rows
https://creativecommons.org/licenses/by-nc/3.0/: 2607 rows
https://creativecommons.org/publicdomain/zero/1.0/: 2919 rows
https://creativecommons.org/publicdomain/mark/1.0/: 374 rows
https://creativecommons.org/licenses/by-nc-sa/1.0/: 166 rows
https://creativecommons.org/licenses/by-nc-sa/2.0/: 2165 rows
https://creativecommons.org/licenses/by-nc-sa/2.5/: 14 rows
https://creativecommons.org/licenses/by-sa/2.5/: 25 rows
https://creativecommons.org/licenses/by-sa/2.0/: 213 rows
https://creativecommons.org/licenses/by-nc/4.0/: 26 rows
https://creativecommons.org/licenses/by-sa/4.0/: 7 rows
https://creativecommons.org/licenses/by-nd-nc/2.0/jp/: 98 rows
https://creativecommons.org/licenses/by-nd/2.0/: 16 rows
https://creativecommons.org/licenses/by-sa/1.0/: 1 rows
Now, there are 9382 records with NULL meta_data left.
From the logs, here are some problems with these 9382 records:
[2023-03-16, 03:06:40 UTC] {add_license_url.py:102} INFO - No license pair (CC0, 1.0) in the license map.
[2023-03-16, 03:06:47 UTC] {add_license_url.py:102} INFO - No license pair (PDM, 1.0) in the license map.
[2023-03-16, 03:06:43 UTC] {add_license_url.py:102} INFO - No license pair (cc0, 3.0) in the license map.
[2023-03-16, 03:06:43 UTC] {add_license_url.py:102} INFO - No license pair (pdm, 3.0) in the license map.
Fixes
Fixes WordPress/openverse#1270
Description
This PR updates the second step in the DAG to replace many
SELECT
queries withSELECT
ing all items withNULL
in meta_data, then grouping of the items to update using Python function, and then running oneUPDATE
query per license pair (if necessary). Hopefully, this should be much faster than the previous run.This PR also sets the trigger rule for the last step, because otherwise it was skipped if the second step was skipped.
Testing instruction
Same as WordPress/openverse-catalog#1005.