WordPress / openverse-catalog

Identifies and collects data on cc-licensed content across web crawl data and public apis.
https://openverse.org
MIT License
59 stars 54 forks source link

Use Python to group items by license to speed up the query #1045

Closed obulat closed 1 year ago

obulat commented 1 year ago

Fixes

Fixes WordPress/openverse#1270

Description

This PR updates the second step in the DAG to replace many SELECT queries with SELECTing all items with NULL in meta_data, then grouping of the items to update using Python function, and then running one UPDATE query per license pair (if necessary). Hopefully, this should be much faster than the previous run.

This PR also sets the trigger rule for the last step, because otherwise it was skipped if the second step was skipped.

Testing instruction

Same as WordPress/openverse-catalog#1005.

AetherUnbound commented 1 year ago

Heh, what a great test set to have! 😄

obulat commented 1 year ago

Apologies for the late review -- this looks excellent!

I have been doing Flickr testing locally and happened to have ~1.55 million records locally 😅 I set the meta_data to null for all of them and the DAG ran successfully in just over six minutes :)

Screen Shot 2023-03-15 at 11 49 43 AM

Thank you for your testing, it's more than what I could hope for, @stacimc 😆 I would not have guessed that the most popular license would be by-nd-nc/2.0/jp

obulat commented 1 year ago

I ran the DAG in prod, and this is the log output:

[2023-03-16, 03:45:26 UTC] {slack.py:321} INFO - 
`add_license_url` DAG run completed.
Update statistics:
https://creativecommons.org/licenses/by-nc-sa/3.0/: 6318 rows
https://creativecommons.org/licenses/by-nd/3.0/: 201 rows
https://creativecommons.org/licenses/by-sa/3.0/: 1545 rows
https://creativecommons.org/licenses/by-nc/3.0/: 2607 rows
https://creativecommons.org/publicdomain/zero/1.0/: 2919 rows
https://creativecommons.org/publicdomain/mark/1.0/: 374 rows
https://creativecommons.org/licenses/by-nc-sa/1.0/: 166 rows
https://creativecommons.org/licenses/by-nc-sa/2.0/: 2165 rows
https://creativecommons.org/licenses/by-nc-sa/2.5/: 14 rows
https://creativecommons.org/licenses/by-sa/2.5/: 25 rows
https://creativecommons.org/licenses/by-sa/2.0/: 213 rows
https://creativecommons.org/licenses/by-nc/4.0/: 26 rows
https://creativecommons.org/licenses/by-sa/4.0/: 7 rows
https://creativecommons.org/licenses/by-nd-nc/2.0/jp/: 98 rows
https://creativecommons.org/licenses/by-nd/2.0/: 16 rows
https://creativecommons.org/licenses/by-sa/1.0/: 1 rows

Now, there are 9382 records with NULL meta_data left.

From the logs, here are some problems with these 9382 records:

  1. Incorrectly capitalized license name: "CC0" instead of "cc0", "PDM" instead of "pdm"
    [2023-03-16, 03:06:40 UTC] {add_license_url.py:102} INFO - No license pair (CC0, 1.0) in the license map.
    [2023-03-16, 03:06:47 UTC] {add_license_url.py:102} INFO - No license pair (PDM, 1.0) in the license map.
  2. Incorrect version number: CC0 and PDM only have version 1.0
    [2023-03-16, 03:06:43 UTC] {add_license_url.py:102} INFO - No license pair (cc0, 3.0) in the license map.
    [2023-03-16, 03:06:43 UTC] {add_license_url.py:102} INFO - No license pair (pdm, 3.0) in the license map.