WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Some recently updated images are missing `license_url` in the `meta_data` field #4318

Open krysal opened 1 month ago

krysal commented 1 month ago

Description

On 2024-05-08 UTC the batched_update DAG was triggered[^1] to fill license_url in the meta_data field with its corresponding value for rows WHERE license = 'by' AND license_version = '2.0'. It reported a successful finish on 2024-05-09 at 17:00:18 UTC, having updated 746,571 records. However, a run of the add_license_url DAG triggered on 2024-05-10 reported the same number of rows still missing the license URL, which suggests that some workflows are either not filling this field or are overwriting it.
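For reference, here is a minimal sketch of the value the backfill is expected to write. It assumes the standard Creative Commons URL layout for unported licenses and ignores jurisdiction ports. The names `build_license_url` and `with_license_url` are hypothetical and are not taken from the Openverse codebase.

```python
# Hypothetical helpers for illustration only; not the Openverse implementation.
# Assumes unported Creative Commons licenses (no jurisdiction segment).
PUBLIC_DOMAIN_PATHS = {"cc0": "zero", "pdm": "mark"}
CC_BASE = "https://creativecommons.org"


def build_license_url(license_slug: str, version: str) -> str:
    """Build the canonical Creative Commons URL for a license/version pair."""
    slug = license_slug.lower()
    if slug in PUBLIC_DOMAIN_PATHS:
        # Public domain tools live under /publicdomain/ rather than /licenses/.
        return f"{CC_BASE}/publicdomain/{PUBLIC_DOMAIN_PATHS[slug]}/{version}/"
    return f"{CC_BASE}/licenses/{slug}/{version}/"


def with_license_url(meta_data, license_slug: str, version: str) -> dict:
    """Return a copy of meta_data with license_url filled in when absent or null."""
    merged = dict(meta_data or {})
    if not merged.get("license_url"):
        merged["license_url"] = build_license_url(license_slug, version)
    return merged


print(build_license_url("by", "2.0"))
print(with_license_url(None, "cc0", "1.0"))
```

A row counts as missing the value when the key is absent or null, which mirrors the `meta_data->>'license_url' IS NULL` condition used in the queries in this issue.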

Flickr is confirmed to be in the set of rows missing this value.

SELECT source, provider, created_on, updated_on FROM image 
 WHERE license = 'by' AND license_version = '2.0' AND meta_data->>'license_url' IS NULL LIMIT 2;

+--------+----------+-------------------------------+-------------------------------+
| source | provider | created_on                    | updated_on                    |
|--------+----------+-------------------------------+-------------------------------|
| flickr | flickr   | 2020-04-28 07:20:32.183578+00 | 2024-05-12 03:13:46.696867+00 |
| flickr | flickr   | 2020-04-28 07:08:30.821693+00 | 2024-05-12 03:13:46.696867+00 |
+--------+----------+-------------------------------+-------------------------------+

Whether other providers are affected is still to be confirmed. The Flickr DAG is known to have been running on those days, as were Europeana, the Finnish Museum, Wikimedia Commons, and the Metropolitan Museum.
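To illustrate the "overwriting" hypothesis, the sketch below contrasts two ways a re-ingestion could write meta_data. It is only an assumption about how the field might be lost. This issue does not establish which strategy, if either, the provider DAGs use, and the function names are hypothetical.

```python
# Hypothetical illustration; not the actual Openverse upsert logic.

def upsert_replace(existing_meta: dict, incoming_meta: dict) -> dict:
    """Overwrite strategy: the incoming value replaces the stored one wholesale."""
    return dict(incoming_meta)


def upsert_merge(existing_meta: dict, incoming_meta: dict) -> dict:
    """Merge strategy: incoming keys win, keys absent from incoming survive."""
    return {**existing_meta, **incoming_meta}


# A row after the backfill, then a re-ingestion that carries no license_url.
backfilled = {
    "license_url": "https://creativecommons.org/licenses/by/2.0/",
    "views": 10,
}
reingested = {"views": 12}

replaced = upsert_replace(backfilled, reingested)
merged = upsert_merge(backfilled, reingested)

print("license_url" in replaced)  # False: the backfilled key is gone
print("license_url" in merged)    # True: the backfilled key is kept
```

Even with a merge, an incoming record that explicitly carries a null `license_url` would still clobber the backfilled value, so both cases would need checking.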


Screenshot of DAG reports on Thursday, May 9th. Time is in VET.

[^1]: Link only available to maintainers.

Additional context

Discovered while working on #3885.

krysal commented 3 weeks ago

On Friday I started another run of the add_license_url DAG and confirmed that the problem recurred for a set of rows. More precisely, we got almost exactly the same count of missing license URLs as in the previous run.

Starting `add_license_url` DAG. Found 8 license groups with 4919778
records to back fill `license_url` in `meta_data`.
Count per license-version:
╭────┬───────────┬───────────┬───────────╮
│    │ license   │   version │     count │
├────┼───────────┼───────────┼───────────┤
│  0 │ by        │       2.0 │   746,571 │
├────┼───────────┼───────────┼───────────┤
│  1 │ by-nc     │       2.0 │   619,675 │
├────┼───────────┼───────────┼───────────┤
│  2 │ by-nc-nd  │       2.0 │ 1,280,516 │
├────┼───────────┼───────────┼───────────┤
│  3 │ by-nc-sa  │       2.0 │ 1,316,664 │
├────┼───────────┼───────────┼───────────┤
│  4 │ by-nd     │       2.0 │   374,017 │
├────┼───────────┼───────────┼───────────┤
│  5 │ by-sa     │       2.0 │   448,449 │
├────┼───────────┼───────────┼───────────┤
│  6 │ cc0       │       1.0 │    96,493 │
├────┼───────────┼───────────┼───────────┤
│  7 │ pdm       │       1.0 │    37,350 │
╰────┴───────────┴───────────┴───────────╯

Looking at which rows are involved, this is the distribution by provider and source. Flickr accounts for most of them.

openledger=> SELECT provider, source, COUNT(*) FROM image 
WHERE meta_data->>'license_url' IS NULL 
    AND NOT (license = 'pdm' AND license_version = '4.0')
GROUP BY 1, 2;

    provider     |     source      |  count
-----------------+-----------------+---------
 clevelandmuseum | clevelandmuseum |       3
 flickr          | bio_diversity   |     369
 flickr          | flickr          | 4914694
 flickr          | nasa            |     166
 met             | met             |    4500
(5 rows)
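To put a number on Flickr's share, the counts from the query result above can be aggregated per provider. This is a small sketch over the figures copied from that output. The variable names are illustrative only.

```python
from collections import defaultdict

# (provider, source) -> count, copied from the query result in this comment.
counts = {
    ("clevelandmuseum", "clevelandmuseum"): 3,
    ("flickr", "bio_diversity"): 369,
    ("flickr", "flickr"): 4_914_694,
    ("flickr", "nasa"): 166,
    ("met", "met"): 4_500,
}

total = sum(counts.values())

# Collapse the sources into their provider.
by_provider = defaultdict(int)
for (provider, _source), count in counts.items():
    by_provider[provider] += count

for provider, count in sorted(by_provider.items(), key=lambda kv: -kv[1]):
    print(f"{provider:<16} {count:>9,} {count / total:8.3%}")
```

With these figures, the three Flickr sources together account for 4,915,229 of 4,919,732 rows, which is over 99.9% of the rows missing `license_url`.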

Querying the Cleveland Museum rows, I confirmed that the updated_on date reflects the day of the add_license_url run (which took 8 hours), but the Cleveland DAG also ran on May 31st. All three rows are marked as removed_from_source = t (true), though. More investigation is needed to find the exact cause.

-- Rows from clevelandmuseum
              identifier              | license | license_version |          created_on          |          updated_on           | removed_from_source 
--------------------------------------+---------+-----------------+------------------------------+-------------------------------+---------------------
 4ea23a17-5528-4e22-ba4f-d51bdfd1515a | cc0     | 1.0             | 2019-01-08 18:24:53.33183+00 | 2024-05-28 22:07:06.613201+00 | t
 253bdd81-64a7-458d-bdb5-a1c4e0027c1f | cc0     | 1.0             | 2019-01-08 18:24:53.33183+00 | 2024-05-28 22:06:28.443214+00 | t
 64300d6b-85f7-48ee-a84b-9870aeaaf568 | cc0     | 1.0             | 2019-08-12 16:00:40.10698+00 | 2024-05-28 22:06:28.443214+00 | t