Open krysal opened 1 month ago
On Friday I started another run of the `add_license_url` DAG and confirmed a relapse in a set of rows. More precisely, we got almost exactly the same count of missing license URLs as in the previous run.
Starting `add_license_url` DAG. Found 8 license groups with 4919778
records to back fill `license_url` in `meta_data`.
Count per license-version:
╭────┬───────────┬───────────┬───────────╮
│ │ license │ version │ count │
├────┼───────────┼───────────┼───────────┤
│ 0 │ by │ 2.0 │ 746,571 │
├────┼───────────┼───────────┼───────────┤
│ 1 │ by-nc │ 2.0 │ 619,675 │
├────┼───────────┼───────────┼───────────┤
│ 2 │ by-nc-nd │ 2.0 │ 1,280,516 │
├────┼───────────┼───────────┼───────────┤
│ 3 │ by-nc-sa │ 2.0 │ 1,316,664 │
├────┼───────────┼───────────┼───────────┤
│ 4 │ by-nd │ 2.0 │ 374,017 │
├────┼───────────┼───────────┼───────────┤
│ 5 │ by-sa │ 2.0 │ 448,449 │
├────┼───────────┼───────────┼───────────┤
│ 6 │ cc0 │ 1.0 │ 96,493 │
├────┼───────────┼───────────┼───────────┤
│ 7 │ pdm │ 1.0 │ 37,350 │
╰────┴───────────┴───────────┴───────────╯
Looking at which rows are involved, this is the distribution by provider and source. Flickr accounts for nearly all of them.
openledger=> SELECT provider, source, COUNT(*) FROM image
WHERE meta_data->>'license_url' IS NULL
AND NOT (license = 'pdm' AND license_version = '4.0')
GROUP BY 1, 2;
provider | source | count
-----------------+-----------------+---------
clevelandmuseum | clevelandmuseum | 3
flickr | bio_diversity | 369
flickr | flickr | 4914694
flickr | nasa | 166
met | met | 4500
(5 rows)
Querying the Cleveland Museum rows, I confirmed that the `updated_on` date reflects the day of the `add_license_url` run (which lasted 8 hours), but the Cleveland DAG also ran on May 31st. All three rows are marked as `removed_from_source=t` (true), though. More investigation is pending to find the exact cause.
-- Rows from clevelandmuseum
identifier | license | license_version | created_on | updated_on | removed_from_source
--------------------------------------+---------+-----------------+------------------------------+-------------------------------+---------------------
4ea23a17-5528-4e22-ba4f-d51bdfd1515a | cc0 | 1.0 | 2019-01-08 18:24:53.33183+00 | 2024-05-28 22:07:06.613201+00 | t
253bdd81-64a7-458d-bdb5-a1c4e0027c1f | cc0 | 1.0 | 2019-01-08 18:24:53.33183+00 | 2024-05-28 22:06:28.443214+00 | t
64300d6b-85f7-48ee-a84b-9870aeaaf568 | cc0 | 1.0 | 2019-08-12 16:00:40.10698+00 | 2024-05-28 22:06:28.443214+00 | t
Description
On 2024-05-08 UTC the `batched_update` DAG was triggered[^1] to fill the `license_url` in the `meta_data` field with its corresponding value for rows `WHERE license = 'by' AND license_version = '2.0'`, and it reported a successful end on 2024-05-09, 17:00:18 UTC, updating 746,571 records. However, a run of the `add_license_url` DAG triggered on 2024-05-10 reported the same number of rows missing the license URL, which indicates that some workflows may not be filling this field or are overwriting it.

Flickr is confirmed to be in the set of rows missing this value. Whether other providers are affected is still to be confirmed. It is known that the Flickr DAG was running on those days, as were Europeana, the Finnish Museum, Wikimedia Commons, and the Metropolitan Museum.
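For reference, here is a minimal sketch of how the "corresponding value" for each license group could be derived. The helper `build_license_url` is hypothetical and is not the function the DAGs actually use. It only assumes the public Creative Commons URL layout (`/licenses/<license>/<version>/` for the CC licenses, `/publicdomain/zero/` for `cc0`, `/publicdomain/mark/` for `pdm`).

```python
CC_BASE = "https://creativecommons.org"


def build_license_url(license_: str, version: str) -> str:
    """Hypothetical helper: map a (license, version) pair to its CC URL.

    Assumes the standard creativecommons.org path layout; it is not the
    catalog's own implementation.
    """
    if license_ == "cc0":
        return f"{CC_BASE}/publicdomain/zero/{version}/"
    if license_ == "pdm":
        return f"{CC_BASE}/publicdomain/mark/{version}/"
    return f"{CC_BASE}/licenses/{license_}/{version}/"


# The eight license groups reported by the `add_license_url` run.
groups = [
    ("by", "2.0"),
    ("by-nc", "2.0"),
    ("by-nc-nd", "2.0"),
    ("by-nc-sa", "2.0"),
    ("by-nd", "2.0"),
    ("by-sa", "2.0"),
    ("cc0", "1.0"),
    ("pdm", "1.0"),
]

for license_, version in groups:
    print(license_, version, build_license_url(license_, version))
```

A mapping like this could be used to spot-check rows after a run, by comparing the stored `meta_data->>'license_url'` against the URL expected for the row's `license` and `license_version`.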
[^1]: Link only available to maintainers.
Additional context
Discovered while working on #3885.