Open krysal opened 4 months ago
Since Europeana is an aggregator, I suspect that all of the images from this particular source might have been affected (given they're all producing the same not_found.txt thumbnail: https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7).
I've run the following to see how pervasive this issue is:
deploy@localhost:openledger> select count(*) from image where provider='europeana' and STARTS_WITH(url, 'https://www.muis.ee');
+--------+
| count |
|--------|
| 143600 |
+--------+
SELECT 1
Time: 313.911s (5 minutes 13 seconds), executed in: 313.905s (5 minutes 13 seconds)
This seems like something that could be addressed in a batched update, if we could figure out how to correct the URLs!
Diving into the result above, it looks like all of the related URLs differ now:
Because these are all unique UUIDs, it doesn't look like we can derive those values in a way that could be updated using the batched update 😞 Maybe the best option would be to use the additional_query_parameters added in #3648 to select only images from this domain (Estonian National Museum) and reingest those specifically to get the new URLs? What do you think @WordPress/openverse-catalog?
Maybe the best option would be to use the additional_query_parameters added in https://github.com/WordPress/openverse/pull/3648 to select only images from this domain (Estonian National Museum) and reingest those specifically to get the new URLs?
Following up in this thread from an in-person conversation: I think this sounds good, but noting that because Europeana does not have a traditional reingestion DAG we'd want to look into whether there's a reasonable range of dates we could re-run the DAG for to cover all images from this domain.
I believe I've found a suitable additional_query_parameters value that will allow us to select only the Estonian National Museum data! Currently the dated portion of the DAG configuration goes directly into the query field, and this is exactly the field that we can override with additional_query_parameters! That means that it doesn't really matter for us that the DAG is dated in this case 😄 I tested an API call with the following and it seemed to work. I'm now running a locally triggered DAG with these values and will share whether that works.
additional_query_parameters override: {'query': 'DATA_PROVIDER:("Estonian National Museum")'}
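To illustrate why the override makes the dated nature of the DAG irrelevant, here is a minimal sketch. It assumes the DAG builds a base parameter dict and applies the overrides on top with a plain dict update; the function name `build_query_params` and the date-bounded base `query` string are hypothetical and not taken from the actual Openverse code.

```python
from typing import Optional


def build_query_params(
    base_params: dict, additional_query_parameters: Optional[dict] = None
) -> dict:
    """Return a copy of base_params with any overrides applied on top."""
    params = dict(base_params)  # copy so the base configuration is not mutated
    params.update(additional_query_parameters or {})
    return params


# Hypothetical date-bounded query, standing in for what a dated DAG run sends.
base = {
    "query": "timestamp_update:[2024-01-01T00:00:00Z TO 2024-01-02T00:00:00Z]",
    "rows": "100",
}
override = {"query": 'DATA_PROVIDER:("Estonian National Museum")'}

params = build_query_params(base, override)
print(params["query"])
```

Because the override replaces the whole `query` value, the date range is dropped entirely, while untouched keys such as `rows` pass through unchanged.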
Confirmed that that should work! I ran this locally and ingested 250 records, all of which were from the Estonian National Museum. We should be able to run this triggered DAG next week!
openledger> select count(*) from image where provider='europeana';
+-------+
| count |
|-------|
| 250 |
+-------+
SELECT 1
Time: 0.023s
openledger> select identifier, meta_data from image where provider='europeana' limit 10;
+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------+
| identifier | meta_data
|
|--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------|
| ce8a15ad-712f-4434-8b9c-d97b89b8f7a8 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 1697095d-74b9-46c9-92d7-1f7443a87b90 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| a136209d-eb0d-44ea-9e68-7e27143e1581 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 196db965-be4c-4944-bcea-61c47592b4a4 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 07186a32-4e56-49c1-bc1c-c69cbcb03448 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| ec680b1a-29af-4138-a05e-6e5e3eb1ce55 | {"country": ["Estonia"], "description": "sündmuse kommentaar: Eesti Apostliku Õigeusu kiriku Tartu Püha Aleksandri kogudus", "license_url": "https://creativecommons.org/pu
blicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 4b2cca42-80f4-412e-b355-1c3efa06aa3b | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 5fe5cf4d-30cc-46f4-85dd-6420ac7b04c2 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 84d18904-f90c-41d9-ace6-279b9a4e946e | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| ce2ebde1-c4fd-418e-b343-d191ea984b14 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license
_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------+
SELECT 10
The ingestion completed (DAG run link), but only ingested 21,020 records 😕 We did get a data refresh after that, but even with the updated record the primary URL is still showing the "not found" thumbnail 😖 https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/
I'm going to try and see if I can isolate this exact result in a query, and see if Europeana is giving us incorrect URLs.
I've narrowed down a set of query parameters that reflects the images affected above:
{'wskey': '[redacted]',
 'profile': 'rich',
 'reusability': ['open', 'restricted'],
 'sort': ['europeana_id+desc', 'timestamp_created+desc'],
 'rows': '100',
 'media': 'true',
 'start': 1,
 'qf': ['TYPE:IMAGE',
        'provider_aggregation_edm_isShownBy:*',
        'DATA_PROVIDER:("Estonian National Museum")'],
 'query': 'varrukad, hame, naiste',
 'cursor': '*'}
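For anyone who wants to reproduce the request outside the DAG, here is a rough sketch of how these parameters could be turned into a request URL. The endpoint below is my assumption of the Europeana Search API address (it isn't stated in this thread), and `YOUR_API_KEY` is a placeholder for the redacted key.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Europeana Search API; not stated in the thread.
SEARCH_ENDPOINT = "https://api.europeana.eu/record/v2/search.json"

params = {
    "wskey": "YOUR_API_KEY",  # placeholder for the redacted key
    "profile": "rich",
    "reusability": ["open", "restricted"],
    "sort": ["europeana_id+desc", "timestamp_created+desc"],
    "rows": "100",
    "media": "true",
    "start": 1,
    "qf": [
        "TYPE:IMAGE",
        "provider_aggregation_edm_isShownBy:*",
        'DATA_PROVIDER:("Estonian National Museum")',
    ],
    "query": "varrukad, hame, naiste",
    "cursor": "*",
}

# doseq=True emits one key=value pair per list item, so `qf` appears
# three times in the query string instead of being sent as a single list.
url = f"{SEARCH_ENDPOINT}?{urlencode(params, doseq=True)}"
print(url)
```

The printed URL can be pasted into a browser or passed to any HTTP client once a real key is substituted.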
This returns 7 results, one of which is the result shared above. Here are the full contents of the response body for that result:
{'completeness': 5,
'country': ['Estonia'],
'dataProvider': ['Estonian National Museum'],
'dcCreator': ['Danilova, Marfa (valmistaja)'],
'dcCreatorLangAware': {'et': ['Danilova, Marfa (valmistaja)']},
'dcSubjectLangAware': {'def': ['http://data.europeana.eu/concept/2585'],
'et': ['särk']},
'dcTitleLangAware': {'en': ['sleeves, hame, women'],
'et': ['varrukad, hame, naiste']},
'dcTypeLangAware': {'def': ['http://data.europeana.eu/concept/2585'],
'et': ['särk']},
'edmConcept': ['http://data.europeana.eu/concept/2585'],
'edmConceptLabel': [{'def': 'Hemd'},
{'def': 'Рубашка'},
{'def': 'Paita'},
{'def': 'Camisa'},
{'def': 'Риза'},
{'def': 'Marškiniai'},
{'def': 'Krekls'},
{'def': 'Košulja'},
{'def': 'Chemise'},
{'def': 'Ing'},
{'def': 'Košeľa'},
{'def': 'Léine'},
{'def': 'Camisa'},
{'def': 'Skjorta'},
{'def': 'Πουκάμισο'},
{'def': 'Shirt'},
{'def': 'Camicia'},
{'def': 'Camisa'},
{'def': 'Särk'},
{'def': 'Alkandora'},
{'def': 'Košile'},
{'def': 'Koszula'},
{'def': 'Cămașă'},
{'def': 'Skjorte'},
{'def': 'Overhemd'}],
'edmConceptPrefLabelLangAware': {'de': ['Hemd'],
'ru': ['Рубашка'],
'fi': ['Paita'],
'pt': ['Camisa'],
'bg': ['Риза'],
'lt': ['Marškiniai'],
'lv': ['Krekls'],
'hr': ['Košulja'],
'fr': ['Chemise'],
'hu': ['Ing'],
'sk': ['Košeľa'],
'ga': ['Léine'],
'ca': ['Camisa'],
'sv': ['Skjorta'],
'el': ['Πουκάμισο'],
'en': ['Shirt'],
'it': ['Camicia'],
'es': ['Camisa'],
'et': ['Särk'],
'eu': ['Alkandora'],
'cs': ['Košile'],
'pl': ['Koszula'],
'ro': ['Cămașă'],
'da': ['Skjorte'],
'nl': ['Overhemd']},
'edmDatasetName': ['401_Muuseumid'],
'edmIsShownAt': ['https://www.muis.ee/museaalView/534165'],
'edmIsShownBy': ['https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7'],
'edmPreview': ['https://api.europeana.eu/thumbnail/v2/url.json?uri=https%3A%2F%2Fwww.muis.ee%2Fdigitaalhoidla%2Fapi%2Fmeedia%2Foriginaal%3Fid%3D7c0829e9-1731-4ad1-894f-7980bb09f3c7&type=IMAGE'],
'europeanaCollectionName': ['401_Muuseumid'],
'europeanaCompleteness': 5,
'guid': 'https://www.europeana.eu/item/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT?utm_source=api&utm_medium=api&utm_campaign=dialialika',
'id': '/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT',
'index': 0,
'language': ['et'],
'link': 'https://api.europeana.eu/record/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT.json?wskey=dialialika',
'organizations': ['http://data.europeana.eu/organization/1482250000000435049',
'http://data.europeana.eu/organization/1482250000026719048'],
'previewNoDistribute': False,
'provider': ['Estonian e-Repository and Conservation of Collections'],
'rights': ['http://creativecommons.org/publicdomain/zero/1.0/'],
'score': 246.3116,
'timestamp': 1688490887425,
'timestamp_created': '2022-05-10T08:10:51.546Z',
'timestamp_created_epoch': 1652170251546,
'timestamp_update': '2022-05-10T08:10:51.546Z',
'timestamp_update_epoch': 1652170251546,
'title': ['varrukad, hame, naiste', 'sleeves, hame, women'],
'type': 'IMAGE',
'ugc': [False]}
We use the edmIsShownBy value for our URL, and indeed this value, which is returned from Europeana, is redirecting to the "not found" image. @Hobbesball - would you happen to have any insight on this?
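As a quick way to check other records for the same problem, here is a small sketch that pulls the `edmIsShownBy` value out of a result and, optionally, follows redirects to see where the URL ends up. The helper names `image_url_from_record` and `final_url` are made up for illustration and are not part of the Openverse ingestion code.

```python
from typing import Optional
from urllib.request import urlopen


def image_url_from_record(record: dict) -> Optional[str]:
    """Return the first edmIsShownBy value, or None if it is absent or empty."""
    values = record.get("edmIsShownBy") or []
    return values[0] if values else None


def final_url(url: str, timeout: float = 10.0) -> str:
    """Follow redirects and return the URL the request finally lands on."""
    with urlopen(url, timeout=timeout) as response:
        return response.geturl()


# Trimmed copy of the result above, keeping only the link fields.
record = {
    "edmIsShownAt": ["https://www.muis.ee/museaalView/534165"],
    "edmIsShownBy": [
        "https://www.muis.ee/digitaalhoidla/api/meedia/originaal"
        "?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7"
    ],
}

url = image_url_from_record(record)
print(url)

# To see where the URL ends up (this makes a network request):
# print(final_url(url))
```

Comparing the output of `final_url` against the original value would show whether a given record is being redirected.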
I've emailed the folks at Europeana directly to ask them about this issue.
Description
Apparently, some Europeana images can change their direct link while remaining available through their landing page. This is a problem for us because it seems the Data Refresh process is not updating this value (I haven't confirmed it).
Observe this image for example: https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/
Reproduction
Screenshots
Additional context
Sentry issue.