WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Europeana images may change their direct URL, which causes broken images to be displayed in Openverse #3772

Open krysal opened 4 months ago

krysal commented 4 months ago

Description

Apparently, some Europeana images can change their direct link while remaining available through their landing page. This is a problem for us because it seems the Data Refresh process is not updating this value (I haven't confirmed it).

Observe this image for example: https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/

{
    "id": "f8c86a20-eb9c-4ffc-9a06-3664151dbce6",
    "title": "varrukad, hame, naiste",
    "indexed_on": "2022-11-20T17:08:56.418834Z",
    "foreign_landing_url": "https://www.muis.ee/museaalView/534165",
    "url": "https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7",
    "creator": null,
    "creator_url": null,
    "license": "cc0",
    "license_version": "1.0",
    "license_url": "https://creativecommons.org/publicdomain/zero/1.0/",
    "provider": "europeana",
    "source": "europeana",
    "category": null,
    "filesize": null,
    "filetype": null,
    "tags": [],
    "attribution": "\"varrukad, hame, naiste\" is marked with CC0 1.0. To view the terms, visit https://creativecommons.org/publicdomain/zero/1.0/.",
    "fields_matched": [],
    "mature": false,
    "height": null,
    "width": null,
    "thumbnail": "https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/thumb/",
    "detail_url": "https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/",
    "related_url": "https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/related/",
    "unstable__sensitivity": []
}

Reproduction

  1. Go to https://openverse.org/search/image?q=varrukad,%20hame,%20naiste
  2. See that all images found have broken thumbnails.

Screenshots

[Screenshot: CleanShot 2024-02-08 at 17 17 39@2x]

Additional context

Sentry issue.

AetherUnbound commented 4 months ago

Since Europeana is an aggregator, I suspect that all of the images from this particular source might have been affected (given they're all producing the same not_found.txt thumbnail: https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7).

I've run the following to see how pervasive this issue is:

deploy@localhost:openledger> select count(*) from image where provider='europeana' and STARTS_WITH(url, 'https://www.muis.ee');
+--------+
| count  |
|--------|
| 143600 |
+--------+
SELECT 1
Time: 313.911s (5 minutes 13 seconds), executed in: 313.905s (5 minutes 13 seconds)

This seems like something that could be addressed in a batched update, if we could figure out how to correct the URLs!

AetherUnbound commented 4 months ago

Diving into the result above, it looks like all of the related URLs differ now:

Because these are all unique UUIDs, it doesn't look like we can derive those values in a way that could be updated using the batched update 😞 Maybe the best option would be to use the additional_query_parameters added in #3648 to select only images from this domain (Estonian National Museum) and reingest those specifically to get the new URLs? What do you think @WordPress/openverse-catalog?

stacimc commented 4 months ago

Maybe the best option would be to use the additional_query_parameters added in https://github.com/WordPress/openverse/pull/3648 to select only images from this domain (Estonian National Museum) and reingest those specifically to get the new URLs?

Following up in this thread from an in-person conversation: I think this sounds good, but noting that because Europeana does not have a traditional reingestion DAG we'd want to look into whether there's a reasonable range of dates we could re-run the DAG for to cover all images from this domain.

AetherUnbound commented 4 months ago

I believe I've found a suitable additional_query_parameters value that will allow us to select only the Estonian National Museum data! Currently the dated portion of the DAG configuration goes directly into the query field, which is exactly the field that we can override with additional_query_parameters! That means that it doesn't really matter for us that the DAG is dated in this case 😄 I tested an API call with the following and it seemed to work. I'm now running a locally triggered DAG with these values and will share whether that works.

additional_query_parameters override: {'query': 'DATA_PROVIDER:("Estonian National Museum")'}
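To illustrate why the override makes the DAG's date window irrelevant here, the following is a minimal sketch, not the actual DAG code. The function name `build_query_params` and the `timestamp_update:[... TO ...]` default are assumptions made for illustration; only the `query` field and the `additional_query_parameters` override come from the discussion above.

```python
from typing import Any, Optional


def build_query_params(
    base_params: dict,
    start: str,
    end: str,
    additional_query_parameters: Optional[dict] = None,
) -> dict:
    """Return the query parameters for one request (hypothetical helper).

    The date window is written into ``query`` first; any key present in
    ``additional_query_parameters`` then replaces the default value.
    """
    params: dict[str, Any] = dict(base_params)  # copy so the caller's dict is untouched
    # Assumed default: restrict results to records updated inside the window.
    params["query"] = f"timestamp_update:[{start} TO {end}]"
    # Overrides win, so a supplied ``query`` discards the date filter entirely.
    params.update(additional_query_parameters or {})
    return params


base = {"profile": "rich", "rows": "100"}
override = {"query": 'DATA_PROVIDER:("Estonian National Museum")'}

dated = build_query_params(base, "2024-02-01T00:00:00Z", "2024-02-02T00:00:00Z")
overridden = build_query_params(
    base, "2024-02-01T00:00:00Z", "2024-02-02T00:00:00Z", override
)
```

With the override supplied, `overridden["query"]` holds only the `DATA_PROVIDER` filter, so under these assumptions the run would select records by data provider regardless of the date it was triggered for.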

AetherUnbound commented 4 months ago

Confirmed that that should work! I ran this locally and ingested 250 records, all of which were from the Estonian National Museum. We should be able to run this triggered DAG next week!

openledger> select count(*) from image where provider='europeana';
+-------+
| count |
|-------|
| 250   |
+-------+
SELECT 1
Time: 0.023s
openledger> select identifier, meta_data from image where provider='europeana' limit 10;
+--------------------------------------+-----------+
| identifier                           | meta_data |
|--------------------------------------+-----------|
| ce8a15ad-712f-4434-8b9c-d97b89b8f7a8 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 1697095d-74b9-46c9-92d7-1f7443a87b90 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| a136209d-eb0d-44ea-9e68-7e27143e1581 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 196db965-be4c-4944-bcea-61c47592b4a4 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 07186a32-4e56-49c1-bc1c-c69cbcb03448 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| ec680b1a-29af-4138-a05e-6e5e3eb1ce55 | {"country": ["Estonia"], "description": "sündmuse kommentaar: Eesti Apostliku Õigeusu kiriku Tartu Püha Aleksandri kogudus", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 4b2cca42-80f4-412e-b355-1c3efa06aa3b | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 5fe5cf4d-30cc-46f4-85dd-6420ac7b04c2 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| 84d18904-f90c-41d9-ace6-279b9a4e946e | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
| ce2ebde1-c4fd-418e-b343-d191ea984b14 | {"country": ["Estonia"], "description": "", "license_url": "https://creativecommons.org/publicdomain/zero/1.0/", "dataProvider": ["Estonian National Museum"], "raw_license_url": "http://creativecommons.org/publicdomain/zero/1.0/"} |
+--------------------------------------+-----------+
SELECT 10

AetherUnbound commented 3 months ago

The ingestion completed (DAG run link), but only ingested 21,020 records 😕 We did get a data refresh after that, but even with the updated record the primary URL is still showing the "not found" thumbnail 😖 https://api.openverse.engineering/v1/images/f8c86a20-eb9c-4ffc-9a06-3664151dbce6/

I'm going to try and see if I can isolate this exact result in a query, and see if Europeana is giving us incorrect URLs.

AetherUnbound commented 3 months ago

I've narrowed down a set of query parameters that reflects the images affected above:

{'wskey': '[redacted]',
 'profile': 'rich',
 'reusability': ['open', 'restricted'],
 'sort': ['europeana_id+desc', 'timestamp_created+desc'],
 'rows': '100',
 'media': 'true',
 'start': 1,
 'qf': ['TYPE:IMAGE',
  'provider_aggregation_edm_isShownBy:*',
  'DATA_PROVIDER:("Estonian National Museum")'],
 'query': 'varrukad, hame, naiste',
 'cursor': '*'}
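For anyone wanting to reproduce a request with parameters like these, here is a rough sketch using only the standard library. The endpoint URL is an assumption (check the Europeana Search API documentation before relying on it), `YOUR_API_KEY` is a placeholder for the redacted key, and `sort` and `start` are left out for brevity.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint for the Europeana Search API; verify before use.
SEARCH_ENDPOINT = "https://api.europeana.eu/record/v2/search.json"

params = {
    "wskey": "YOUR_API_KEY",  # placeholder, the real key is redacted
    "profile": "rich",
    "reusability": ["open", "restricted"],
    "rows": "100",
    "media": "true",
    "qf": [
        "TYPE:IMAGE",
        "provider_aggregation_edm_isShownBy:*",
        'DATA_PROVIDER:("Estonian National Museum")',
    ],
    "query": "varrukad, hame, naiste",
    "cursor": "*",
}

# doseq=True expands each list into repeated parameters (qf=...&qf=...),
# instead of encoding the Python list's string representation.
request_url = f"{SEARCH_ENDPOINT}?{urlencode(params, doseq=True)}"

# Round-trip the query string to confirm every filter survived encoding.
decoded = parse_qs(urlsplit(request_url).query)
```

The `doseq=True` flag matters because `qf` and `reusability` carry several values each; without it the lists would be sent as a single garbled value.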

These parameters return 7 results, one of which is the result shared above. Here are the full contents of that result from the response body:

{'completeness': 5,
 'country': ['Estonia'],
 'dataProvider': ['Estonian National Museum'],
 'dcCreator': ['Danilova, Marfa (valmistaja)'],
 'dcCreatorLangAware': {'et': ['Danilova, Marfa (valmistaja)']},
 'dcSubjectLangAware': {'def': ['http://data.europeana.eu/concept/2585'],
  'et': ['särk']},
 'dcTitleLangAware': {'en': ['sleeves, hame, women'],
  'et': ['varrukad, hame, naiste']},
 'dcTypeLangAware': {'def': ['http://data.europeana.eu/concept/2585'],
  'et': ['särk']},
 'edmConcept': ['http://data.europeana.eu/concept/2585'],
 'edmConceptLabel': [{'def': 'Hemd'},
  {'def': 'Рубашка'},
  {'def': 'Paita'},
  {'def': 'Camisa'},
  {'def': 'Риза'},
  {'def': 'Marškiniai'},
  {'def': 'Krekls'},
  {'def': 'Košulja'},
  {'def': 'Chemise'},
  {'def': 'Ing'},
  {'def': 'Košeľa'},
  {'def': 'Léine'},
  {'def': 'Camisa'},
  {'def': 'Skjorta'},
  {'def': 'Πουκάμισο'},
  {'def': 'Shirt'},
  {'def': 'Camicia'},
  {'def': 'Camisa'},
  {'def': 'Särk'},
  {'def': 'Alkandora'},
  {'def': 'Košile'},
  {'def': 'Koszula'},
  {'def': 'Cămașă'},
  {'def': 'Skjorte'},
  {'def': 'Overhemd'}],
 'edmConceptPrefLabelLangAware': {'de': ['Hemd'],
  'ru': ['Рубашка'],
  'fi': ['Paita'],
  'pt': ['Camisa'],
  'bg': ['Риза'],
  'lt': ['Marškiniai'],
  'lv': ['Krekls'],
  'hr': ['Košulja'],
  'fr': ['Chemise'],
  'hu': ['Ing'],
  'sk': ['Košeľa'],
  'ga': ['Léine'],
  'ca': ['Camisa'],
  'sv': ['Skjorta'],
  'el': ['Πουκάμισο'],
  'en': ['Shirt'],
  'it': ['Camicia'],
  'es': ['Camisa'],
  'et': ['Särk'],
  'eu': ['Alkandora'],
  'cs': ['Košile'],
  'pl': ['Koszula'],
  'ro': ['Cămașă'],
  'da': ['Skjorte'],
  'nl': ['Overhemd']},
 'edmDatasetName': ['401_Muuseumid'],
 'edmIsShownAt': ['https://www.muis.ee/museaalView/534165'],
 'edmIsShownBy': ['https://www.muis.ee/digitaalhoidla/api/meedia/originaal?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7'],
 'edmPreview': ['https://api.europeana.eu/thumbnail/v2/url.json?uri=https%3A%2F%2Fwww.muis.ee%2Fdigitaalhoidla%2Fapi%2Fmeedia%2Foriginaal%3Fid%3D7c0829e9-1731-4ad1-894f-7980bb09f3c7&type=IMAGE'],
 'europeanaCollectionName': ['401_Muuseumid'],
 'europeanaCompleteness': 5,
 'guid': 'https://www.europeana.eu/item/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT?utm_source=api&utm_medium=api&utm_campaign=dialialika',
 'id': '/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT',
 'index': 0,
 'language': ['et'],
 'link': 'https://api.europeana.eu/record/401/item_O55A2YTA2TMLVLCDBPCUIPSNMBJEJTRT.json?wskey=dialialika',
 'organizations': ['http://data.europeana.eu/organization/1482250000000435049',
  'http://data.europeana.eu/organization/1482250000026719048'],
 'previewNoDistribute': False,
 'provider': ['Estonian e-Repository and Conservation of Collections'],
 'rights': ['http://creativecommons.org/publicdomain/zero/1.0/'],
 'score': 246.3116,
 'timestamp': 1688490887425,
 'timestamp_created': '2022-05-10T08:10:51.546Z',
 'timestamp_created_epoch': 1652170251546,
 'timestamp_update': '2022-05-10T08:10:51.546Z',
 'timestamp_update_epoch': 1652170251546,
 'title': ['varrukad, hame, naiste', 'sleeves, hame, women'],
 'type': 'IMAGE',
 'ugc': [False]}

We use the edmIsShownBy value for our URL, and indeed this value, as returned by Europeana, redirects to the "not found" image. @Hobbesball - would you happen to have any insight on this?
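As a sketch of the mapping described here, the snippet below pulls the landing page and direct URL out of a result item and applies a simple heuristic for spotting the placeholder. The helper names are hypothetical, and the heuristic (a final URL containing `not_found`, or a redirect to a non-image content type) is an assumption based on the `not_found.txt` behaviour mentioned earlier in the thread, not confirmed provider behaviour.

```python
from typing import Optional


def first(values: Optional[list]) -> Optional[str]:
    """Return the first entry of a list field, or None if it is missing or empty."""
    return values[0] if values else None


def extract_urls(item: dict) -> dict:
    """Map a Europeana result item to the two URL fields discussed above."""
    return {
        "foreign_landing_url": first(item.get("edmIsShownAt")),
        "url": first(item.get("edmIsShownBy")),
    }


def is_probably_broken(
    requested_url: str, final_url: str, content_type: Optional[str]
) -> bool:
    """Heuristic (assumed): flag responses that end up on a placeholder.

    ``final_url`` and ``content_type`` would come from an HTTP client after
    following redirects; they are passed in so this stays network-free.
    """
    redirected = final_url != requested_url
    not_an_image = not (content_type or "").lower().startswith("image/")
    placeholder = "not_found" in final_url.lower()
    return placeholder or (redirected and not_an_image)


item = {
    "edmIsShownAt": ["https://www.muis.ee/museaalView/534165"],
    "edmIsShownBy": [
        "https://www.muis.ee/digitaalhoidla/api/meedia/originaal"
        "?id=7c0829e9-1731-4ad1-894f-7980bb09f3c7"
    ],
}
urls = extract_urls(item)

# Hypothetical redirect target, used only to exercise the heuristic.
hypothetical_target = "https://www.muis.ee/not_found.txt"
broken = is_probably_broken(urls["url"], hypothetical_target, "text/plain")
```

A check along these lines could be run over a sample of records to estimate how many direct URLs are affected, while the landing pages remain the fallback link.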

AetherUnbound commented 2 months ago

I've emailed the folks at Europeana directly to ask them about this issue.