WordPress / openverse

Openverse is a search engine for openly-licensed media. This monorepo includes all application code.
https://openverse.org
MIT License

Metropolitan receives records with `None` url #1281

Closed stacimc closed 1 year ago

stacimc commented 1 year ago

Description

Several ingestion days during a recent run of metropolitan_museum_reingestion_workflow have raised the following error:

  File "/opt/airflow/openverse_catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 423, in process_batch
    store.add_item(**record)
  File "/opt/airflow/openverse_catalog/dags/common/storage/image.py", line 145, in add_item
    image = self._get_image(**image_data)
  File "/opt/airflow/openverse_catalog/dags/common/storage/image.py", line 152, in _get_image
    image_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/openverse_catalog/dags/common/storage/media.py", line 142, in clean_media_metadata
    media_data["filetype"] = self._validate_filetype(
  File "/opt/airflow/openverse_catalog/dags/common/storage/media.py", line 312, in _validate_filetype
    filetype = extract_filetype(url, self.media_type)
  File "/opt/airflow/openverse_catalog/dags/common/extensions.py", line 9, in extract_filetype
    possible_filetype = url.split(".")[-1]
AttributeError: 'NoneType' object has no attribute 'split'

It looks like somehow we're attempting to process a record with a null image_url. We should add a check for this.
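
A minimal sketch of such a check, assuming records are plain dicts keyed by `image_url` as the traceback suggests. The helper name `drop_records_without_url` is hypothetical and is not part of the existing ingester code:

```python
import logging

logger = logging.getLogger(__name__)


def drop_records_without_url(records, url_key="image_url"):
    """Yield only the records that carry a non-empty URL under url_key.

    Hypothetical helper: the real check could equally live in
    get_record_data or in the media store itself.
    """
    for record in records:
        # Covers a missing record, a missing key, None, and an empty string.
        if not record or not record.get(url_key):
            logger.warning("Skipping record with missing %s: %r", url_key, record)
            continue
        yield record
```

Filtering before `store.add_item(**record)` would keep a `None` URL from ever reaching `extract_filetype`, and the warning leaves a trace of which record was dropped.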

Reproduction

Unfortunately, in each of the production instances the error happens more than 5 hours into the DagRun. Metropolitan has only a single batch, which just gets a list of object IDs; the bulk of the work is done in get_record_data, which makes a request for each ID. As a result, the current logging setup gives us no quick way to reproduce the issue other than doing a full run for the problematic dates. To do so, use the DAG config:

{
"initial_query_params":{"metadataDate":"2022-06-18"}
}
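
One way to make a failure like this quicker to reproduce would be to log the object ID being handled when the error occurs. A rough sketch, assuming hypothetical `fetch_record` and `add_item` callables that stand in for get_record_data and store.add_item; this is not the actual ingester loop:

```python
import logging

logger = logging.getLogger(__name__)


def ingest_with_context(object_ids, fetch_record, add_item):
    """Process IDs one at a time, logging the offending ID before re-raising.

    fetch_record and add_item are hypothetical stand-ins passed in by the
    caller; the function only adds logging around them.
    """
    for object_id in object_ids:
        record = fetch_record(object_id)
        try:
            add_item(**record)
        except Exception:
            # logger.exception records the traceback along with the ID and record.
            logger.exception(
                "Failed to add record for object ID %s: %r", object_id, record
            )
            raise
```

With the failing ID in the logs, a single request for that object could be replayed instead of a full multi-hour run.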

AetherUnbound commented 1 year ago

It seems we're still seeing this issue in production: