## Description

Several ingestion days during a recent run of `metropolitan_museum_reingestion_workflow` have raised the following error:
```
  File "/opt/airflow/openverse_catalog/dags/providers/provider_api_scripts/provider_data_ingester.py", line 423, in process_batch
    store.add_item(**record)
  File "/opt/airflow/openverse_catalog/dags/common/storage/image.py", line 145, in add_item
    image = self._get_image(**image_data)
  File "/opt/airflow/openverse_catalog/dags/common/storage/image.py", line 152, in _get_image
    image_metadata = self.clean_media_metadata(**kwargs)
  File "/opt/airflow/openverse_catalog/dags/common/storage/media.py", line 142, in clean_media_metadata
    media_data["filetype"] = self._validate_filetype(
  File "/opt/airflow/openverse_catalog/dags/common/storage/media.py", line 312, in _validate_filetype
    filetype = extract_filetype(url, self.media_type)
  File "/opt/airflow/openverse_catalog/dags/common/extensions.py", line 9, in extract_filetype
    possible_filetype = url.split(".")[-1]
AttributeError: 'NoneType' object has no attribute 'split'
```
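The failure mode itself is easy to confirm in isolation: calling `.split` on `None` raises exactly this `AttributeError`. A standalone snippet (not catalog code) demonstrating it:

```python
# Minimal reproduction of the failure mode: extract_filetype evaluates
# url.split(".")[-1], which raises when url is None.
url = None
try:
    url.split(".")[-1]
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'split'
```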
It appears we are attempting to process a record whose `image_url` is `None`: `extract_filetype` then calls `url.split(".")` on `None` and raises. We should add a check for this.
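A minimal sketch of such a check, assuming the guard lives at the top of `extract_filetype` (this is not the real implementation; the actual function also validates the suffix against known filetypes, which is omitted here):

```python
def extract_filetype(url, media_type=None):
    """Return the filetype suffix from a media URL, or None.

    Sketch of the proposed null check: short-circuit on a missing URL
    instead of letting url.split raise AttributeError. The real
    extract_filetype also checks the suffix against the known
    filetypes for media_type; that validation is omitted here.
    """
    if not url:
        return None
    return url.split(".")[-1]
```

With this guard, a record with a null URL yields `filetype = None` and can be discarded or logged downstream rather than crashing the whole batch.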
## Reproduction
Unfortunately, in each of the production instances the error occurs more than 5 hours into the DagRun. Metropolitan has only a single batch, which just fetches a list of object IDs; the bulk of the work happens in `get_record_data`, which makes a request for each ID. This means the current logging setup gives us no way to quickly reproduce the issue, short of doing a full run for the problematic dates. Use DAG config:
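Until a faster reproduction path exists, a hypothetical debugging helper like the one below could be called on each batch to log the offending records (and their object IDs) before `add_item` is reached. The function name, the `batch` shape, and the `url`/`image_url` key names are assumptions for illustration, not catalog code:

```python
def find_null_url_records(batch):
    """Return (index, record) pairs whose URL field is missing.

    Hypothetical helper: logging these during the long run would
    capture the offending Met object IDs up front, instead of
    discovering them only via the AttributeError hours in. The
    'url'/'image_url' key names are assumptions.
    """
    return [
        (i, record)
        for i, record in enumerate(batch)
        if not (record.get("url") or record.get("image_url"))
    ]
```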