WordPress / openverse-catalog

Identifies and collects data on cc-licensed content across web crawl data and public apis.
https://openverse.org
MIT License

Handle empty string urls for Metropolitan #1102

Closed stacimc closed 1 year ago

stacimc commented 1 year ago

Fixes

Fixes WordPress/openverse#1281 by @stacimc

Description

Previously we added handling to the Metropolitan DAG for when we receive urls that are None. It turns out we also sometimes receive empty strings.

This PR discards records with empty string urls within Metropolitan. We should also consider handling this in the media store: currently, the validate_url_string method used by the media store returns None when it cannot validate a url, and we could raise an error there instead. I did not do so in this PR because I did not want to make changes that apply across all DAGs before the catalog code freeze, and the fix will still need to be made in Metropolitan regardless. If folks think this change would be a good idea, I can create an issue for it.
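For illustration, the discard step could look roughly like the sketch below. This is not the code in this PR: usable_url, discard_unusable, and the "url" record key are hypothetical names, and the sketch assumes records are plain dictionaries.

```python
from typing import Optional


def usable_url(raw_url: Optional[str]) -> Optional[str]:
    """Return the URL if it can be used, otherwise None.

    Hypothetical helper: None and the empty string are both treated
    as "no URL", so the caller needs only one check.
    """
    if not raw_url:
        return None
    return raw_url


def discard_unusable(records: list[dict]) -> list[dict]:
    """Keep only the records that carry a usable URL.

    The "url" key is an assumption made for this sketch.
    """
    kept = []
    for record in records:
        if usable_url(record.get("url")) is None:
            # Discard the record rather than passing it downstream.
            continue
        kept.append(record)
    return kept


records = [
    {"id": 1, "url": "https://example.org/a.jpg"},
    {"id": 2, "url": ""},    # empty string: discarded
    {"id": 3, "url": None},  # None: discarded
]
remaining_ids = [record["id"] for record in discard_unusable(records)]
```

Checking for any falsy value in one place keeps the None case and the empty string case on the same code path, so a later change to one cannot silently miss the other.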

Testing Instructions

This ordinarily takes a very long time to reproduce, but good news, I logged the id of a 'bad' record locally. You can update get_batch_data to return just the bad id. Change this line to:

return [4594,]

Run the DAG on the main branch and you should see the error:

[2023-04-11, 21:19:45 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
provider_data_ingester.IngestionError: 'NoneType' object has no attribute 'split'

With the same change, run the DAG on this branch and the DAG should discard that record and pass.
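One plausible reading of the failure on main, sketched under assumptions: a validator that returns None for input it cannot validate, followed by a step that calls split on the result. Both functions below are hypothetical stand-ins, not the catalog's actual code.

```python
from typing import Optional


def lenient_validate(url: Optional[str]) -> Optional[str]:
    # Hypothetical stand-in for a validator that returns None
    # instead of raising when it cannot validate the input.
    if not url:
        return None
    return url


def file_extension(url: Optional[str]) -> str:
    # Hypothetical downstream step that assumes it was given a string.
    return url.split(".")[-1]


def reproduce(raw_url: Optional[str]) -> str:
    """Return the extension, or the error message if the step fails."""
    try:
        return file_extension(lenient_validate(raw_url))
    except AttributeError as err:
        return str(err)


empty_result = reproduce("")
valid_result = reproduce("https://example.org/a.jpg")
```

In this sketch the empty string passes through the validator as None and the failure only surfaces later, away from its cause, which is the argument for raising inside the validator instead.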


Developer Certificate of Origin

```
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.
```