Previously we added handling to the Metropolitan DAG for when we receive urls that are `None`. It turns out we also sometimes receive empty strings.
This PR discards records with empty string urls within Metropolitan. We should consider also handling this in the media store; currently the `validate_url_string` method used by the media store returns `None` when it cannot validate a url. We could instead raise an error here. I did not do so in this PR because I did not want to make changes that apply across all DAGs before the catalog code freeze, and the fix will still need to be made in Metropolitan regardless. If folks think this change would be a good idea, I can create an issue for it.
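For reviewers who want the shape of the guard without opening the diff, here is a minimal sketch. It is not the actual Metropolitan code: the function names and the `image_url` key are hypothetical, chosen only to show how a single falsy check covers both the earlier `None` case and the empty-string case.

```python
from typing import Optional


def get_image_url(record: dict) -> Optional[str]:
    """Return a usable URL, or None if the record should be discarded.

    The "image_url" key is a hypothetical field name used for illustration.
    """
    url = record.get("image_url")
    # `not url` is True for both None and "", so one check handles the
    # previously fixed None case and the empty-string case.
    if not url:
        return None
    return url


def keep_valid_records(records: list[dict]) -> list[dict]:
    """Drop every record whose URL is missing or empty."""
    return [r for r in records if get_image_url(r) is not None]
```

A provider-level check like this is still needed even if the shared validator later becomes stricter, since the record has to be skipped rather than fail the whole batch.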
Testing Instructions
This ordinarily takes a very long time to reproduce, but the good news is that I logged the id of a 'bad' record locally. You can update `get_batch_data` to return just the bad id. Change this line to:
return [4594,]
Run the DAG on the `main` branch and you should see the error:
[2023-04-11, 21:19:45 UTC] {taskinstance.py:1776} ERROR - Task failed with exception
provider_data_ingester.IngestionError: 'NoneType' object has no attribute 'split'
With the same change, run the DAG on this branch and the DAG should discard that record and pass.
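The `'NoneType' object has no attribute 'split'` message is what Python raises when `.split` is called on `None`. The sketch below is an assumption-laden illustration, not the media store code: `lenient_validate`, `strict_validate`, and `file_extension` are hypothetical stand-ins. It shows how a validator that returns `None` lets the failure surface later in unrelated code, and how raising at validation time would report the problem where it originates.

```python
from typing import Optional


def lenient_validate(url: Optional[str]) -> Optional[str]:
    # Hypothetical stand-in: returns None when the URL cannot be validated.
    return url if url else None


def strict_validate(url: Optional[str]) -> str:
    # Hypothetical alternative: fail at the validation step instead.
    if not url:
        raise ValueError(f"Invalid URL: {url!r}")
    return url


def file_extension(url):
    # Downstream code that assumes it always receives a string.
    return url.split(".")[-1]


# With the lenient validator, the empty string becomes None and the
# failure only appears when downstream code calls .split on it.
try:
    file_extension(lenient_validate(""))
except AttributeError as err:
    message = str(err)
```

With the strict variant, the same input would fail earlier with a `ValueError` that names the bad URL, which is the trade-off raised in the description.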
Checklist
[ ] My pull request has a descriptive title (not a vague title like `Update index.md`).
[ ] My pull request targets the default branch of the repository (`main`) or a parent feature branch.
[ ] My commit messages follow [best practices][best_practices].
[ ] My code follows the established code style of the repository.
[ ] I added or updated tests for the changes I made (if applicable).
[ ] I added or updated documentation (if applicable).
[ ] I tried running the project locally and verified that there are no visible errors.
Developer Certificate of Origin
```
Developer Certificate of Origin
Version 1.1
Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129
Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.
Developer's Certificate of Origin 1.1
By making a contribution to this project, I certify that:
(a) The contribution was created in whole or in part by me and I
have the right to submit it under the open source license
indicated in the file; or
(b) The contribution is based upon previous work that, to the best
of my knowledge, is covered under an appropriate open source
license and I have the right under that license to submit that
work with modifications, whether created in whole or in part
by me, under the same open source license (unless I am
permitted to submit under a different license), as indicated
in the file; or
(c) The contribution was provided directly to me by some other
person who certified (a), (b) or (c) and I have not modified
it.
(d) I understand and agree that this project and the contribution
are public and that a record of the contribution (including all
personal information I submit with it, including my sign-off) is
maintained indefinitely and may be redistributed consistent with
this project or the open source license(s) involved.
```
Fixes
Fixes WordPress/openverse#1281 by @stacimc
[best_practices]: https://git-scm.com/book/en/v2/Distributed-Git-Contributing-to-a-Project#_commit_guidelines