alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

documentcloud integration may need to be reviewed #168

Closed: Rosencrantz closed this issue 3 years ago

Rosencrantz commented 3 years ago

@sunu, @pudo I identified this while testing another aspect of memorious. The documentcloud test is currently failing. To reproduce the issue, run the test suite and wait for test_documentcloud.py to fail. I've confirmed that the test fails on master and on my own branch of memorious.

The cause of the test failure looks to be a missing document: https://documentcloud.org/search/documents.json. The file is not available whether logged in or logged out.

Trying to determine the root cause, I suspect that this file has been removed and replaced with an API that is available at: https://api.www.documentcloud.org/api/documents/search. This new API still returns JSON and can still be paginated. However, some elements have changed. Specifically, a number of fields that were used seem to have been removed.

It seems that it is still possible to determine a link to the original document by adding the document id to the title; whether that will work for every file is currently unknown (to me). Other information might be deducible from details such as the original filetype, or from searching within a document.

Anyway, I suspect that the documentcloud integration that currently exists in memorious might need to be reviewed... Hope this helps.
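
For illustration, paging through the new JSON search endpoint could look roughly like the sketch below. It assumes each page is a JSON object with a `results` list and a `next` URL that is null on the last page; that response shape is an assumption and has not been confirmed in this thread. `fetch_json` and `iter_documents` are hypothetical names, not part of memorious.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen

SEARCH_URL = "https://api.www.documentcloud.org/api/documents/search"


def fetch_json(url: str) -> dict:
    """Fetch a URL and decode its JSON body (hypothetical helper)."""
    with urlopen(url) as response:
        return json.load(response)


def iter_documents(
    url: str = SEARCH_URL,
    fetch: Callable[[str], dict] = fetch_json,
) -> Iterator[dict]:
    """Yield every document from every page of search results.

    Assumption: each page looks like {"results": [...], "next": <url or null>}.
    """
    while url:
        page = fetch(url)
        # Emit the documents on this page, then follow the link to the next one.
        yield from page.get("results", [])
        url = page.get("next")
```

Passing `fetch` as a parameter keeps the paging logic testable without network access, which may be useful given that the failing test depends on a live endpoint.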

sunu commented 3 years ago

Hi @Rosencrantz, thanks for looking into it. Looks like the documentcloud rewrite went live a few days ago: https://twitter.com/dylfreed/status/1370083329010237441

I took a brief look at the new API. For PDF documents, it looks like we can get the source file URL by combining the document id and slug (https://github.com/MuckRock/python-documentcloud/blob/master/documentcloud/documents.py#L194). We can get the frontend canonical URL in a similar way (https://github.com/MuckRock/documentcloud/blob/488c5c092b24aa8b54cdccafbf990728fee2a877/documentcloud/documents/serializers.py#L292). And as you said, most of the other information can be deduced from different fields.
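
A rough sketch of combining the id and slug, under stated assumptions: the exact URL layouts below (`documents/<id>/<slug>.pdf` under an asset base URL, and `documents/<id>-<slug>` on the frontend) are a reading of the linked code and should be checked against it. The function names, the `asset_url` parameter, and the default `site_url` are hypothetical.

```python
def source_pdf_url(doc: dict, asset_url: str) -> str:
    """Build the source PDF URL from a document's id and slug.

    Assumption: files live at <asset_url>/documents/<id>/<slug>.pdf.
    """
    base = asset_url.rstrip("/")
    return f"{base}/documents/{doc['id']}/{doc['slug']}.pdf"


def canonical_url(doc: dict, site_url: str = "https://www.documentcloud.org") -> str:
    """Build the frontend URL from a document's id and slug.

    Assumption: the frontend path is /documents/<id>-<slug>.
    """
    base = site_url.rstrip("/")
    return f"{base}/documents/{doc['id']}-{doc['slug']}"
```

Here `doc` stands for one entry from the search results; it is assumed to carry `id` and `slug` keys.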

Rosencrantz commented 3 years ago

@sunu Thanks for confirming. I should be able to fix this up, if you're happy for me to take a look?

sunu commented 3 years ago

@Rosencrantz great, please go ahead!

Rosencrantz commented 3 years ago

@sunu There is a pull request open and waiting for you!