UTMediaCAT / mediacat-domain-crawler

Internet domain crawler

Integrate PDF capture into the domain crawler #8

Closed kstapelfeldt closed 3 years ago

kstapelfeldt commented 3 years ago

Give PDFs a name based on UUID and pass the UUID as part of the JSON object to the post processor.

https://www.npmjs.com/package/uuid

RaiyanRahman commented 3 years ago

Added PDF capture using UUIDs. Each UUID is generated from the URL of the page, with the namespace set to the URL namespace, so a given page always maps to the same filename. The UUID filename is added to the JSON as an attribute for each link. In terms of storage, 20 PDFs take up around 22.4 MB.

kstapelfeldt commented 3 years ago

This is complete and will be added to the big pull request. Unfortunately, PDF generation won't work in the current architecture due to the size of the generated PDFs. This will have to be made a separate service that runs after post-processing (and should be configurable).

Raiyan will put it in another branch for now. Kirsta to add the PDF-grabber to architecture diagram.
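As a rough illustration of what "configurable" could mean for a standalone pdf-grabber service, a config entry along these lines might be added to the crawler's settings (every key and value here is hypothetical, not an existing option):

```json
{
  "pdfGrabber": {
    "enabled": true,
    "runAfter": "post-processing",
    "outputDir": "./pdfs"
  }
}
```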

jacqueline-chan commented 3 years ago

Unfortunately, it seems this new PDF feature has introduced the errors below:

Since the PDF feature is no longer supported in the current architecture, it is perhaps best to remove it for now. To preserve the existing good work, though, we may need to move those changes into another branch and then revert them on: https://github.com/UTMediaCAT/mediacat-domain-crawler/commits/%232-article-plaintext

1) `Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null` (errorInnerHTML.txt). Scope: http://calcalist.co.il/, https://www.debka.com/. Possible fix: unknown. Suggestion: wrap the page evaluation in a try/catch?

2) `TypeError: Cannot read property 'URL' of undefined` (URLundefined.txt). Possible fix: run `npm install` again; there could have been an issue with a library.
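For error 1), the usual cause is a `querySelector` that finds no match and returns `null` before `.innerHTML` is read. A minimal sketch of a null-safe guard, assuming the crawler reads page content via a selector (the selector name and the puppeteer snippet in the comment are illustrative, not the crawler's actual code):

```javascript
// Return the element's innerHTML, or null when the selector matches nothing,
// instead of throwing "Cannot read property 'innerHTML' of null".
function safeInnerHTML(doc, selector) {
  const node = doc.querySelector(selector);
  return node ? node.innerHTML : null;
}

// Inside a puppeteer page.evaluate the same guard would look like:
//   const html = await page.evaluate(
//     () => document.querySelector('article')?.innerHTML ?? null
//   );
```

A try/catch around the whole evaluation (as suggested above) also works, but the null check keeps the crawl going per-link without swallowing unrelated errors.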

kstapelfeldt commented 3 years ago

Closing in favour of a new ticket for a "pdf grabber service".