Data4Democracy / internal-displacement

Studying news events and internal displacement.

Manage PDF scraping #43

Closed georgerichardson closed 7 years ago

georgerichardson commented 7 years ago

PDF scraping could get fairly hard-disk intensive and could slow down scraping when doing a bulk load of URLs. Can we:

  1. Have the option to turn off PDF scraping? What part of the code should control this?
  2. Delete a PDF as soon as it has been downloaded and parsed?
coldfashioned commented 7 years ago

re: point 2 - Short answer: yes. Adding os.remove('path/to/file/filename.pdf') somewhere after the file has been parsed will delete it from the directory. http://stackoverflow.com/questions/6996603/delete-a-file-or-folder-in-python
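For instance, something like this rough sketch (parse_and_delete_pdf is a made-up helper name, and the textract.process call just stands in for however the parsing currently happens):

```python
import os
import textract


def parse_and_delete_pdf(pdf_path):
    """Extract text from a downloaded PDF, then remove the file from disk.

    Sketch only; the real parsing code may look quite different.
    """
    try:
        text = textract.process(pdf_path).decode('utf-8')
    finally:
        # Delete the temporary PDF whether or not parsing succeeded
        if os.path.exists(pdf_path):
            os.remove(pdf_path)
    return text
```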

simonb83 commented 7 years ago

As for turning off pdf scraping, maybe a flag could be passed to scrape(url) in scraper.py to control whether to go ahead and scrape the pdf or just ignore it.
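A minimal sketch of that idea, assuming a keyword argument named scrape_pdf and hypothetical scrape_pdf_article / scrape_html_article helpers standing in for the existing code paths:

```python
def scrape(url, scrape_pdf=True):
    """Scrape an article from url, optionally skipping PDF links.

    Hypothetical signature; the real scrape() in scraper.py may differ.
    """
    if url.lower().endswith('.pdf'):
        if not scrape_pdf:
            return None  # caller decides how to record skipped PDFs
        return scrape_pdf_article(url)  # placeholder for the existing PDF path
    return scrape_html_article(url)     # placeholder for the existing HTML path
```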

georgerichardson commented 7 years ago

Yeah, doesn't need to be anything fancy.

This makes me think of something else though. scrape(url) is called from SQLArticleInterface(), so if we pass any flags to the scraper, they also have to be passed through the interface, which seems a bit redundant. Would it make more sense to have the interface called by the scraper at the end of scraping each article, or something else? Or is it not a big deal?

jlln commented 7 years ago

I think the best place to insert the flags would be the process_urls function of SQLArticleInterface. Because scraping is concurrent (i.e. scrape will always be called by an outer function), we cannot avoid passing flags through an outer function.
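Something along these lines, with scrape_pdf forwarded from process_urls down to scrape(). This is only a sketch under the assumption that process_urls runs the scrapes in a thread pool; the import path and store_article are placeholders for the existing internals:

```python
from concurrent.futures import ThreadPoolExecutor

from scraper import scrape  # hypothetical import path


class SQLArticleInterface:
    # ... existing connection / storage code ...

    def process_urls(self, urls, scrape_pdf=True, max_workers=4):
        """Scrape a batch of URLs concurrently, forwarding the PDF flag.

        Sketch only; the real process_urls will differ in detail.
        """
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            results = executor.map(
                lambda u: scrape(u, scrape_pdf=scrape_pdf), urls)
        for article in results:
            if article is not None:
                self.store_article(article)  # placeholder for the storage step
```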

georgerichardson commented 7 years ago

Yeah, that's true. That's a place for it that makes sense too.

georgerichardson commented 7 years ago

@coldfashioned I just looked at your code again and realised it overwrites the file every time, so there's not much chance of a large build-up of pdfs on disk anyway.

simonb83 commented 7 years ago

@georgerichardson if at some point we end up processing pdfs concurrently, do we need to worry about them being saved to disk with the same name?

coldfashioned commented 7 years ago

Thanks - I wrote it that way on purpose, so it didn't just create a billion pdfs on my system. I don't think there is a way to feed an HTTP response to textract. If you want to process concurrently, you could add an index to the file name pretty easily, something like:

return os.path.join('./', 'file_to_convert_{}.pdf'.format(index))

The index could be fed in with the function call, or created/incremented inside the function. Or skip that altogether and just save the file using the URL name + .pdf. Just a couple of thoughts.
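For the URL-based naming, one minimal option is to hash the URL so every download gets a filesystem-safe, unique path (a sketch; pdf_path_for is a hypothetical helper name and hashing is just one way to do it):

```python
import hashlib
import os


def pdf_path_for(url, directory='./'):
    """Return a per-URL file path so concurrent downloads don't collide.

    Sketch only; naming scheme and directory are assumptions.
    """
    name = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return os.path.join(directory, name + '.pdf')
```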