alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License
311 stars 59 forks

Fix weird filenames created during store stage #177

Closed simonwoerpel closed 3 years ago

simonwoerpel commented 3 years ago

Without this fix, the store stage uses os.path.split instead of os.path.splitext, so a crawled PDF file named

WD-7-028-21-pdf-data.pdf

ends up in the data folder like:

WD_7_028_21_pdf_data.WD_7_028_21_pdf_datapdf

which is probably not as it's supposed to be ;)
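
The difference between the two functions can be sketched with the standard library alone. The helper `naive_extension` below is hypothetical and not taken from the memorious code base; it only assumes that the extension is read from the second element of the split and that dots are stripped from it.

```python
import os.path

file_name = "WD-7-028-21-pdf-data.pdf"


def naive_extension(name, splitter):
    """Hypothetical helper: use the second element of the split, minus dots."""
    _, extension = splitter(name)
    return extension.replace(".", "")


# os.path.split separates the directory from the final path component,
# so for a bare file name the second element is the whole name.
wrong = naive_extension(file_name, os.path.split)

# os.path.splitext separates the root from the extension,
# so the second element is just ".pdf".
right = naive_extension(file_name, os.path.splitext)

print(wrong)  # WD-7-028-21-pdf-datapdf
print(right)  # pdf
```

With os.path.split, the "extension" is the full file name with its dot removed, which would explain the doubled name in the stored file.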

sunu commented 3 years ago

Hi @simonwoerpel! Thanks for noticing and fixing the bug.

The failing test is probably caused by the change in the case of the content-type header. After serialization, header names are case-sensitive. I'm merging this in and will fix the test in a separate commit.
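
As a rough illustration of the case issue, here is a minimal sketch using only the standard library. The header values and the helper `get_header` are assumptions made for the example, not memorious code: once headers have been round-tripped through JSON they are a plain dict, and a lookup with different casing no longer matches.

```python
import json

# Hypothetical headers; the lowercase key is an assumption for illustration.
headers = {"content-type": "application/pdf"}

# Serializing and deserializing leaves a plain, case-sensitive dict.
restored = json.loads(json.dumps(headers))


def get_header(headers, name):
    """Hypothetical helper: look up a header name while ignoring case."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None


print(restored.get("Content-Type"))          # None
print(get_header(restored, "Content-Type"))  # application/pdf
```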