alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

Normalized URLs and non-reproducibility #94

Closed moreymat closed 4 years ago

moreymat commented 4 years ago

Crawling non-normalized (fragile) URLs #87 is possible thanks to commit e595d18. However, normalization is still applied to the URL that is used as the key and stored in the JSON metadata file, and it cannot be disabled there. As a result, the URL stored in the JSON file does not give access to the page that was actually crawled, which effectively prevents manual inspection and reproducibility.
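To illustrate the mismatch: a normalizer typically sorts query parameters, lowercases the host, and drops fragments. The sketch below is not memorious's actual implementation, just a minimal stand-in showing why the stored (normalized) URL can differ from the URL that was crawled:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def normalize_url(url):
    # Hypothetical normalization, similar in spirit to what a crawler
    # might apply: lowercase the host, sort query parameters, drop the fragment.
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query)))
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, query, ""))

crawled = "https://example.com/doc?b=2&a=1#top"
stored = normalize_url(crawled)
print(stored)  # https://example.com/doc?a=1&b=2
# A fragile server that depends on parameter order may not serve the
# same page for `stored` as it did for `crawled`.
```

If the JSON metadata records only `stored`, re-fetching that URL is not guaranteed to reproduce the crawled page, which is exactly the reproducibility problem described above.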

I'm willing to write the required fixes but would like to know beforehand how you'd like to see it implemented, as my last PR #88 didn't fit your vision :-)

pudo commented 4 years ago

So you propose we get rid of normalization entirely? I'm game with that ....

moreymat commented 4 years ago

@pudo I like the idea of normalizing URLs to avoid crawling and storing copies of the same page. I was surprised to learn how fragile web servers can be: permutations of parameters are not always spurious.

So I don't know whether we should get rid of normalization entirely, but I definitely need to be able to disable it for a number of websites, and removing it would effectively solve that.

sunu commented 4 years ago

Closing this since I have removed URL normalization completely in https://github.com/alephdata/memorious/commit/4c80713098c7d5f23ca72e8d97f015568f9b3d78. Let us know if anything else is broken, @moreymat.