alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

FIX normalize_url optional everywhere #88

Closed moreymat closed 4 years ago

moreymat commented 4 years ago

This PR makes normalize_url truly optional, which solves issue #87.

Concretely, there are two changes. First, the parse operation now accepts a param normalize_url: False, which disables normalization of the URLs parsed from a webpage. Second, the normalize_url param for fetch now applies to the response URL (via memorious.logic.http.ContextHttpResponse) as well as the request URL.
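To illustrate the idea of an opt-out flag on link parsing, here is a minimal standalone sketch using only the Python standard library. It is not the memorious implementation: `parse_links`, `LinkCollector` and `toy_normalize` are hypothetical names, and the toy normalizer (lowercase scheme and host, drop the fragment) is only an assumed stand-in for whatever normalize_url actually does.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit


def toy_normalize(url):
    """Hypothetical stand-in normalizer: lowercase scheme/host, drop fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, "")
    )


class LinkCollector(HTMLParser):
    """Collects the raw href values of all <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def parse_links(html, base_url, normalize_url=True):
    """Resolve links against base_url; normalize only if the flag is set."""
    collector = LinkCollector()
    collector.feed(html)
    urls = [urljoin(base_url, href) for href in collector.hrefs]
    if normalize_url:
        urls = [toy_normalize(u) for u in urls]
    return urls


html = '<a href="/Docs/Page#top">x</a>'
base = "http://example.com/"
raw_links = parse_links(html, base, normalize_url=False)
normalized_links = parse_links(html, base)
```

With the flag set to False, the links are only resolved against the base URL and otherwise passed through untouched, which is the behaviour this PR aims for in the parse operation.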

pudo commented 4 years ago

I'm curious if we shouldn't try to fix normalize_url instead to make sure it generates valid URLs in these cases...

moreymat commented 4 years ago

@pudo in my case, the URLs generated by normalize_url were syntactically valid but resulted in a 404-like page from the server. I don't see how one could infer or anticipate this, but I'm no expert in web programming...
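As a rough illustration of how a rewritten URL can stay syntactically valid and still differ from what a server expects, here is a small sketch. The rewrite shown (sorting query parameters) and the helper name `sort_query` are assumptions chosen for the example, not a description of what normalize_url does or of the URLs from issue #87.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sort_query(url):
    """Hypothetical rewrite: reorder query parameters alphabetically."""
    parts = urlsplit(url)
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(pairs), parts.fragment)
    )


original = "http://example.com/search?page=2&id=7"
rewritten = sort_query(original)

# Both strings parse cleanly as URLs, but they are not byte-identical.
# A server that is sensitive to the exact form of the URL could answer
# them differently, and nothing in the URL itself reveals that.
print(original)
print(rewritten)
```

Whether a given server treats the two forms as the same resource can only be found out by requesting them, which is why an opt-out is useful even if the normalizer itself is correct.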