alephdata / memorious

Lightweight web scraping toolkit for documents and structured data.
https://docs.alephdata.org/developers/memorious
MIT License

FIX store mimetype #86

Closed moreymat closed 4 years ago

moreymat commented 4 years ago

This PR extracts the MIME type from the (possibly longer) Content-Type header value. Extra parameters (e.g. charset) cause mimetypes.guess_extension() to return None, so we drop them before guessing the extension.
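The behaviour described above can be reproduced with the standard library alone. The following is a minimal sketch, not the code from this PR: the helper name `extension_for` is hypothetical, and the header value is made up for illustration.

```python
import mimetypes


def extension_for(content_type):
    """Guess a file extension from a raw Content-Type header value.

    Hypothetical helper for illustration; not part of memorious.
    """
    # Keep only the media type and drop parameters such as charset.
    mime_type = content_type.split(";", 1)[0].strip().lower()
    return mimetypes.guess_extension(mime_type)


raw = "application/pdf; charset=utf-8"

# Passing the full header value gives no match.
print(mimetypes.guess_extension(raw))

# Stripping the parameters first lets the lookup succeed.
print(extension_for(raw))
```

Note that this sketch assumes the header is a string; a missing header would need separate handling.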

moreymat commented 4 years ago

I initially wanted to base this PR on develop, but develop is currently 7 commits behind master.

pudo commented 4 years ago

I'm worried that this will break if there is ever a response without a Content-Type header (i.e. where it's None). As for parsing headers with extra params, that's already implemented in pantomime, and it might be better to use it from there:

from pantomime import normalize_mimetype
mime_type = normalize_mimetype(headers.get('Content-Type'))

i.e.:

out = normalize_mimetype('text/html; encoding=utf-8')
assert out == 'text/html'
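
For readers without pantomime installed, the None case raised above can be illustrated with a standard-library sketch. This is not pantomime's implementation: the helper name `safe_mime_type` and the `application/octet-stream` fallback are assumptions chosen for the example.

```python
# Assumed fallback for responses that carry no usable Content-Type.
DEFAULT_MIME = "application/octet-stream"


def safe_mime_type(headers):
    """Return the bare media type from a headers mapping, or a default.

    Hypothetical helper; tolerates a missing or None Content-Type.
    """
    raw = headers.get("Content-Type")
    if not raw:
        # Header absent, None, or empty: fall back instead of raising.
        return DEFAULT_MIME
    # Drop parameters such as charset and normalize case and whitespace.
    mime_type = raw.split(";", 1)[0].strip().lower()
    return mime_type or DEFAULT_MIME


print(safe_mime_type({"Content-Type": "text/html; encoding=utf-8"}))
print(safe_mime_type({}))
```

Returning a default rather than None keeps later extension lookups from failing on a missing header, at the cost of hiding the fact that the server sent nothing.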

moreymat commented 4 years ago

@pudo good point, should be fixed now :-)