getodk / pyxform-http

If file is zipped, assume xlsx #32

Closed: yanokwa closed this 2 years ago

yanokwa commented 2 years ago

Fixes #29

There seem to be two popular ways to detect file types from content in Python: python-magic and filetype. The former depends on the libmagic C library and the latter is pure Python.

I didn't want to add a dependency, and it seemed (I only confirmed with filetype) that neither solution could detect XLS; they could only detect a ZIP. Since detecting a ZIP is straightforward, I pulled the code from filetype and made it into a small method.
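A minimal sketch of this kind of leading-bytes check, modeled on the ZIP signature test that filetype performs. The names `looks_like_zip` and `guess_extension` are hypothetical and are not taken from this PR; the fallback to `.xls` is an assumption based on the PR title.

```python
def looks_like_zip(header: bytes) -> bool:
    """Return True if the first four bytes match a ZIP signature.

    ZIP records start with 'PK' followed by one of a few type byte pairs
    (for example 03 04 for a local file header, 05 06 for an empty archive).
    """
    return (
        len(header) >= 4
        and header[0] == 0x50  # 'P'
        and header[1] == 0x4B  # 'K'
        and header[2] in (0x03, 0x05, 0x07)
        and header[3] in (0x04, 0x06, 0x08)
    )


def guess_extension(header: bytes) -> str:
    """Hypothetical helper: zipped content is assumed to be XLSX, else XLS."""
    return ".xlsx" if looks_like_zip(header) else ".xls"


# The OLE2 container used by legacy .xls files starts with D0 CF 11 E0.
ole2_header = bytes.fromhex("d0cf11e0a1b11ae1")
print(guess_extension(b"PK\x03\x04"))  # zipped, so treated as XLSX
print(guess_extension(ole2_header))    # not zipped, so treated as XLS
```

Only the first four bytes of the upload are needed, so the check can run on the request body before anything is written to disk.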

I added pyxform-clean.xls to the test suite and made sure that worked. I also tried the various test forms.

lindsay-stevens commented 2 years ago

For detecting ZIP only, there's a standard-library function, zipfile.is_zipfile.
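For reference, `zipfile.is_zipfile` accepts either a path or a file-like object, so it can be called on bytes held in memory. The sketch below assumes the upload is available as a `bytes` payload; `is_zip_bytes` and `make_zip_bytes` are hypothetical helper names used only for illustration.

```python
import io
import zipfile


def is_zip_bytes(payload: bytes) -> bool:
    """Run zipfile.is_zipfile against an in-memory payload."""
    return zipfile.is_zipfile(io.BytesIO(payload))


def make_zip_bytes() -> bytes:
    """Build a tiny ZIP archive in memory to use as sample input."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("hello.txt", "hello")
    return buffer.getvalue()


print(is_zip_bytes(make_zip_bytes()))  # a real archive
print(is_zip_bytes(b"not a zip"))      # arbitrary bytes
```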

yanokwa commented 2 years ago

@lindsay-stevens Thanks for the tip!

I just tried for a bit and I'm having trouble getting it to work cleanly. It seems I'd have to write the incoming file to disk, then rename it based on is_zipfile.

My current approach works and doesn't feel horrible, so I'd rather not spend more time getting is_zipfile working. I'll give it one more go after I have some caffeine.

yanokwa commented 2 years ago

is_zipfile is too clever in how it detects zip files and so it won't work for us.

zipfile.is_zipfile('example.xls') # False
zipfile.is_zipfile('example.xlsx') # False
zipfile.is_zipfile('example.xlsx.zip') # True
zipfile.is_zipfile('example.xlsx.zip.foo') # True
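One way the two approaches can disagree, offered as a sketch and not as an explanation of the results above: `zipfile.is_zipfile` looks for the end-of-central-directory record near the end of the data, while a signature check only inspects the first few bytes. The helper `starts_with_zip_signature` is a hypothetical stand-in for the small method described in this PR, and the sample payloads are made up.

```python
import io
import zipfile


def starts_with_zip_signature(data: bytes) -> bool:
    """Hypothetical leading-bytes check: 'PK' plus a known record type."""
    return (
        len(data) >= 4
        and data[:2] == b"PK"
        and data[2] in (0x03, 0x05, 0x07)
        and data[3] in (0x04, 0x06, 0x08)
    )


# Starts like a ZIP local file header but has no end-of-central-directory
# record, as a truncated upload might.
truncated = b"PK\x03\x04" + b"\x00" * 64

# A valid empty archive: just the 22-byte end-of-central-directory record.
empty_archive = b"PK\x05\x06" + b"\x00" * 18

for label, data in (("truncated", truncated), ("empty", empty_archive)):
    print(
        label,
        starts_with_zip_signature(data),
        zipfile.is_zipfile(io.BytesIO(data)),
    )
```

For the truncated payload the signature check passes while `is_zipfile` does not; for the empty archive both agree.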