ckan / datapusher

A standalone web service that pushes data files from a CKAN site's resources into its DataStore
GNU Affero General Public License v3.0

XLSX no longer processing due to xlrd #232

Open bushong1 opened 3 years ago

bushong1 commented 3 years ago

So it looks like the dependency messytables uses xlrd for Excel file processing. Since version 2.0, xlrd no longer supports XLSX files due to, as I understand it, security concerns. messytables appears to be a dead project, with no activity in the last 2 years. This Stack Overflow post says that xlrd should be swapped out for openpyxl, but with messytables unmaintained, that seems unlikely to happen. Is any effort being made to support XLSX files?
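
For readers following along, here is a minimal sketch of what "swap xlrd for openpyxl" could look like for XLSX input while keeping xlrd for legacy XLS. It is not code from datapusher or messytables. The function names `detect_excel_format` and `iter_rows` are hypothetical, and it assumes the third-party packages openpyxl and xlrd are installed.

```python
import zipfile

# Signature of the OLE2 compound file container used by legacy .xls files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def detect_excel_format(path):
    """Return 'xlsx', 'xls', or None, based on file content rather than extension.

    Hypothetical helper for illustration only.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head == OLE2_MAGIC:
        return "xls"
    # An XLSX file is a zip archive that contains xl/workbook.xml.
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            if "xl/workbook.xml" in zf.namelist():
                return "xlsx"
    return None


def iter_rows(path):
    """Yield the rows of the first sheet as lists of cell values.

    Hypothetical helper: XLSX goes to openpyxl, XLS stays with xlrd.
    """
    fmt = detect_excel_format(path)
    if fmt == "xlsx":
        from openpyxl import load_workbook  # third-party, assumed installed

        # read_only=True streams rows instead of loading the whole sheet.
        wb = load_workbook(path, read_only=True, data_only=True)
        try:
            for row in wb.worksheets[0].iter_rows(values_only=True):
                yield list(row)
        finally:
            wb.close()
    elif fmt == "xls":
        import xlrd  # third-party, assumed installed; 2.x still reads .xls

        sheet = xlrd.open_workbook(path).sheet_by_index(0)
        for i in range(sheet.nrows):
            yield sheet.row_values(i)
    else:
        raise ValueError(f"not a recognised Excel file: {path}")
```

Detecting the format from the file content avoids relying on the extension, which can be wrong for uploaded resources.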

anuveyatsu commented 2 years ago

@bushong1 there might be some work from our side to fix this, but it has not been confirmed yet. Also, I'd consider replacing Datapusher with Aircan, but you'd need to create a new DAG for XLSX loading.

fishbone1 commented 2 years ago

When I try to upload an XLSX file, the state remains "pending" forever, which is odd.

categulario commented 2 years ago

It seems to me that the option is to replace the messytables dependency with its successor, frictionless.

EricSoroos commented 2 years ago

We're also seeing that some .ods files aren't processed well by messytables, essentially causing OOM errors after consuming more than 4 GB of memory. Among other reasons, it extracts the zip file into memory and potentially duplicates cells in rows many times to fill a large empty spreadsheet.
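
As an illustration of how the two memory problems described above could be avoided, here is a minimal standard-library sketch. It is not the messytables implementation. The function name `iter_ods_rows` is hypothetical, and the sketch simplifies heavily: it reads cell text only, ignores covered cells and nested tables, and drops blank rows entirely, so row positions are not preserved.

```python
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ROW = f"{{{TABLE_NS}}}table-row"
CELL = f"{{{TABLE_NS}}}table-cell"
PARA = f"{{{TEXT_NS}}}p"
COLS_REPEATED = f"{{{TABLE_NS}}}number-columns-repeated"


def iter_ods_rows(path):
    """Yield non-empty rows of an .ods file as lists of strings.

    Hypothetical helper for illustration only.
    """
    # zf.open() decompresses content.xml as a stream, so the archive member
    # is never held in memory as a whole.
    with zipfile.ZipFile(path) as zf, zf.open("content.xml") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag != ROW:
                continue
            row, pending_blanks = [], 0
            for cell in elem.iter(CELL):
                repeat = int(cell.get(COLS_REPEATED, "1"))
                text = "\n".join("".join(p.itertext()) for p in cell.iter(PARA))
                if not text:
                    # Count blank cells without materialising them yet.
                    pending_blanks += repeat
                    continue
                # Blanks are only written out when real data follows them,
                # so trailing padding cells are never expanded.
                row.extend([""] * pending_blanks)
                pending_blanks = 0
                row.extend([text] * repeat)
            if row:
                yield row
            # Rows repeated via number-rows-repeated are empty padding in the
            # common case and are skipped here rather than duplicated.
            elem.clear()  # release the cells of the row just processed
```

A large run of blank cells between two filled cells is still expanded in this sketch, so a sparse row representation would be needed for sheets that are wide as well as sparse.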

pkernevez commented 1 year ago

Any news on this issue? Has someone found a solution (like Aircan)?

categulario commented 1 year ago

I've been using datapusher-plus in production. It has more active development and supports xlsx and ods.