HXLStandard / libhxl-python

Python support library for the Humanitarian Exchange Language (HXL) data standard.
The Unlicense
41 stars 11 forks source link

Investigate using tabulator instead of xlrd etc. #93

Closed mcarans closed 5 years ago

mcarans commented 7 years ago

As libhxl streams data in various formats it may be sensible, if feasible, to use https://github.com/frictionlessdata/tabulator-py rather than handling each format (csv, xls, xlsx, json etc.) within libhxl.

Then issues like handling merged cells in xls, xlsx (https://github.com/frictionlessdata/tabulator-py/issues/171) and handling multi line headers (https://github.com/frictionlessdata/tabulator-py/issues/177) need only be fixed in one library.

Investigate the feasibility of using tabulator.

Example:

tabulator.config.BYTES_SAMPLE_SIZE = 1000000
stream = Stream('../WFPVAM_MM35_Big-Table.xlsx', headers=13, fill_merged_cells=True)
stream.open()
print(stream.headers)
for row in stream:
    print(row)
stream.close()
davidmegginson commented 7 years ago

I like this idea. I'd be happy to offload the low-level CSV/Excel/etc wrangling to Frictionless, rather than adding so much special-purpose code in the libhxl-python hxl.io module. Since it's a major structural change, I'll probably wait a couple of weeks until I'm able to give it a lot of testing, because I don't want to break any of the existing toolchains that rely on the HXL Proxy.

roll commented 7 years ago

@mcarans @davidmegginson Hi. I'm a Frictionless Data implementations tech lead. Feel free to contact me via tabulator issues or gitter https://gitter.im/frictionlessdata/chat (or PM) if there will be problems with the integration.

davidmegginson commented 7 years ago

Thank you very much. I'm hoping to free up time to look at Frictionless soon. Right now, we use a combination of xlrd and custom code to read tabular data from the following types of sources:

There's also smart code that knows how to pull out the right data when given a regular Google Sheet URL, Dropbox URL, HDX resource URL, etc.

We'd love to add other types of sources, including OpenOffice Calc spreadsheets, and property lists in zipped Shapefiles. If Frictionless already supports all of this, it would give us a great start (and let us retire some custom code as well). I'll investigate soon.

pwalsh commented 7 years ago

Hi @davidmegginson Tabulator supports all of the above formats at present, including OpenOffice Calc spreadsheets. There is also a clean and simple API to extend for new formats as needed.

See the parser modules for a complete list of what is currently supported out-of-the-box in Tabulator:

https://github.com/frictionlessdata/tabulator-py/tree/master/tabulator/parsers

davidmegginson commented 7 years ago

Also look at https://github.com/reubano/meza

More from @mcarans : "Frictionless are plannign to add Meza support into Tabulator using it to read supported file formats that Tabulator currently does not handle"

roll commented 7 years ago

@mcarans @davidmegginson Hi. Do you still consider using tabulator? If yes I've created a special tag libhxl on the tabulator issue tracker - https://github.com/frictionlessdata/tabulator-py/issues?q=is%3Aissue+is%3Aopen+label%3A%7Blibhxl%7D. Do this issues have any kind of priority? Or there are blockers for an integration MVP?

mcarans commented 7 years ago

@roll, yes we are still considering using Tabulator in libhxl.

It is already in use in HDX Python library. I am also using it for other projects including with a partner organisation.

Ignore columns with falsy headers seems more like a bug to me so think it's important to get it fixed. Also the multiline headers handling would be very useful to libhxl.

I would imagine that the zipped csv/xls/xlsx one would be relatively easy to add and also getting sheet by name in Excel/OpenOffice. I would rate these currently as next highest priority.

libhxl needs JSON input and output support not just in the common list of dicts format:

[
  {
     "MyVeryVeryVeryVeryVeryLongField1": 1,
     "MyVeryVeryVeryVeryVeryLongField2": "tree"
  },
  {
     "MyVeryVeryVeryVeryVeryLongField1": 5,
     "MyVeryVeryVeryVeryVeryLongField2": "leaf"
  },
  {
     "MyVeryVeryVeryVeryVeryLongField1": 3,
     "MyVeryVeryVeryVeryVeryLongField2": "root"
  }
]

but also in the compact list of lists format:

[
  [
     "MyVeryVeryVeryVeryVeryLongField1",
     "MyVeryVeryVeryVeryVeryLongField2"
  ],
  [
     1,
     "tree"
  ],
  [
     5,
     "leaf"
  ],
  [
     3,
     "root"
  ]
]

I've made an issue for this: https://github.com/frictionlessdata/tabulator-py/issues/198

Other formats (eg. from Meza) are less important (nice to have).

roll commented 7 years ago

@mcarans Thanks. I agree with your feature implementation complexity (simplicity=) analysis. So we'll try to ship it as soon as possible.

Also answered to you on https://github.com/frictionlessdata/tabulator-py/issues/198

mcarans commented 7 years ago

@davidmegginson I've also added a ticket for outputting JSON to Tabulator. Is there anything else you can think of that might be missing? In particular, I'm wondering how you would capture the proposed JSON output from Tabulator to send out over the web in the HXL Proxy?