Closed mcarans closed 5 years ago
I like this idea. I'd be happy to offload the low-level CSV/Excel/etc wrangling to Frictionless, rather than adding so much special-purpose code in the libhxl-python hxl.io module. Since it's a major structural change, I'll probably wait a couple of weeks until I'm able to give it a lot of testing, because I don't want to break any of the existing toolchains that rely on the HXL Proxy.
@mcarans @davidmegginson Hi. I'm a Frictionless Data implementations tech lead. Feel free to contact me via tabulator issues or gitter https://gitter.im/frictionlessdata/chat (or PM) if there will be problems with the integration.
Thank you very much. I'm hoping to free up time to look at Frictionless soon. Right now, we use a combination of xlrd and custom code to read tabular data from the following types of sources:
There's also smart code that knows how to pull out the right data when given a regular Google Sheet URL, Dropbox URL, HDX resource URL, etc.
We'd love to add other types of sources, including OpenOffice Calc spreadsheets, and property lists in zipped Shapefiles. If Frictionless already supports all of this, it would give us a great start (and let us retire some custom code as well). I'll investigate soon.
Hi @davidmegginson Tabulator supports all of the above formats at present, including OpenOffice Calc spreadsheets. There is also a clean and simple API to extend for new formats as needed.
See the parser modules for a complete list of what is currently supported out-of-the-box in Tabulator:
https://github.com/frictionlessdata/tabulator-py/tree/master/tabulator/parsers
Also look at https://github.com/reubano/meza
More from @mcarans : "Frictionless are plannign to add Meza support into Tabulator using it to read supported file formats that Tabulator currently does not handle"
@mcarans
@davidmegginson
Hi. Do you still consider using tabulator
? If yes I've created a special tag libhxl
on the tabulator
issue tracker - https://github.com/frictionlessdata/tabulator-py/issues?q=is%3Aissue+is%3Aopen+label%3A%7Blibhxl%7D. Do this issues have any kind of priority? Or there are blockers for an integration MVP?
@roll, yes we are still considering using Tabulator in libhxl.
It is already in use in HDX Python library. I am also using it for other projects including with a partner organisation.
Ignore columns with falsy headers seems more like a bug to me so think it's important to get it fixed. Also the multiline headers handling would be very useful to libhxl.
I would imagine that the zipped csv/xls/xlsx one would be relatively easy to add and also getting sheet by name in Excel/OpenOffice. I would rate these currently as next highest priority.
libhxl needs JSON input and output support not just in the common list of dicts format:
[
{
"MyVeryVeryVeryVeryVeryLongField1": 1,
"MyVeryVeryVeryVeryVeryLongField2": "tree"
},
{
"MyVeryVeryVeryVeryVeryLongField1": 5,
"MyVeryVeryVeryVeryVeryLongField2": "leaf"
},
{
"MyVeryVeryVeryVeryVeryLongField1": 3,
"MyVeryVeryVeryVeryVeryLongField2": "root"
}
]
but also in the compact list of lists format:
[
[
"MyVeryVeryVeryVeryVeryLongField1",
"MyVeryVeryVeryVeryVeryLongField2"
],
[
1,
"tree"
],
[
5,
"leaf"
],
[
3,
"root"
]
]
I've made an issue for this: https://github.com/frictionlessdata/tabulator-py/issues/198
Other formats (eg. from Meza) are less important (nice to have).
@mcarans Thanks. I agree with your feature implementation complexity (simplicity=) analysis. So we'll try to ship it as soon as possible.
Also answered to you on https://github.com/frictionlessdata/tabulator-py/issues/198
@davidmegginson I've also added a ticket for outputting JSON to Tabulator. Is there anything else you can think of that might be missing? In particular, I'm wondering how you would capture the proposed JSON output from Tabulator to send out over the web in the HXL Proxy?
As libhxl streams data in various formats it may be sensible, if feasible, to use https://github.com/frictionlessdata/tabulator-py rather than handling each format (csv, xls, xlsx, json etc.) within libhxl.
Then issues like handling merged cells in xls, xlsx (https://github.com/frictionlessdata/tabulator-py/issues/171) and handling multi line headers (https://github.com/frictionlessdata/tabulator-py/issues/177) need only be fixed in one library.
Investigate the feasibility of using tabulator.
Example: