EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

`hxl2pandas`: Pandas DataFrame #4

Open fititnt opened 3 years ago

fititnt commented 3 years ago
hxl +public  
meta +status working-draft
meta +discussion+public  
meta +id EticaAI-Data_HXL-Data-Science-file-formats_Pandas
meta +hxlproxy +url https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D723336363
meta +specification +url https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes
meta +seealso +url https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
meta +description +i_eng Important point: both the hxl2pandas and the EticaAI-Data_HXL-Data-Science-file-formats_Pandas reference table are mostly a reference for how pandas (more specifically, DataFrame) could be used as an intermediate format to export HXL to other formats already supported by Pandas. While the reference table may still be useful for those who are doing manual conversion, or to help understand how different tools used for data mining / machine learning would use HXL attributes, the hxl2pandas may not be implemented at all. Also, some of the intermediate formats may be converted using other libraries.

At this moment I'm not 100% sure whether using pandas just because it can export to several formats is a good approach.

First, there is a problem with overhead (though this alone is not the main reason). But if the underlying libraries could eventually store some additional metadata (for example, enough to reconstruct the source hashtags), that would be very nice to have.
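To make the metadata idea concrete, here is a minimal sketch, not a proposed `hxl2pandas` implementation. It assumes an HXLated CSV whose first row holds the human-readable headers and whose second row holds the hashtags, and it keeps the hashtags in `DataFrame.attrs`, which pandas documents as experimental. The helper names `read_hxlated_csv` and `to_hxlated_csv`, the `hxl_hashtags` key, and the sample data are all hypothetical.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: row 1 is the header, row 2 is the HXL hashtag row.
HXL_CSV = """Province,Affected
#adm1+name,#affected
Coast,1200
Plains,350
"""


def read_hxlated_csv(text):
    """Load an HXLated CSV, keeping the hashtag row as DataFrame metadata."""
    # The first data row (under the header) is the hashtag row.
    hashtags = pd.read_csv(io.StringIO(text), nrows=1).iloc[0].to_dict()
    # Skip the hashtag row (file line index 1) so columns get real dtypes.
    df = pd.read_csv(io.StringIO(text), skiprows=[1])
    df.attrs["hxl_hashtags"] = hashtags  # experimental pandas feature
    return df


def to_hxlated_csv(df):
    """Rebuild the HXLated CSV: header row, hashtag row, then the data."""
    hashtags = df.attrs.get("hxl_hashtags", {})
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(df.columns)
    writer.writerow([hashtags.get(col, "") for col in df.columns])
    out.write(df.to_csv(header=False, index=False))
    return out.getvalue()


df = read_hxlated_csv(HXL_CSV)
print(df.attrs["hxl_hashtags"])
print(to_hxlated_csv(df))
```

The catch is that `attrs` lives only on the in-memory object: exporters such as `to_csv` do not write it, so each target format would still need its own way to carry the hashtags.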

The overhead starts to become a problem if it is 100% guaranteed that the DataFrame loads everything into memory (even if it is just a numerical representation of strings) before saving to the target format. While this is still more efficient than, say, loading an entire Excel file or CSV, I think that if someone were using this to convert a huge CSV, it would be acceptable to be slower: first save to a local file in /tmp, then convert the HXLated CSV using the header as additional instructions for whatever the new format is, with the most efficient loader possible.
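A rough sketch of that slower but memory-friendly path, using only the Python standard library: read the HXLated CSV one row at a time and let the hashtag row decide how each value is written. JSON Lines is used here only as a stand-in target format. The function names and the typing rule (treat `#affected` and anything carrying `+num` as an integer) are assumptions for illustration, not part of any existing tool.

```python
import csv
import io
import json


def is_numeric_tag(tag):
    """Illustrative rule only: '#affected' or a '+num' attribute means integer."""
    parts = tag.replace(" ", "").split("+")
    return parts[0] == "#affected" or "num" in parts[1:]


def hxl_csv_to_jsonl(src, dst):
    """Stream an HXLated CSV to JSON Lines without loading it all in memory."""
    reader = csv.reader(src)
    next(reader)  # skip the human-readable header row
    hashtags = [tag.strip() for tag in next(reader)]  # HXL hashtag row
    numeric = [is_numeric_tag(tag) for tag in hashtags]
    rows = 0
    for row in reader:
        record = {}
        for tag, is_num, value in zip(hashtags, numeric, row):
            # Empty cells stay as empty strings instead of failing int().
            record[tag] = int(value) if is_num and value else value
        dst.write(json.dumps(record) + "\n")
        rows += 1
    return rows


# In-memory stand-ins for files that would normally live on disk.
src = io.StringIO("Province,Affected\n#adm1+name,#affected\nCoast,1200\nPlains,350\n")
dst = io.StringIO()
print(hxl_csv_to_jsonl(src, dst))
print(dst.getvalue())
```

With real files, `src` and `dst` would be handles from `open(...)`, for example on a temporary copy saved under /tmp; passing `newline=""` when opening the CSV is what the `csv` module documentation recommends.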

Anyway, if I have to focus, the strategies that generate file formats for tools with friendly interfaces (like Orange and Weka; neither may require any command-line use at all) seem more of a win-win than formats where the end user could simply consume CSVs directly. But these advanced cases can still serve as a reference on how to choose the attributes, rather than considering only two applications (Orange and Weka).