`hxl2pandas`: Pandas DataFrame

hxl +public
meta +status	working-draft
meta +discussion+public
meta +id	EticaAI-Data_HXL-Data-Science-file-formats_Pandas
meta +hxlproxy +url	https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D723336363
meta +specification +url	https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#dtypes
meta +seealso +url	https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
meta +description +i_eng	Important point: both `hxl2pandas` and the EticaAI-Data_HXL-Data-Science-file-formats_Pandas reference table serve mostly as a reference for how pandas (more specifically, DataFrame) could be used as an intermediate format to export HXL to other formats already supported by pandas. While the reference table may still be useful for those doing manual conversion, or to help understand how different data mining / machine learning tools would use HXL attributes, `hxl2pandas` may not be implemented at all. Also, some of the intermediate formats may be converted using other libraries.

At this moment I'm not 100% sure that using pandas just because it can export to several formats is a good approach.

First, there is a problem with overhead (though this alone is not the main reason). But if the underlying libraries could eventually store some additional metadata (for example, enough to reconstruct the source hashtags), that would be very nice to have.
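As a rough illustration of the metadata idea, here is a minimal sketch that keeps the HXL hashtag row next to the data in `DataFrame.attrs` and uses it to rebuild the HXLated CSV. The sample data, the `hxl_hashtags` key and the two helper functions are hypothetical, made up for this example. `DataFrame.attrs` is documented by pandas as experimental, and this sketch does not assume that any particular export format would persist it.

```python
import csv
import io

import pandas as pd

# Hypothetical HXLated CSV: first row is headers, second row is HXL hashtags.
HXL_CSV = """Province,Affected,Date
#adm1+name,#affected+num,#date
Coast,120,2020-01-15
Mountains,45,2020-01-16
"""


def read_hxl_csv(text: str) -> pd.DataFrame:
    """Load an HXLated CSV, keeping the hashtag row in DataFrame.attrs."""
    rows = csv.reader(io.StringIO(text))
    headers = next(rows)
    hashtags = next(rows)
    # skiprows=[1] drops the hashtag row so dtypes are inferred from data only.
    df = pd.read_csv(io.StringIO(text), skiprows=[1])
    # "hxl_hashtags" is an arbitrary key chosen for this sketch.
    df.attrs["hxl_hashtags"] = dict(zip(headers, hashtags))
    return df


def to_hxl_csv(df: pd.DataFrame) -> str:
    """Rebuild the HXLated CSV from the data plus the stored hashtags."""
    hashtags = df.attrs["hxl_hashtags"]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(df.columns)
    writer.writerow([hashtags[col] for col in df.columns])
    writer.writerows(df.itertuples(index=False, name=None))
    return out.getvalue()


df = read_hxl_csv(HXL_CSV)
roundtrip = to_hxl_csv(df)
```

In this small case `roundtrip` matches the input text, but only because the values survive dtype inference unchanged; dates that get parsed or floats that get reformatted would not necessarily round-trip byte for byte.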

The overhead starts to become a problem if it is 100% guaranteed that the DataFrame loads everything into memory (even if it is just a numerical representation of strings) before saving to the target format. While this is still more efficient than, say, loading an entire Excel file or CSV, I think that for someone converting from a huge CSV it would be acceptable to be slower: first save to a local file in /tmp, then convert the HXLated CSV, using the header as additional instructions for whatever the new format is, with the most efficient loader possible.
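The slower but memory-friendly alternative could look roughly like the sketch below. It uses only the standard library, reads the HXLated CSV from a temporary file one row at a time, and uses the hashtag row as conversion instructions. The function name, the JSON Lines target format and the rule that a `+num` attribute means "parse as a number" are all assumptions made for illustration, not part of any existing `hxl2pandas` tool.

```python
import csv
import json
import os
import tempfile


def convert_hxl_csv_streaming(src_path: str, dest_path: str) -> int:
    """Stream an HXLated CSV into JSON Lines, one row in memory at a time."""
    count = 0
    with open(src_path, newline="", encoding="utf-8") as src, \
            open(dest_path, "w", encoding="utf-8") as dest:
        rows = csv.reader(src)
        next(rows)             # human-readable headers, not used here
        hashtags = next(rows)  # HXL hashtag row drives the conversion
        for row in rows:
            record = {}
            for tag, value in zip(hashtags, row):
                attributes = tag.split("+")[1:]
                # Assumed rule for this sketch: +num marks numeric columns.
                if "num" in attributes and value != "":
                    record[tag] = float(value)
                else:
                    record[tag] = value
            dest.write(json.dumps(record) + "\n")
            count += 1
    return count


# Demo with a tiny made-up file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "input.hxl.csv")
    dest_path = os.path.join(tmp, "output.jsonl")
    with open(src_path, "w", newline="", encoding="utf-8") as f:
        f.write("Province,Affected\n#adm1+name,#affected+num\n"
                "Coast,120\nMountains,45\n")
    converted = convert_hxl_csv_streaming(src_path, dest_path)
    with open(dest_path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f]
```

Because each row is written out as soon as it is read, memory use stays roughly constant regardless of the size of the source file, at the cost of not having the whole table available for operations that need it.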

Anyway, if we have to focus, the strategies that generate file formats for tools with friendly interfaces (like Orange and Weka; both may not require any command line commands at all) seem more of a win-win than formats where the end user could simply consume the CSV directly. But these advanced cases can still serve as a reference on how to choose the attributes, rather than considering just two applications (Orange and Weka).

EticaAI / HXL-Data-Science-file-formats
