EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

`hxlquickimport` #6

Closed fititnt closed 3 years ago

fititnt commented 3 years ago

Meta

hxl +public  
meta +status working-draft
meta +id EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport
meta +discussion+public https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6
meta +hxlproxy +url https://proxy.hxlstandard.org/data?dest=data_view&url=https%3A%2F%2Fdocs.google.com%2Fspreadsheets%2Fd%2F1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY%2Fedit%23gid%3D1097528220
meta +description hxlquickimport is a quick (and wrong) way to import a non-HXL dataset (like a .csv or .xlsx, but it requires headers already on the first row) without human intervention. It will try to slugify each original header and add it as a +attribute for a base hashtag like #meta. The result may be an HXL file with valid syntax (that can be used for automated testing), but most HXL-powered tools would still need human review.

How does it work?

"[Max Power] Kids: there's three ways to do things; the right way, the wrong way and the Max Power way!
[Bart Simpson] Isn't that the wrong way?
[Max Power] Yeah, but faster!"
(via https://www.youtube.com/watch?v=7P0JM3h7IQk)

How to do it the right way?

Read the documentation on https://hxlstandard.org/. (Tip: both HXL Postcards and the hxl-hashtag-chooser are very helpful!)
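The "slugify the header and attach it as an attribute" idea can be sketched roughly as follows. This is a minimal illustration, not the code in bin/hxlquickimport: the function names `slugify_header` and `quick_hxlate`, the `x_` prefix for headers that start with a digit, and the `unknown` fallback are all assumptions made for this example. The default base hashtag `#item` is chosen only because it matches the output shown later in this issue.

```python
import csv
import io
import re
import unicodedata


def slugify_header(header: str) -> str:
    """Turn a free-text header into a string usable as an HXL attribute."""
    # Drop accents, lowercase, and collapse every run of other characters into "_".
    ascii_text = (
        unicodedata.normalize("NFKD", header)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    slug = re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")
    if not slug:
        return "unknown"  # hypothetical fallback for empty headers
    if not slug[0].isalpha():
        slug = "x_" + slug  # hypothetical prefix so the attribute starts with a letter
    return slug


def quick_hxlate(csv_text: str, base_hashtag: str = "#item") -> str:
    """Insert a generated hashtag row right after the first (header) row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return csv_text
    hashtags = [f"{base_hashtag}+{slugify_header(h)}" for h in rows[0]]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows([rows[0], hashtags, *rows[1:]])
    return out.getvalue()


if __name__ == "__main__":
    print(quick_hxlate("sepallength,class\n5.1,Iris-setosa\n"), end="")
```

The generated hashtags are syntactically valid but carry no real meaning, which is why the result still needs human review.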

Spreadsheet data

See EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport (https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=1097528220) for updated content. This is a snapshot.

| Category | Name | URL | URL source |
| --- | --- | --- | --- |
| #item+category | #item +name | #item +url | #item +source +url |
| test-dataset | mx.gob.dados_dataset_informacion-referente-a-casos-covid-19-en-mexico_2020-06-01.csv | https://drive.google.com/file/d/1nQAu6lHvdh2AV7q6aewGBQIxFz7VrCF9/view?usp=sharing | https://github.com/CMedelR/dataCovid19 |
| test-dataset | br.einstein_dataset_covid-pacientes-hospital-albert-einstein-anonimizado_2020-03-28_before-HXLate | https://docs.google.com/spreadsheets/d/1GQVrCQGEetx7RmKaZJ8eD5dgsr5i1zNy_UJpX3_AgrE/edit?usp=sharing | https://www.kaggle.com/einsteindata4u/covid19 |
| research-paper | data-mining-for-the-study-of-the-epidemic-sars-cov-2-covid-19-algorithm-for-the-identification-of-patients-sars-cov-2-covid-19-in-mexico.pdf | https://drive.google.com/file/d/1WaW2b7bGiSZjvc4OdA0kjrBtRTkKV11N/view?usp=sharing | https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3619549 |

fititnt commented 3 years ago

Thanks to @CMedelR!!!

Not only does Ramírez have a research paper called "Data mining for the study of the Epidemic (SARS-CoV-2) COVID-19: Algorithm for the identification of patients (SARS-CoV-2) COVID 19 in Mexico", and not only does his repository at https://github.com/CMedelR/dataCovid19 have a backup copy of the (at the moment) offline dataset at https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico, but his paper also explicitly mentions the use of Orange Data Mining!

While his dataset will be used as an additional test sample (previously the only one was from the Albert Einstein Hospital in São Paulo), we're also adding his paper, since I'm quite sure more people would like to find it later!

fititnt commented 3 years ago

The hxlquickmeta (cli tool) + HXLMeta (Usable Class) #9, while able to fall back to Pandas and then Orange Data Mining, still fails with something like hxlquickmeta tests/files/iris.csv.

I think that at least for very basic CSV files, the hxlquickmeta could implement the features of hxlquickimport.
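A plain CSV such as iris.csv fails because it has no row of hashtags near the top of the file. The sketch below shows that kind of check in a simplified form. It is not libhxl's actual implementation: the function name `find_hashtag_row`, the regular expression, and the rule "every non-empty cell must look like a hashtag" are assumptions made for illustration. Only the 25-row limit comes from the error message in this issue.

```python
import csv
import io
import re

# Simplified pattern for one HXL hashtag cell, e.g. "#item+name" or "#meta +id".
HASHTAG_CELL = re.compile(
    r"^\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*$"
)


def find_hashtag_row(csv_text: str, max_rows: int = 25):
    """Return the index of the first row that looks like HXL hashtags, or None."""
    reader = csv.reader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        if index >= max_rows:
            break  # give up after scanning the first max_rows rows
        cells = [cell for cell in row if cell.strip()]
        # Simplification: every non-empty cell must match the hashtag pattern.
        if cells and all(HASHTAG_CELL.match(cell) for cell in cells):
            return index
    return None


if __name__ == "__main__":
    print(find_hashtag_row("sepallength,class\n5.1,Iris-setosa\n"))
    print(find_hashtag_row("Name,URL\n#item +name,#item +url\na,b\n"))
```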


fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta tests/files/iris.csv
> Connection overview
 >> TODO: implement raw connection, HTTP headers, etc
 >>       (this should output debug information even
 >>       for inputs that would break libhxl)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...

 >> HXLMetaExtras: Pandas DataFrame 
   >>> DataFrame
     sepallength  sepalwidth  petallength  petalwidth           class
0            5.1         3.5          1.4         0.2     Iris-setosa
1            4.9         3.0          1.4         0.2     Iris-setosa
2            4.7         3.2          1.3         0.2     Iris-setosa
3            4.6         3.1          1.5         0.2     Iris-setosa
4            5.0         3.6          1.4         0.2     Iris-setosa
..           ...         ...          ...         ...             ...
145          6.7         3.0          5.2         2.3  Iris-virginica
146          6.3         2.5          5.0         1.9  Iris-virginica
147          6.5         3.0          5.2         2.0  Iris-virginica
148          6.2         3.4          5.4         2.3  Iris-virginica
149          5.9         3.0          5.1         1.8  Iris-virginica

[150 rows x 5 columns]
   >>> DataFrame.T
                     0            1            2            3            4            5    ...             144             145             146             147             148             149
sepallength          5.1          4.9          4.7          4.6          5.0          5.4  ...             6.7             6.7             6.3             6.5             6.2             5.9
sepalwidth           3.5          3.0          3.2          3.1          3.6          3.9  ...             3.3             3.0             2.5             3.0             3.4             3.0
petallength          1.4          1.4          1.3          1.5          1.4          1.7  ...             5.7             5.2             5.0             5.2             5.4             5.1
petalwidth           0.2          0.2          0.2          0.2          0.2          0.4  ...             2.5             2.3             1.9             2.0             2.3             1.8
class        Iris-setosa  Iris-setosa  Iris-setosa  Iris-setosa  Iris-setosa  Iris-setosa  ...  Iris-virginica  Iris-virginica  Iris-virginica  Iris-virginica  Iris-virginica  Iris-virginica

[5 rows x 150 columns]
   >>> DataFrame.describe
       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000

 >> HXLMetaExtras: Orange Data Mining
data.domain [sepallength, sepalwidth, petallength, petalwidth, class]
data.columns <Orange.data.table.Columns object at 0x7f416848cd30>
fititnt commented 3 years ago

> I think that at least for very basic CSV files, the hxlquickmeta could implement the features of hxlquickimport.

My last comment can be ignored; this may not actually be needed. As long as hxlquickmeta accepts stdin (can be piped to) and all the other tools work with pipes (the standard ones from HXLStandard do!), there is no need to implement this at all.

So instead of `hxlquickmeta tests/files/iris.csv`, just run `hxlquickimport tests/files/iris.csv | hxlquickmeta`.
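Piping only works if the receiving tool reads standard input when no file is named. A minimal sketch of that pattern is below. The helpers `open_input` and `first_line` are hypothetical and are not taken from hxlquickmeta. Treating `-` as "read stdin" is a common command-line convention that is assumed here.

```python
import io
import sys
from contextlib import contextmanager


@contextmanager
def open_input(path=None, stdin=None):
    """Yield a text stream: the named file, or stdin when no path (or "-") is given."""
    if path in (None, "-"):
        # The caller does not own stdin, so it is not closed here.
        yield stdin if stdin is not None else sys.stdin
    else:
        with open(path, encoding="utf-8", newline="") as handle:
            yield handle


def first_line(path=None, stdin=None):
    """Read the first line of the input, wherever it comes from."""
    with open_input(path, stdin) as stream:
        return stream.readline().rstrip("\n")


if __name__ == "__main__":
    # Simulate piped input with an in-memory stream.
    piped = io.StringIO("sepallength,class\n#item+sepallength,#item+class\n")
    print(first_line(stdin=piped))
```

The `stdin` parameter exists only so the sketch can be exercised without a real pipe.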

This makes hxlquickmeta fail

# Non HXLated file
hxlquickmeta tests/files/iris.csv
(...)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
(...)

This one works (but not for complex Excel files)

# Non HXLated file
hxlquickimport tests/files/iris.csv | hxlquickmeta
## (...)
> lihxl-python overview
 >> output.output <_io.TextIOWrapper name='/tmp/tmphdplthem' mode='w' encoding='UTF-8'>
 >> source <hxl.io.HXLReader object at 0x7fc33c008820>

> HXLMeta debuginfo
 >> HXLMeta.text_headers None
 >> HXLMeta.hxl_headers ['#item+sepallength', '#item+sepalwidth', '#item+petallength', '#item+petalwidth', '#item+class']
> get_hashtag_info [ #item+sepallength ] [ None ]
(...)

Potential problem with hxlquickmeta if it does not work with streams

I will make this comment on another issue, so the notes are kept for the future.

fititnt commented 3 years ago

hxlquickimport already has a working proof of concept and, since it is an all-in-one single file, it can work even without [meta issue] hxlm #11 or the [meta] hxlm.core. As long as the libraries it depends on are installed, you just need to put bin/hxlquickimport on your PATH.

If needed, this issue could be reopened, but the current version of bin/hxlquickimport (a single file) is mostly an hxltag with implicit defaults, and it could even be something I would propose adding to HXLStandard/libhxl-python.

Possible point to be done later (but not today)

Without actually doing a full refactoring to use something like hxlm.core (or something more 'pythonic'), maybe bin/hxlquickimport will be moved so that it is installed together with this repository via

pip install git+https://github.com/EticaAI/HXL-Data-Science-file-formats

With this, it would at least be more intuitive to explain another strategy for using these tools (and then "Minimal documentation about how to use the command line tools" #1 could be resolved).