fititnt closed this issue 3 years ago
Thanks to @CMedelR!!!
Not only does Ramírez have a research paper called "Data mining for the study of the Epidemic (SARS-CoV-2) COVID-19: Algorithm for the identification of patients (SARS-CoV-2) COVID 19 in Mexico", and not only does his repository at https://github.com/CMedelR/dataCovid19 keep a backup copy of the (at the moment offline) dataset at https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico, but his paper also explicitly mentions the use of Orange Data Mining!
While his dataset will be used as an additional test sample (until now the only one was from the Albert Einstein Hospital in São Paulo), we're also adding his paper, since I'm quite sure more people would like to find it later!
The hxlquickmeta (cli tool) + HXLMeta (Usable Class) #9, while able to fall back to Pandas and then to Orange Data Mining, still fails with something like hxlquickmeta tests/files/iris.csv.
I think that, at least for very basic CSV files, hxlquickmeta could implement the features of hxlquickimport.
fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta tests/files/iris.csv
> Connection overview
>> TODO: implement raw connection, HTTP headers, etc
>> (this should output debug information even
>> for inputs that would break libhxl)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
>> HXLMetaExtras: Pandas DataFrame
>>> DataFrame
sepallength sepalwidth petallength petalwidth class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]
>>> DataFrame.T
0 1 2 3 4 5 ... 144 145 146 147 148 149
sepallength 5.1 4.9 4.7 4.6 5.0 5.4 ... 6.7 6.7 6.3 6.5 6.2 5.9
sepalwidth 3.5 3.0 3.2 3.1 3.6 3.9 ... 3.3 3.0 2.5 3.0 3.4 3.0
petallength 1.4 1.4 1.3 1.5 1.4 1.7 ... 5.7 5.2 5.0 5.2 5.4 5.1
petalwidth 0.2 0.2 0.2 0.2 0.2 0.4 ... 2.5 2.3 1.9 2.0 2.3 1.8
class Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa Iris-setosa ... Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica Iris-virginica
[5 rows x 150 columns]
>>> DataFrame.describe
sepallength sepalwidth petallength petalwidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
>> HXLMetaExtras: Orange Data Mining
data.domain [sepallength, sepalwidth, petallength, petalwidth, class]
data.columns <Orange.data.table.Columns object at 0x7f416848cd30>
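The HXLException in the output above is raised because iris.csv has no hashtag row near the top. As a rough illustration of that kind of check, here is a minimal sketch using only the standard library. It is not libhxl's actual implementation: the function name `find_hashtag_row`, the simplified regular expression, and the rule that every non-empty cell must look like a hashtag are all assumptions made for this example. Only the 25-row limit comes from the error message.

```python
import csv
import io
import re

# Simplified pattern for an HXL hashtag with optional attributes,
# e.g. "#item+sepallength". This is an assumption for illustration only.
HASHTAG = re.compile(r"^\s*#[A-Za-z][A-Za-z0-9_]*(\s*\+[A-Za-z][A-Za-z0-9_]*)*\s*$")


def find_hashtag_row(csv_text, max_rows=25):
    """Return the index of the first row that looks like a hashtag row, or None."""
    reader = csv.reader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        # Ignore empty cells; require every remaining cell to look like a hashtag.
        cells = [cell for cell in row if cell.strip()]
        if cells and all(HASHTAG.match(cell) for cell in cells):
            return index
    return None


plain = "sepallength,sepalwidth,class\n5.1,3.5,Iris-setosa\n"
tagged = (
    "sepallength,sepalwidth,class\n"
    "#item+sepallength,#item+sepalwidth,#item+class\n"
    "5.1,3.5,Iris-setosa\n"
)
print(find_hashtag_row(plain))   # None: no hashtag row, like iris.csv
print(find_hashtag_row(tagged))  # 1: the second row holds the hashtags
```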
My last comment can be ignored. Actually, this may not be needed. As long as hxlquickmeta accepts stdin (can be piped into) and all the other tools work with pipes (the standard ones from HXLStandard do!), there is no need to implement this at all.
So instead of hxlquickmeta tests/files/iris.csv, it is just hxlquickimport tests/files/iris.csv | hxlquickmeta
# Non HXLated file
hxlquickmeta tests/files/iris.csv
(...)
ERROR! libhxl and/or HXLmeta/HXLMetaExtras failed <HXLException: HXL tags not found in first 25 rows>
Ok. Trying harder now with HXLMetaExtras...
(...)
# Non HXLated file, piped through hxlquickimport first
hxlquickimport tests/files/iris.csv | hxlquickmeta
## (...)
> lihxl-python overview
>> output.output <_io.TextIOWrapper name='/tmp/tmphdplthem' mode='w' encoding='UTF-8'>
>> source <hxl.io.HXLReader object at 0x7fc33c008820>
> HXLMeta debuginfo
>> HXLMeta.text_headers None
>> HXLMeta.hxl_headers ['#item+sepallength', '#item+sepalwidth', '#item+petallength', '#item+petalwidth', '#item+class']
> get_hashtag_info [ #item+sepallength ] [ None ]
(...)
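To make the pipe-based approach more concrete, the sketch below shows a small filter that reads a plain CSV from a file or from stdin and writes it to stdout with a hashtag row added after the header. It is only an approximation inferred from the HXLMeta.hxl_headers output above (each header becoming `#item+<header>`), not the real bin/hxlquickimport. The names `to_hashtag`, `add_hashtag_row` and `main`, and the header clean-up rules, are hypothetical.

```python
import argparse
import csv
import re
import sys


def to_hashtag(header):
    """Build an '#item+<attribute>' hashtag from a plain column header."""
    attribute = re.sub(r"[^a-z0-9_]", "", header.strip().lower().replace(" ", "_"))
    # Assumption: fall back to a bare '#item' when no usable attribute remains
    # (empty header, or one that does not start with a letter).
    if not attribute[:1].isalpha():
        return "#item"
    return "#item+" + attribute


def add_hashtag_row(source, target):
    """Copy CSV rows from source to target, adding a hashtag row after the header."""
    reader = csv.reader(source)
    writer = csv.writer(target, lineterminator="\n")
    headers = next(reader, None)
    if headers is None:
        return  # empty input: nothing to write
    writer.writerow(headers)
    writer.writerow([to_hashtag(header) for header in headers])
    writer.writerows(reader)


def main(argv=None):
    """Read from the given path, or from stdin when no path (or '-') is given."""
    parser = argparse.ArgumentParser(description="Add a '#item+...' hashtag row to a CSV")
    parser.add_argument("infile", nargs="?", default="-", help="CSV path, or '-' for stdin")
    args = parser.parse_args(argv)
    if args.infile == "-":
        add_hashtag_row(sys.stdin, sys.stdout)
    else:
        with open(args.infile, newline="", encoding="utf-8") as handle:
            add_hashtag_row(handle, sys.stdout)
```

Calling `main()` at the bottom of a script would let it sit in the middle of a pipeline, which is the property the comment above relies on: each tool only needs to read stdin and write stdout.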
What happens if hxlquickmeta does not work with streams is something I will comment on in another issue, so the notes are kept for the future.
hxlquickimport already has a working proof of concept and, since it is an all-in-one single file, it can work even without [meta issue] hxlm #11 or the [meta] hxlm.core. As long as the libraries it depends on are installed, you just need to put bin/hxlquickimport on the working path.
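A quick way to verify that setup is to check whether the command resolves on PATH and whether the required modules can be imported. The sketch below uses only the standard library. The function name `check_environment` is hypothetical, and the default module list is an assumption: `hxl` is the module name visible in the output earlier in this thread, but the exact dependencies of bin/hxlquickimport are not listed here.

```python
import importlib.util
import shutil


def check_environment(command="hxlquickimport", modules=("hxl",)):
    """Report where `command` resolves on PATH and which top-level modules are importable."""
    return {
        # shutil.which returns the full path of the executable, or None if not found.
        "command_path": shutil.which(command),
        # find_spec returns None for a top-level module that is not installed.
        "modules": {
            name: importlib.util.find_spec(name) is not None for name in modules
        },
    }


report = check_environment()
print(report["command_path"])
print(report["modules"])
```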
If needed, this issue can be re-opened, but the current version of bin/hxlquickimport (a single file) is mostly an hxltag with implicit defaults; either way, it could be something I would propose adding to HXLStandard/libhxl-python.
Without actually doing a full refactoring to use something like hxlm.core (or something more 'pythonic'), maybe bin/hxlquickimport will be set up so that it gets installed along with this repository via
pip install git+https://github.com/EticaAI/HXL-Data-Science-file-formats
With this, it would at least be more intuitive to explain another strategy for using these tools (and then "Minimal documentation about how to use the command line tools" #1 could be solved).
Meta
Spreadsheet data
See EticaAI-Data_HXL-Data-Science-file-formats_hxlquickimport (https://docs.google.com/spreadsheets/d/1vFkBSharAEg5g5K2u_iDLCBvpWWPqpzC1hcL6QpFNZY/edit#gid=1097528220) for updated content. This is a snapshot.