EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

`hxl2tab`: tab format, focused for compatibility with Orange Data Mining #2

Closed by fititnt 2 years ago

fititnt commented 3 years ago

TODO: add more information.

fititnt commented 3 years ago

The EticaAI-Data_HXL-Data-Science-file-formats_Tab already has a draft of a table that could be used to build an expert system without the need for full machine learning models.

But for this implementation, I think that we can simply implement both the more specific prefixes, like +vt_orange_, and maybe some special, more generic attributes to be used with #3, like one to mark the "class" (both Orange and Weka use class).

[Screenshot from 2021-01-25 23-36-09]

fititnt commented 3 years ago

OK. Interesting. Here is the Orange 'Simplified header' specification:

[Screenshot from 2021-01-27 23-05-53]


While not ideal, the HXLated output without text headers is actually pretty similar to what Orange would expect. The biggest difference is that Orange treats everything after the # as the textual header, but before the # it is possible to add a few extra short variables.

[Screenshot from 2021-01-27 23-01-11] [Screenshot from 2021-01-27 23-02-52]
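To make the prefix idea concrete, here is a minimal Python sketch that splits one simplified-header cell into its role flag, type flag, and name. The flag letters (c, m, i, w for the role; C, D, T, S for the type) come from my reading of the Orange documentation, and the function name `split_orange_header` is hypothetical; this is not code from hxl2tab.

```python
# Flag letters as described in Orange's simplified header format (assumption:
# taken from the Orange documentation, not from the hxl2tab source code).
ROLE_FLAGS = {"c": "class", "m": "meta", "i": "ignore", "w": "weight"}
TYPE_FLAGS = {"C": "continuous", "D": "discrete", "T": "time", "S": "string"}


def split_orange_header(cell):
    """Split a header cell such as 'cD#status' into (role, type, name).

    Cells without a valid one- or two-letter prefix before '#' are returned
    unchanged as the name, with role and type set to None.
    """
    prefix, sep, name = cell.partition("#")
    valid = len(prefix) <= 2 and all(
        ch in ROLE_FLAGS or ch in TYPE_FLAGS for ch in prefix
    )
    if not sep or not valid:
        return None, None, cell
    role = next((ROLE_FLAGS[ch] for ch in prefix if ch in ROLE_FLAGS), None)
    kind = next((TYPE_FLAGS[ch] for ch in prefix if ch in TYPE_FLAGS), None)
    return role, kind, name
```

With this reading, a plain HXL hashtag such as `#adm1+code` has an empty prefix, so everything after the `#` becomes the column name and no flags are set.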

fititnt commented 3 years ago

```
hxl2tab https://docs.google.com/spreadsheets/d/1Vqv6-EAdSHMSZvZtE426aXkDiwP8Mdrpft3tiGQ1RH0/edit#gid=0 temp/example-ebola-dataset-1_HXLated+tab.csv

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ head temp/example-ebola-dataset-1_HXLated+tab.csv
#status #country    #adm1   #adm1+code  #loc    #loc    #org    #loc+type   #affected+dead  #affected+confirmed #affected+suspected
Pending Liberia Margibi LR09    Kakata 1    Kakata 2 AFL    AFL ETC 0   0   0
Functional  Guinea  Nzerekore   GN008       Nzerekore   Ailema (?)  ETC 45  56  3
Pending Liberia River Gee   LR13    Fishtown    Fishtown ETC    American Red Cross  ETC 0   0   0
Functional  Sierra Leone    Western SL04    Jui Sierra Leone-China Friendship Hospital (Jui Hospital)   Chinese CDC ETC 47  65  17
Pending Guinea  Nzerekore   GN008           Croix-Rouge française  ETC 0   0   0
Pending Sierra Leone    Western SL04    Freetown    Goderich    EMERGENCY   ETC 0   0   0
Functional  Sierra Leone    Western SL04    Lakka   Lakka Hospital ETU  EMERGENCY Italian NGO   ETC 3   17  11
Functional  Liberia Margibi LR09    Firestone   Firestone Medical Center    Firestone Company   ETC 14  29  19
Functional  Liberia Montserrado LR11    Monrovia    Monrovia, Congo Town - Old Ministry of Defence ETU 1    FMT ETC 1   30  6
```

```
hxl2tab https://docs.google.com/spreadsheets/d/1Vqv6-EAdSHMSZvZtE426aXkDiwP8Mdrpft3tiGQ1RH0/edit#gid=0 temp/example-ebola-dataset-1_HXLated+tab_hxltabv15.tab

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ head temp/example-ebola-dataset-1_HXLated+tab_hxltabv15.tab
cD#status+vt_categorical+vt_class   D#country+vt_categorical    D#adm1+vt_categorical   D#adm1+code+vt_categorical  D#loc+vt_categorical    D#loc+vt_categorical    D#org+vt_categorical    #loc+type+vt_meta   C#affected+dead+number  C#affected+confirmed+number C#affected+suspected+number
Pending Liberia Margibi LR09    Kakata 1    Kakata 2 AFL    AFL ETC 0   0   0
Functional  Guinea  Nzerekore   GN008       Nzerekore   Ailema (?)  ETC 45  56  3
Pending Liberia River Gee   LR13    Fishtown    Fishtown ETC    American Red Cross  ETC 0   0   0
Functional  Sierra Leone    Western SL04    Jui Sierra Leone-China Friendship Hospital (Jui Hospital)   Chinese CDC ETC 47  65  17
Pending Guinea  Nzerekore   GN008           Croix-Rouge française  ETC 0   0   0
Pending Sierra Leone    Western SL04    Freetown    Goderich    EMERGENCY   ETC 0   0   0
Functional  Sierra Leone    Western SL04    Lakka   Lakka Hospital ETU  EMERGENCY Italian NGO   ETC 3   17  11
Functional  Liberia Margibi LR09    Firestone   Firestone Medical Center    Firestone Company   ETC 14  29  19
Functional  Liberia Montserrado LR11    Monrovia    Monrovia, Congo Town - Old Ministry of Defence ETU 1    FMT ETC 1   30  6
```

[Screenshot from 2021-02-06 19-25-58]
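The header row of the second output can be reproduced with a small mapping from HXL attributes to Orange prefixes. The sketch below is an assumption reverse-engineered from that one sample header (`+vt_class` gives `c`, `+vt_categorical` gives `D`, `+number` gives `C`, and `+vt_meta` is left unprefixed, as in the sample). It is not the actual bin/hxl2tab code, and `orange_header` and `convert_header_row` are hypothetical names.

```python
def orange_header(hxl_hashtag):
    """Prefix one HXL hashtag with Orange flags inferred from its attributes.

    Assumption: only the attributes visible in the sample output are handled.
    """
    attributes = hxl_hashtag.split("+")[1:]  # drop the '#tag' part
    role = "c" if "vt_class" in attributes else ""
    if "vt_categorical" in attributes:
        kind = "D"  # discrete
    elif "number" in attributes:
        kind = "C"  # continuous
    else:
        kind = ""  # e.g. +vt_meta stays unprefixed in the sample header
    return f"{role}{kind}{hxl_hashtag}"


def convert_header_row(hxl_row):
    """Apply orange_header to every cell of an HXL hashtag row."""
    return [orange_header(cell) for cell in hxl_row]
```

For example, `convert_header_row(["#status+vt_categorical+vt_class", "#affected+dead+number"])` gives the same first and ninth header cells as the `.tab` file shown above.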

fititnt commented 3 years ago

Hmm, from this semi-random Reddit thread I found https://github.com/hugapi/hug. So, in theory, it is possible to expose the CLI interface as a web app in a hackish way. At a bare minimum, this can help with passing Orange a URL (even if local) instead of manually saving the file with the CLI app.

The post cites other alternatives, but this one requires fewer dependencies and a small number of changes. Also, for some quick tests, if there is a need to quickly expose the URL without setting up a remote server, it would be possible to use ngrok (https://ngrok.com/). This may be useful if someone else needs something for a short period: any person from the community can just send a private URL from their computer and solve the issue until something better comes along.

[Screenshot from 2021-02-07 10-12-36]
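As a rough illustration of the "CLI as web app" idea, the sketch below wraps the hxl2tab command line (source first, then output path, as in the commands earlier in this thread) in a tiny WSGI app. It deliberately uses only the Python standard library (`wsgiref`) instead of hug, so it is not how hxl2tab itself does it; the `source_url` query parameter, port 8000, and all function names are assumptions.

```python
import subprocess
import tempfile
from pathlib import Path
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

PLAIN_TEXT = [("Content-Type", "text/plain; charset=utf-8")]


def convert_with_cli(source_url):
    """Run the hxl2tab CLI into a temporary file and return the .tab text."""
    with tempfile.TemporaryDirectory() as tmp:
        output = Path(tmp) / "output.tab"
        subprocess.run(["hxl2tab", source_url, str(output)], check=True)
        return output.read_text(encoding="utf-8")


def make_app(convert=convert_with_cli):
    """Build a WSGI app; `convert` is injectable so it can be replaced in tests."""

    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        source_url = query.get("source_url", [""])[0]
        if not source_url:
            start_response("400 Bad Request", PLAIN_TEXT)
            return [b"missing source_url parameter\n"]
        body = convert(source_url).encode("utf-8")
        start_response("200 OK", PLAIN_TEXT)
        return [body]

    return app


def serve(host="127.0.0.1", port=8000):
    """Serve the app locally until interrupted (hypothetical entry point)."""
    with make_server(host, port, make_app()) as server:
        server.serve_forever()
```

After calling `serve()`, a request to `http://127.0.0.1:8000/?source_url=...` (with the dataset URL percent-encoded) would return the same text the CLI writes to disk, which is the kind of URL that could then be handed to Orange or tunneled through ngrok.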

fititnt commented 2 years ago

A proof of concept has existed since at least v0.8.7.1 and is documented in the main README.md.

This can be used standalone, but it still requires the original dataset to already be valid HXL and to have some tags, like +vt_orange_flag_class, that work as hints for the export to Orange.

Trivia: hxlquickmeta is one way to automate how a dataset could be tagged for use with hxl2tab, which could be useful for very large datasets with many columns. However, changing the inner parts of bin/hxl2tab still requires editing Python code, unlike most other new tools here, which have ontologies fully configurable with YAML.


From the README:

1.2.2 hxl2tab: tab format, focused for compatibility with Orange Data Mining

What it does: hxl2tab uses an already HXLated dataset and then, based on #hashtag+attributes, generates an Orange Data Mining .tab format with extra hints.

hxl2tab v2.0 has some usable functionality to generate the file through a web interface instead of the CLI. It uses hug 🐨 🤗.

If you want to quickly expose it outside localhost, try ngrok.

Installation

This package can be installed by copying bin/hxl2tab to a place on your executable path and installing the dependencies manually.

The automated way, which installs it on your path as part of the Python PyPI package hdp-toolchain with the extra dependencies already included, is:

```
python3 -m pip install hdp-toolchain[hxl2tab]

# python3 -m pip install hdp-toolchain[full]
```