EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

`hxlquickmeta` (cli tool) + HXLMeta (Usable Class) #9

Closed fititnt closed 3 years ago

fititnt commented 3 years ago

One feature of the HXLTabConverter common class (#8) actually requires knowing the expected data types of already HXLated datasets (we're already reading all the documentation to see how to make inferences without forcing users to add type hints everywhere). So, let's break this out into a separate class that can actually make these inferences [and, as much as possible, already try to use data structures that could be converted from JSON or something similar].

The more specific HXL Core hashtags

One advantage of using hashtags that are already defined in the specification itself is that, in several cases, the specification enforces the types. This happens especially for indicators. So it is actually possible (at least when not doing something like brute forcing with hxlquickimport) to be somewhat sure about what to expect from the data columns.
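A minimal sketch of that idea, assuming a small hand-written lookup table. The table and the function name `expected_type` are hypothetical and not part of libhxl-python or hxlquickmeta; the listed types should be checked against the HXL hashtag dictionary before being relied on.

```python
# Hypothetical lookup table: core hashtag -> data type suggested by the HXL standard.
# Only a few hashtags are listed; anything else falls back to "text".
EXPECTED_TYPE_BY_HASHTAG = {
    "#date": "date",
    "#affected": "number",
    "#inneed": "number",
    "#population": "number",
    "#reached": "number",
    "#targeted": "number",
    "#value": "number",
}


def expected_type(hxl_header: str) -> str:
    """Return the expected type for a full HXL header such as '#value+funding+total+usd'."""
    # The core hashtag is everything before the first attribute ('+').
    hashtag = hxl_header.strip().lower().split("+", 1)[0]
    return EXPECTED_TYPE_BY_HASHTAG.get(hashtag, "text")


print(expected_type("#value+funding+total+usd"))
print(expected_type("#org+name+funder"))
```

The hashtag alone is not always enough: a column tagged #value+funding+total+currency would be suggested as a number here even though it can hold currency codes such as CHF, so attributes would need to be considered as well.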

Which accuracy to aim for?

Note: "accuracy" in this case means that, when the user has not already explicitly enforced the "data types" or "data flags" on the source HXLated dataset, we suggest something that could be corrected.

In my personal, honest opinion, being right in >90% of the cases is good enough, including when making inferences beyond the official documentation (though at that point we may need to check at least a good number of rows to deduce the type). But there should be a way for users to enforce this explicitly (even if it means a more verbose attribute).
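As a rough sketch of what checking a good number of rows could look like: sample the first rows of a column and accept a type only when at least 90% of the non-empty cells agree. The function names, the sample size and the three type names are assumptions for illustration, not the HXLMeta implementation.

```python
from datetime import date


def looks_like_number(value: str) -> bool:
    try:
        float(value)
        return True
    except ValueError:
        return False


def looks_like_iso_date(value: str) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def suggest_type(values, threshold=0.9, sample_size=100):
    """Suggest 'number', 'date' or 'text', plus the share of sampled cells that agree."""
    # Empty cells are ignored so that sparse columns are not penalized.
    sample = [v.strip() for v in values[:sample_size] if v and v.strip()]
    if not sample:
        return "text", 0.0
    for name, check in (("number", looks_like_number), ("date", looks_like_iso_date)):
        share = sum(check(v) for v in sample) / len(sample)
        if share >= threshold:
            return name, share
    # Fallback: every non-empty cell is trivially valid text.
    return "text", 1.0


print(suggest_type(["2020-05-24", "2019-11-30", "2019-10-31"]))
print(suggest_type(["0", "0", "225990", ""]))
```

An explicit type given by the user would simply bypass this function.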

Maybe a different approach, which would tolerate even lower accuracy on the first try (think >75%, maybe less), is this: if it is possible to easily import back the exported format (think of the .tab from Orange Data Mining, but it could be Weka and others), we assume that the data types and data flags (Is it meta? Can it be ignored? Etc.) can be imported back with more data type hints, so that exporting again would not change them.

In other words: for very long spreadsheets, the export is already somewhat optimized to be corrected in an external program. (I think this is much more likely to happen for data flags than for data types; in fact, we may need to create some way to allow more than one target variable.)

How to warn about suggestions outside what is already strictly defined in the HXL Standard

Also, since the concept of debug logs already exists, I think we should still warn the user when our inferences on the tags are less than 90% sure (or when an analysis of 100 (or up to 10,000) rows shows that the user has literally done poor tagging and the file is 98% likely to fail in external data mining tools). This type of feature would be needed if trying to brute force with hxlquickimport, so at least some quick checks could already exist.
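One possible shape for such a warning, sketched with the standard `logging` module. The logger name, the function name and the message wording are hypothetical; only the 90% threshold comes from the discussion above.

```python
import logging

logger = logging.getLogger("hxlmeta_sketch")  # hypothetical logger name


def warn_on_weak_inference(hxl_header, suggested_type, share, threshold=0.9):
    """Log a warning and return True when an inference falls below the threshold."""
    if share < threshold:
        logger.warning(
            "%s: only %.0f%% of sampled values look like %s; "
            "consider tagging the type explicitly",
            hxl_header, share * 100, suggested_type,
        )
        return True
    return False


logging.basicConfig(level=logging.WARNING)
warn_on_weak_inference("#value+funding+total", "number", 0.5)
```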

fititnt commented 3 years ago

Oh boy, naming is complicated.

HXLMeta_Glossary

EticaAI-Data_HXL-Data-Science-file-formats has an HXLMeta_Glossary to describe concepts (actually, some sort of "ID" to reference in other documents).

This is still not the final result, but there are so many concepts in so many programs that, to make some sense of everything, I'm trying to put some names on them. Eventually we could come up with some nicer ways to refer to them.

References

Also, the table EticaAI-Data_HXL-Data-Science-file-formats_References has some of the documents I'm just leaving there. Some of these may be used later as references for the actual development.

TODO: software comparison

Maybe at some point it will be necessary to put some of the software we're testing in one place, but instead of having one spreadsheet for every piece of software, have some organized by how the terms relate to each other.

I mean: how HXL +text is used in other tools, and how what other tools use is converted back.

Another point is that some open source tools are actually more powerful than others and can open several proprietary formats, like Jamovi (https://www.jamovi.org/), so this table in particular could be used by users beyond us to check alternatives.


See:

Screenshot from 2021-02-11 22-41-05

fititnt commented 3 years ago

The EticaAI-Data_HXL-Data-Science-file-formats_HXLMeta_StatisticalType table is still a working draft, but both Statistical data type (https://en.wikipedia.org/wiki/Statistical_data_type) and Level of measurement (https://en.wikipedia.org/wiki/Level_of_measurement) may actually be worth using as the internal taxonomy, since they are the closest thing to a way of translating variable types between different software. Or, in other words: since most of these programs already rely on the math (but often use their own terms to refer to a group of the math taxonomy), we could internally use what is closest to the math taxonomy and then, per program, translate how each one uses it.

This is likely to make the initial draft harder (like, very hard), but I think it may make things much easier later.
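To make the idea concrete, here is a small sketch that uses the four classic levels of measurement as the internal taxonomy and translates them per program. The class, the table and the function are hypothetical and deliberately incomplete; the Orange values are the short type flags of the .tab format, and the Weka values name the ARFF attribute kinds.

```python
from enum import Enum


class Level(Enum):
    """Levels of measurement, used here as a hypothetical internal taxonomy."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Illustrative, incomplete translation tables (assumptions, not project code).
TRANSLATIONS = {
    # Orange .tab type flags: "d" = discrete, "c" = continuous.
    "orange": {
        Level.NOMINAL: "d",
        Level.ORDINAL: "d",
        Level.INTERVAL: "c",
        Level.RATIO: "c",
    },
    # Weka ARFF: nominal attributes (declared by listing their values) or numeric.
    "weka": {
        Level.NOMINAL: "nominal",
        Level.ORDINAL: "nominal",
        Level.INTERVAL: "numeric",
        Level.RATIO: "numeric",
    },
}


def translate(level: Level, program: str) -> str:
    """Translate an internal level of measurement to a program-specific type name."""
    return TRANSLATIONS[program][level]


print(translate(Level.RATIO, "orange"))
print(translate(Level.ORDINAL, "weka"))
```

The translation is lossy in this sketch (ordinal collapses into discrete or nominal), which is one reason to keep the richer taxonomy internally and only simplify at export time.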

Screenshot from 2021-02-16 00-45-40

fititnt commented 3 years ago

With the last commit, hxlquickmeta will by default use pandas to do a quick overview.

At this point, it still treats HXLated files as an average CSV, so pandas somewhat brute-forces what the columns mean. This is not ideal, but hxlquickmeta can already be used as a proxy to quickly analyse any remote dataset (I still need to test a bit more with CSVs that are not yet HXLated, but libhxl-python already tolerates CSV files if we use the logic of hxlquickimport #6).
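For readers who want to reproduce this kind of overview by hand, the following sketch reads a tiny HXLated CSV with pandas as if it were an ordinary CSV. The inline sample is made up for illustration (its values are taken from the output further down) and assumes the text headers sit on the first row and the hashtags on the second; it is not the hxlquickmeta code path.

```python
import io

import pandas as pd

# Tiny HXLated CSV: row 1 holds text headers, row 2 holds HXL hashtags.
HXLATED_CSV = """date,amountUSD,srcOrganization
#date,#value+funding+total+usd,#org+name+funder
2019-11-30,0,"Switzerland, Government of"
2019-10-31,225990,"United States of America, Government of"
"""

# skiprows=1 drops the text headers so that the hashtag row becomes the column names.
df = pd.read_csv(io.StringIO(HXLATED_CSV), skiprows=1)

print(df.dtypes)      # pandas guesses each dtype from the values alone
print(df.T)           # transposed view, one line per hashtag
print(df.describe())  # numeric summary of the columns pandas treated as numbers
```

Because pandas knows nothing about HXL here, #date stays a plain string column unless dates are parsed explicitly, and duplicated hashtags get a numeric suffix (as with #financial+contribution+type.1 in the output below).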

hxlquickmeta https://data.humdata.org/dataset/2d058968-9d7e-49a9-b28f-2895d7f6536f/resource/a12bad12-f5ea-493c-9faa-66cb3f3e9ca7/download/fts_incoming_funding_bra.csv

Screenshot from 2021-02-20 19-05-24 Screenshot from 2021-02-20 19-05-36

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta https://data.humdata.org/dataset/2d058968-9d7e-49a9-b28f-2895d7f6536f/resource/a12bad12-f5ea-493c-9faa-66cb3f3e9ca7/download/fts_incoming_funding_bra.csv
> Connection overview
 >> TODO: implement raw connection, HTTP headers, etc
 >>       (this should output debug information even
 >>       for inputs that would break libhxl)

> lihxl-python overview
 >> output.output <_io.TextIOWrapper name='/tmp/tmpkwejyuli' mode='w' encoding='UTF-8'>
 >> source <hxl.io.HXLReader object at 0x7f17b3d0fa00>

> HXLMeta debuginfo
 >> HXLMeta.text_headers ['date', 'budgetYear', 'description', 'amountUSD', 'srcOrganization', 'srcOrganizationTypes', 'srcLocations', 'srcUsageYearStart', 'srcUsageYearEnd', 'destPlan', 'destPlanCode', 'destPlanId', 'destOrganization', 'destOrganizationTypes', 'destGlobalClusters', 'destLocations', 'destProject', 'destProjectCode', 'destEmergency', 'destUsageYearStart', 'destUsageYearEnd', 'contributionType', 'flowType', 'method', 'boundary', 'onBoundary', 'status', 'firstReportedDate', 'decisionDate', 'keywords', 'originalAmount', 'originalCurrency', 'exchangeRate', 'id', 'refCode', 'createdAt', 'updatedAt']
 >> HXLMeta.hxl_headers ['#date', '#date+year+budget', '#description+notes', '#value+funding+total+usd', '#org+name+funder', '#org+type+funder+list', '#country+iso3+funder+list', '#date+year+start+funder', '#date+year+end+funder', '#activity+appeal+name', '#activity+appeal+id+external', '#activity+appeal+id+fts_internal', '#org+name+impl', '#org+type+impl+list', '#sector+cluster+name+list', '#country+iso3+impl+list', '#activity+project+name', '#activity+project+code', '#crisis+name', '#date+year+start+impl', '#date+year+end+impl', '#financial+contribution+type', '#financial+contribution+type', '#financial+method', '#financial+direction', '#financial+direction+type', '#status+text', '#date+reported', '#date+decision', '#description+keywords', '#value+funding+total', '#value+funding+total+currency', '#financial+fx', '#activity+id+fts_internal', '#activity+code', '#date+created', '#date+updated']

### (LONG LIST OMITTED) ####

 >> HXLMetaExtras: Pandas DataFrame 
   >>> DataFrame
        #date  #date+year+budget                                 #description+notes  #value+funding+total+usd  ... #activity+id+fts_internal #activity+code #date+created  #date+updated
0  2020-05-24                NaN  Venezuela Migrants Outflows Multiyear 2019 to ...                         0  ...                    210646            NaN    2020-05-24     2020-07-23
1  2019-11-30                NaN  Integral Protection and Humanitarian Assistanc...                         0  ...                    202767    7F-10139.02    2019-12-10     2020-10-20
2  2019-10-31                NaN  Economic Integration of Venezuelan Migrants an...                    225990  ...                    215375    PC-2020-001    2020-07-23     2020-07-23

[3 rows x 37 columns]
   >>> DataFrame.T
                                                                                  0                                                  1                                                  2
#date                                                                    2020-05-24                                         2019-11-30                                         2019-10-31
#date+year+budget                                                               NaN                                                NaN                                                NaN
#description+notes                Venezuela Migrants Outflows Multiyear 2019 to ...  Integral Protection and Humanitarian Assistanc...  Economic Integration of Venezuelan Migrants an...
#value+funding+total+usd                                                          0                                                  0                                             225990
#org+name+funder                  European Commission EuropeAid Development and ...                         Switzerland, Government of            United States of America, Government of
#org+type+funder+list                                            Inter-governmental                                         Government                                         Government
#country+iso3+funder+list                                                       NaN                                                CHE                                                USA
#date+year+start+funder                                                        2019                                               2019                                               2019
#date+year+end+funder                                                          2019                                               2019                                               2019
#activity+appeal+name                                                           NaN                                                NaN                                                NaN
#activity+appeal+id+external                                                    NaN                                                NaN                                                NaN
#activity+appeal+id+fts_internal                                                NaN                                                NaN                                                NaN
#org+name+impl                             International Organization for Migration  Comitato Internationale per lo Sviluppo dei Po...                  International Labour Organization
#org+type+impl+list                                                       UN agency                                                NGO                                          UN agency
#sector+cluster+name+list                                                       NaN                                                NaN                                       Multi-sector
#country+iso3+impl+list           ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME...  ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME...  ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME...
#activity+project+name                                                          NaN                                                NaN                                                NaN
#activity+project+code                                                          NaN                                                NaN                                                NaN
#crisis+name                      VENEZUELA Outflow - Regional Refugees and Migr...  VENEZUELA Outflow - Regional Refugees and Migr...  VENEZUELA Outflow - Regional Refugees and Migr...
#date+year+start+impl                                                          2019                                               2019                                               2021
#date+year+end+impl                                                            2021                                               2021                                               2021
#financial+contribution+type                                              financial                                          financial                                          financial
#financial+contribution+type.1                                               Parked                                             Parked                                           Standard
#financial+method                                                   Traditional aid                                    Traditional aid                                    Traditional aid
#financial+direction                                                       incoming                                           incoming                                           incoming
#financial+direction+type                                                    shared                                             shared                                             shared
#status+text                                                             commitment                                         commitment                                               paid
#date+reported                                                           2020-05-24                                         2019-12-05                                         2020-07-23
#date+decision                                                           2019-05-09                                                NaN                                         2019-10-31
#description+keywords                                                     Multiyear                                          Multiyear                                                NaN
#value+funding+total                                                            NaN                                                0.0                                                NaN
#value+funding+total+currency                                                   NaN                                                CHF                                                NaN
#financial+fx                                                                   NaN                                              0.992                                                NaN
#activity+id+fts_internal                                                    210646                                             202767                                             215375
#activity+code                                                                  NaN                                        7F-10139.02                                        PC-2020-001
#date+created                                                            2020-05-24                                         2019-12-10                                         2020-07-23
#date+updated                                                            2020-07-23                                         2020-10-20                                         2020-07-23
   >>> DataFrame.describe
       #date+year+budget  #value+funding+total+usd  #date+year+start+funder  #date+year+end+funder  ...  #date+year+end+impl  #value+funding+total  #financial+fx  #activity+id+fts_internal
count                0.0                  3.000000                      3.0                    3.0  ...                  3.0                   1.0          1.000                   3.000000
mean                 NaN              75330.000000                   2019.0                 2019.0  ...               2021.0                   0.0          0.992              209596.000000
std                  NaN             130475.387334                      0.0                    0.0  ...                  0.0                   NaN            NaN                6369.245717
min                  NaN                  0.000000                   2019.0                 2019.0  ...               2021.0                   0.0          0.992              202767.000000
25%                  NaN                  0.000000                   2019.0                 2019.0  ...               2021.0                   0.0          0.992              206706.500000
50%                  NaN                  0.000000                   2019.0                 2019.0  ...               2021.0                   0.0          0.992              210646.000000
75%                  NaN             112995.000000                   2019.0                 2019.0  ...               2021.0                   0.0          0.992              213010.500000
max                  NaN             225990.000000                   2019.0                 2019.0  ...               2021.0                   0.0          0.992              215375.000000
fititnt commented 3 years ago

Note: the comment https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6#issuecomment-782917161 also mentions the difference between working or not working with streaming.

Note: hxlquickmeta and classes like HXLMeta may not be fully compatible with streams / pipes

The way most commands (if not all, with the exception of one related to the spec) work with data streams / pipes is actually very memory and CPU efficient, but it is complicated (not because of HXL or the lib, but because of data streams) to get a full overview of a dataset while keeping 100% compatibility.

While I'm not proficient enough in Python to actually debug this (maybe there are ways, but I will focus on getting things working first), this note to self or to others is to explain that some features of hxlquickmeta may be easier to implement once a file is already saved on the local disk.

Implications

Slower

In general this would not be a problem for something that is not designed to be chained/piped, but at the bare minimum it is slower than all the other hxl cli commands.

Analyzed files may leave traces on disk

Another implication comes from the need to use temporary files: hxlquickmeta is unable to stream data without touching the disk (whereas, at least for sources that are not Excel or zipped, I think libhxl may not need to touch the disk at all if you just pipe commands). This could be a problem in a threat model where you handle sensitive documents on a computer where you have no control over the hardware (and the computer does not have a high chance of reusing the disk space). I think that by streaming data and just piping the output it is possible to leave no traces in temporary files on disk, but with this approach hxlquickmeta needs to save files, and even if they are deleted, access to the raw bits on disk could reveal old contents. The operating system may use RAM by default (but the current version does not allow the user to enforce this).
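One way this could be mitigated, sketched with the standard library's `tempfile.SpooledTemporaryFile`, which keeps data in memory until a size limit is exceeded and only then rolls over to a file on disk. The function name and the 10 MiB limit are hypothetical, and this is not what hxlquickmeta currently does.

```python
import tempfile

# Hypothetical limit: payloads up to 10 MiB stay in memory.
MAX_IN_MEMORY_BYTES = 10 * 1024 * 1024


def buffer_for_analysis(chunks, max_size=MAX_IN_MEMORY_BYTES):
    """Collect streamed byte chunks into a rewindable file-like object."""
    spool = tempfile.SpooledTemporaryFile(max_size=max_size, mode="w+b")
    for chunk in chunks:
        spool.write(chunk)
    spool.seek(0)  # rewind so the caller can read from the start
    return spool


with buffer_for_analysis([b"#date,#value\n", b"2020-05-24,0\n"]) as buffered:
    print(buffered.read().decode("utf-8"))
```

This only reduces the exposure: once the limit is exceeded the data is written to disk anyway, and memory can still be swapped out by the operating system, so the threat model above would need stronger guarantees than this sketch gives.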

HXLMeta and other classes may not extend libhxl-python hxl.model.Dataset (not just in the short term)

The natural inclination would be to extend hxl.model.Dataset. The reason I don't do this is mostly that I don't know how to make it compatible with data streaming at this moment (also, Python is not one of the languages I have the most production experience with).

The second reason is that at least some parts, which could change the behavior of the base classes too much, could break more easily after any libhxl-python update. Python, for example, allows access even to methods that programmers mark as private using the _ convention. So at least the parts that I think could break easily may have some copy and paste, and later, if needed, we do some refactoring. One potential advantage of this is that even older versions of these minimum viable products may actually keep working years later.

There are some parts that, inspired by pandas and other Python libraries, I think I will draft (like abstractions for data types, such as the one with the temporary name BooleanHXLtype), which libhxl-python does not have at the moment. But even if some of these could be added to libhxl-python later, maybe they would not be necessary for average usage, and they would very likely introduce many more bugs with data that is not already prepared. These new types would also increase memory usage, but since libhxl works by default with streamed data, I think that would not be an issue; the bugs would be.

fititnt commented 3 years ago

The same comments made here https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6#issuecomment-808873915 apply to hxlquickmeta, but to add more:

It's a bit sad that I did not take the time to explain what hxlquickmeta was, but anyway, we're already going even deeper with HDP #16.


Note to self: maybe in a few weeks I'll stop for 2~4 days and re-implement the debug features of hxlquickmeta as part of hxlm.core, but since the target audiences are different (hxlquickmeta is more on the data preparation side), this would need some name other than HDP so people are not scared off. I'm not kidding that hxlquickmeta even had a draft of hxl2pandas: Pandas DataFrame #4.