fititnt closed this 3 years ago
Oh boy, the name is complicated.
EticaAI-Data_HXL-Data-Science-file-formats has an HXLMeta_Glossary to describe concepts (actually, some sort of "ID" to reference in other documents).
This is still not the final result, but there are so many concepts across so many programs that, to make some sense of everything, I'm trying to put some names on them. Eventually we could come up with nicer ways to refer to them.
Also, the table EticaAI-Data_HXL-Data-Science-file-formats_References has some documents I'm just leaving there. Some of these may be used later as references for the actual development.
Maybe at some point it will be necessary to list somewhere some of the software we're testing, but instead of having one spreadsheet for every piece of software, have a few organized by how the terms relate to each other.
I mean: how HXL +text is used in other tools, and how data from other tools can be converted back.
Another point is that some open source tools are actually more powerful than others and can open several proprietary formats, like Jamovi (https://www.jamovi.org/), so this table in particular could be used by users beyond us to check alternatives.
See:
The EticaAI-Data_HXL-Data-Science-file-formats_HXLMeta_StatisticalType table is still a working draft, but both Statistical data type (https://en.wikipedia.org/wiki/Statistical_data_type) and Level of measurement (https://en.wikipedia.org/wiki/Level_of_measurement) may actually be worth using as an internal taxonomy, since they are the closest thing to a way to translate variable types between different software. Or, in other words: since most programs already build on the math (though some use their own terms for a group of concepts from the math taxonomy), we could internally use whatever is closest to the math taxonomy and then, per program, translate how each one uses it.
This is likely to make the initial draft harder (like, very hard), but I think it may make things much easier later.
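A minimal sketch of the idea above, assuming an internal taxonomy based on level of measurement plus a per-program translation table. The names `Level`, `TRANSLATIONS` and `translate` are hypothetical, and the labels listed per program are illustrative and should be checked against each tool's documentation.

```python
from enum import Enum


class Level(Enum):
    """Internal taxonomy: the classic levels of measurement."""
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTERVAL = "interval"
    RATIO = "ratio"


# Hypothetical translation table: how each program could name each level.
# The labels are illustrative assumptions, not verified for every tool.
TRANSLATIONS = {
    "orange": {
        Level.NOMINAL: "discrete",
        Level.ORDINAL: "discrete",
        Level.INTERVAL: "continuous",
        Level.RATIO: "continuous",
    },
    "pandas": {
        Level.NOMINAL: "category",
        Level.ORDINAL: "category",  # ordering would have to be set separately
        Level.INTERVAL: "float64",
        Level.RATIO: "float64",
    },
}


def translate(level: Level, program: str) -> str:
    """Return the program-specific name for an internal level."""
    try:
        return TRANSLATIONS[program][level]
    except KeyError as err:
        raise ValueError(f"No translation for {level} in {program!r}") from err


if __name__ == "__main__":
    for program in TRANSLATIONS:
        print(program, translate(Level.ORDINAL, program))
```

Keeping the math-oriented taxonomy internal means that adding a new program only requires one more entry in the table, not a change to the types themselves.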
With the last commit, hxlquickmeta will by default use pandas to do a quick overview.
At this point, it still treats HXLated files as an average CSV, so pandas somewhat brute-forces what the columns mean. This is not ideal, but hxlquickmeta can already be used as a proxy to quickly analyse any remote dataset (I still need to test a bit more with CSVs that are not yet HXLated, but libhxl-python already tolerates such CSV files if we use the logic of hxlquickimport #6).
hxlquickmeta https://data.humdata.org/dataset/2d058968-9d7e-49a9-b28f-2895d7f6536f/resource/a12bad12-f5ea-493c-9faa-66cb3f3e9ca7/download/fts_incoming_funding_bra.csv
fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hxlquickmeta https://data.humdata.org/dataset/2d058968-9d7e-49a9-b28f-2895d7f6536f/resource/a12bad12-f5ea-493c-9faa-66cb3f3e9ca7/download/fts_incoming_funding_bra.csv
> Connection overview
>> TODO: implement raw connection, HTTP headers, etc
>> (this should output debug information even
>> for inputs that would break libhxl)
> lihxl-python overview
>> output.output <_io.TextIOWrapper name='/tmp/tmpkwejyuli' mode='w' encoding='UTF-8'>
>> source <hxl.io.HXLReader object at 0x7f17b3d0fa00>
> HXLMeta debuginfo
>> HXLMeta.text_headers ['date', 'budgetYear', 'description', 'amountUSD', 'srcOrganization', 'srcOrganizationTypes', 'srcLocations', 'srcUsageYearStart', 'srcUsageYearEnd', 'destPlan', 'destPlanCode', 'destPlanId', 'destOrganization', 'destOrganizationTypes', 'destGlobalClusters', 'destLocations', 'destProject', 'destProjectCode', 'destEmergency', 'destUsageYearStart', 'destUsageYearEnd', 'contributionType', 'flowType', 'method', 'boundary', 'onBoundary', 'status', 'firstReportedDate', 'decisionDate', 'keywords', 'originalAmount', 'originalCurrency', 'exchangeRate', 'id', 'refCode', 'createdAt', 'updatedAt']
>> HXLMeta.hxl_headers ['#date', '#date+year+budget', '#description+notes', '#value+funding+total+usd', '#org+name+funder', '#org+type+funder+list', '#country+iso3+funder+list', '#date+year+start+funder', '#date+year+end+funder', '#activity+appeal+name', '#activity+appeal+id+external', '#activity+appeal+id+fts_internal', '#org+name+impl', '#org+type+impl+list', '#sector+cluster+name+list', '#country+iso3+impl+list', '#activity+project+name', '#activity+project+code', '#crisis+name', '#date+year+start+impl', '#date+year+end+impl', '#financial+contribution+type', '#financial+contribution+type', '#financial+method', '#financial+direction', '#financial+direction+type', '#status+text', '#date+reported', '#date+decision', '#description+keywords', '#value+funding+total', '#value+funding+total+currency', '#financial+fx', '#activity+id+fts_internal', '#activity+code', '#date+created', '#date+updated']
### (LONG LIST OMMITED) ####
>> HXLMetaExtras: Pandas DataFrame
>>> DataFrame
#date #date+year+budget #description+notes #value+funding+total+usd ... #activity+id+fts_internal #activity+code #date+created #date+updated
0 2020-05-24 NaN Venezuela Migrants Outflows Multiyear 2019 to ... 0 ... 210646 NaN 2020-05-24 2020-07-23
1 2019-11-30 NaN Integral Protection and Humanitarian Assistanc... 0 ... 202767 7F-10139.02 2019-12-10 2020-10-20
2 2019-10-31 NaN Economic Integration of Venezuelan Migrants an... 225990 ... 215375 PC-2020-001 2020-07-23 2020-07-23
[3 rows x 37 columns]
>>> DataFrame.T
0 1 2
#date 2020-05-24 2019-11-30 2019-10-31
#date+year+budget NaN NaN NaN
#description+notes Venezuela Migrants Outflows Multiyear 2019 to ... Integral Protection and Humanitarian Assistanc... Economic Integration of Venezuelan Migrants an...
#value+funding+total+usd 0 0 225990
#org+name+funder European Commission EuropeAid Development and ... Switzerland, Government of United States of America, Government of
#org+type+funder+list Inter-governmental Government Government
#country+iso3+funder+list NaN CHE USA
#date+year+start+funder 2019 2019 2019
#date+year+end+funder 2019 2019 2019
#activity+appeal+name NaN NaN NaN
#activity+appeal+id+external NaN NaN NaN
#activity+appeal+id+fts_internal NaN NaN NaN
#org+name+impl International Organization for Migration Comitato Internationale per lo Sviluppo dei Po... International Labour Organization
#org+type+impl+list UN agency NGO UN agency
#sector+cluster+name+list NaN NaN Multi-sector
#country+iso3+impl+list ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME... ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME... ABW,ARG,BOL,BRA,CHL,COL,CRI,CUW,DOM,ECU,GUY,ME...
#activity+project+name NaN NaN NaN
#activity+project+code NaN NaN NaN
#crisis+name VENEZUELA Outflow - Regional Refugees and Migr... VENEZUELA Outflow - Regional Refugees and Migr... VENEZUELA Outflow - Regional Refugees and Migr...
#date+year+start+impl 2019 2019 2021
#date+year+end+impl 2021 2021 2021
#financial+contribution+type financial financial financial
#financial+contribution+type.1 Parked Parked Standard
#financial+method Traditional aid Traditional aid Traditional aid
#financial+direction incoming incoming incoming
#financial+direction+type shared shared shared
#status+text commitment commitment paid
#date+reported 2020-05-24 2019-12-05 2020-07-23
#date+decision 2019-05-09 NaN 2019-10-31
#description+keywords Multiyear Multiyear NaN
#value+funding+total NaN 0.0 NaN
#value+funding+total+currency NaN CHF NaN
#financial+fx NaN 0.992 NaN
#activity+id+fts_internal 210646 202767 215375
#activity+code NaN 7F-10139.02 PC-2020-001
#date+created 2020-05-24 2019-12-10 2020-07-23
#date+updated 2020-07-23 2020-10-20 2020-07-23
>>> DataFrame.describe
#date+year+budget #value+funding+total+usd #date+year+start+funder #date+year+end+funder ... #date+year+end+impl #value+funding+total #financial+fx #activity+id+fts_internal
count 0.0 3.000000 3.0 3.0 ... 3.0 1.0 1.000 3.000000
mean NaN 75330.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 209596.000000
std NaN 130475.387334 0.0 0.0 ... 0.0 NaN NaN 6369.245717
min NaN 0.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 202767.000000
25% NaN 0.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 206706.500000
50% NaN 0.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 210646.000000
75% NaN 112995.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 213010.500000
max NaN 225990.000000 2019.0 2019.0 ... 2021.0 0.0 0.992 215375.000000
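The pandas part of the overview above can be approximated with a short sketch. It assumes the HXLated CSV has the text headers in the first row and the hashtags in the second; the sample data and the function name `quick_overview` are made up for illustration, and this is not the actual hxlquickmeta code.

```python
import io

import pandas as pd

# Tiny stand-in for an HXLated CSV: first row has text headers,
# second row has the HXL hashtags (note the repeated hashtag).
HXLATED_CSV = """\
date,amountUSD,contributionType,flowType
#date,#value+funding+total+usd,#financial+contribution+type,#financial+contribution+type
2020-05-24,0,financial,Parked
2019-11-30,0,financial,Parked
2019-10-31,225990,financial,Standard
"""


def quick_overview(csv_text: str) -> pd.DataFrame:
    """Load an HXLated CSV as plain CSV, using the hashtag row as column names."""
    # header=1 makes pandas use the second row (the hashtags) as the header
    # and discard the text-header row above it.
    return pd.read_csv(io.StringIO(csv_text), header=1)


if __name__ == "__main__":
    df = quick_overview(HXLATED_CSV)
    print(df)             # the DataFrame itself
    print(df.T)           # transposed view, one hashtag per line
    print(df.describe())  # summary statistics for numeric columns only
```

Because pandas only sees a plain CSV here, a repeated hashtag is renamed with a numeric suffix, which is why `#financial+contribution+type.1` shows up in the transposed output above.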
Note: the comment https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6#issuecomment-782917161 also mentions the difference between working or not with streaming.
hxlquickmeta and classes like HXLMeta may not be fully compatible with streams / pipes
The way most hxl cli commands (if not all, with the exception of one related to the spec) work with data streams / pipes is actually very memory and CPU efficient, but it is complicated (not because of HXL or the lib, but because of data streams) to have a full overview of a dataset while keeping 100% compatibility.
While I'm not proficient enough in Python to actually debug this (maybe there are ways, but I will focus on getting things working first), this note to self or to others is to explain that some features of hxlquickmeta may be easier to implement once a file is already saved on the local disk.
In general this would not be a problem for something that is not designed to be chained/piped, but at the bare minimum it is slower than all the other hxl cli commands.
Another implication is that, because it needs temporary files, hxlquickmeta is unable to stream data without touching the disk (whereas, at least for non-Excel and non-zipped sources, I think libhxl may not need to touch the disk at all if commands are just piped). This could be a problem in a threat model where you handle sensitive documents on a computer where you have no control over the hardware (and the computer is unlikely to overwrite that disk space soon). I think that by streaming data and just piping the output it is possible to leave no traces in temporary files on disk, but with the current approach hxlquickmeta needs to save files, and even if they are deleted, access to the raw bits on disk could recover old contents. The operating system may use RAM by default (but the current version does not allow the user to enforce this).
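One possible direction for the temporary-file concern above is to buffer the input in RAM and refuse inputs that are too large, so nothing is silently written to a temporary file. This is only a sketch under that assumption: `buffer_in_memory`, `rows` and `TooLargeForMemory` are hypothetical names, none of this is part of hxlquickmeta or libhxl-python, and the operating system may still swap memory to disk.

```python
import csv
import io
from typing import Iterable, Iterator, List


class TooLargeForMemory(Exception):
    """Raised instead of silently spilling data to a temporary file."""


def buffer_in_memory(chunks: Iterable[bytes],
                     max_bytes: int = 50 * 1024 * 1024) -> io.BytesIO:
    """Collect a byte stream into RAM, refusing inputs above max_bytes."""
    buffer = io.BytesIO()
    total = 0
    for chunk in chunks:
        total += len(chunk)
        if total > max_bytes:
            raise TooLargeForMemory(f"input exceeds {max_bytes} bytes")
        buffer.write(chunk)
    buffer.seek(0)  # rewind so the data can be read from the start
    return buffer


def rows(buffer: io.BytesIO) -> Iterator[List[str]]:
    """Parse the buffered bytes as UTF-8 CSV, always from the beginning."""
    text = buffer.getvalue().decode("utf-8")
    yield from csv.reader(io.StringIO(text))


if __name__ == "__main__":
    stream = [b"#org,#value\n", b"IOM,10\n"]
    data = buffer_in_memory(stream)
    print(list(rows(data)))
    print(list(rows(data)))  # a second full pass works without re-downloading
```

The trade-off is the one described above: a full overview needs more than one pass over the data, so the whole dataset has to live somewhere, either in RAM (with a size limit) or on disk.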
The natural inclination would be to extend hxl.model.Dataset. The reason I do not do this is mostly that I don't know how to make it compatible with data streaming at this moment (also, Python is not one of the languages I have the most production experience in).
The second reason is that parts that change the behavior of the base classes too much could break more easily after any libhxl-python update. In Python, for example, while programmers use the _ prefix convention to mean a private method, the language actually allows access even to these methods. So at least the parts that I think could break easily may have some copy and paste, and later, if needed, we do some refactoring. One potential advantage of this is that even older versions of these minimum viable products may keep working years later.
There are some parts that I will draft, inspired by Pandas and other Python libraries (like an abstraction for data types, such as the one with the temporary name BooleanHXLtype), which libhxl-python does not have at the moment. But even if some of these could be added to libhxl-python later, maybe they would not be necessary for average usage, and they would very likely introduce many more bugs with data that is not already prepared. These new types would also increase memory usage, but since libhxl works by default with streamed data, I think that would not be an issue; the bugs would.
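To make the data type abstraction more concrete, here is a rough sketch of what something like BooleanHXLtype could look like. Only the class name comes from the draft; the accepted spellings and the `parse` and `coverage` methods are assumptions made for illustration.

```python
from typing import Iterable, Optional


class BooleanHXLtype:
    """Hypothetical boolean type; the accepted spellings are assumptions."""

    TRUE_VALUES = frozenset({"true", "yes", "y", "1"})
    FALSE_VALUES = frozenset({"false", "no", "n", "0"})

    @classmethod
    def parse(cls, raw: Optional[str]) -> Optional[bool]:
        """Return True/False, or None when the cell is empty or unrecognized."""
        if raw is None:
            return None
        value = raw.strip().lower()
        if value in cls.TRUE_VALUES:
            return True
        if value in cls.FALSE_VALUES:
            return False
        return None

    @classmethod
    def coverage(cls, cells: Iterable[Optional[str]]) -> float:
        """Share of non-empty cells that parse as a boolean."""
        filled = [c for c in cells if c is not None and c.strip()]
        if not filled:
            return 0.0
        parsed = sum(1 for c in filled if cls.parse(c) is not None)
        return parsed / len(filled)


if __name__ == "__main__":
    print(BooleanHXLtype.parse(" Yes "))
    print(BooleanHXLtype.coverage(["yes", "no", "", "maybe"]))
```

Returning None for unrecognized cells, instead of raising, is one way to tolerate data that is not already prepared, which is where I expect most of the bugs to come from.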
The same comments made in https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/6#issuecomment-808873915 apply to hxlquickmeta, but to add more:
- hxl/hxlquickmeta already has a usable proof of concept.
- hxlquickimport: even if not all features get refactored, I think this one is really worth deploying as a PyPI package.
- hxl/hxlquickmeta and what was at that time called 'HXLMeta' became the seed for other, more structured approaches.
- bin/hxl2example (inspired by the fantastic implementation by David Megginson in HXLStandard/libhxl-python/hxl/scripts.py!): it's feasible to do more single all-in-one scripts!
It's a bit sad that I did not take the time to explain what hxlquickmeta was, but anyway, we're already going even deeper with HDP #16.
Note to self: maybe in some weeks I'll stop for 2~4 days and re-implement the debug features of hxlquickmeta as part of hxlm.core, but since the target audiences are different (hxlquickmeta is more on the data preparation side), this would need some other name than HDP so people would not be scared. I'm not kidding: hxlquickmeta even had a draft of hxl2pandas: Pandas DataFrame #4.
One feature of the HXLTabConverter common class #8 (since we're already reading all the documentation to see how to make inferences without forcing users to use type hints everywhere) actually requires knowing the expected data types of already HXLated datasets. So, let's break this out into a separate class [and, as much as possible, already try to use data structures that could be converted from JSON or something similar] to create something that can actually make these inferences.
The more specific HXL Core hashtags
One advantage of using hashtags exactly as defined in the specification is that, in several cases, the specification enforces the types. This happens especially for indicators. So it is actually possible (at least when not doing something like brute forcing with hxlquickimport) to be somewhat sure about what to expect from the data columns.
Which accuracy to aim for?
In my personal, honest opinion, >90% of the cases is good enough, including making inferences beyond the official documentation (but at that point we may need to check at least a good number of rows to deduce the type). But there should be a way that allows users to explicitly enforce a type (even if it means a more verbose attribute).
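The combination described above (hashtag as a hint, values as a check, explicit user override) could be sketched as follows. The hint table is an illustrative subset that should be checked against the HXL hashtag dictionary, and `HASHTAG_HINTS`, `VALIDATORS` and `infer_type` are hypothetical names.

```python
import datetime
from typing import Callable, Dict, Iterable, Optional

# Illustrative subset only: which hashtags imply which type is an assumption
# here and should be checked against the HXL hashtag dictionary.
HASHTAG_HINTS: Dict[str, str] = {
    "#date": "date",
    "#value": "number",
    "#affected": "number",
    "#population": "number",
}


def _is_number(cell: str) -> bool:
    try:
        float(cell)
        return True
    except ValueError:
        return False


def _is_date(cell: str) -> bool:
    try:
        datetime.date.fromisoformat(cell)
        return True
    except ValueError:
        return False


VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "number": _is_number,
    "date": _is_date,
}


def infer_type(hashtag: str, cells: Iterable[str], threshold: float = 0.9,
               enforced: Optional[str] = None) -> str:
    """Guess a column type from its hashtag, confirmed by the sampled values."""
    if enforced is not None:
        return enforced  # an explicit user choice always wins
    base_tag = hashtag.split("+", 1)[0].strip().lower()
    hint = HASHTAG_HINTS.get(base_tag)
    filled = [c.strip() for c in cells if c and c.strip()]
    if not filled:
        return hint or "text"
    # With a hint, only that type is checked; without one, try every validator.
    candidates = [hint] if hint else list(VALIDATORS)
    for candidate in candidates:
        check = VALIDATORS[candidate]
        share = sum(1 for c in filled if check(c)) / len(filled)
        if share >= threshold:
            return candidate
    return "text"


if __name__ == "__main__":
    print(infer_type("#value+funding+total+usd", ["0", "0", "225990"]))
    print(infer_type("#value", ["n/a"], enforced="number"))
```

The `threshold` parameter maps to the ">90% is good enough" idea: a column keeps the hinted type only when at least that share of the non-empty sampled cells validates.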
Maybe a different approach, tolerating even less accuracy on the first try (think >75%, maybe less): if it is possible to easily import back the exported format (think of the .tab from Orange Data Mining, but it could be Weka and others), we assume that the data types and data flags (is it meta? can it be ignored? etc.) could be imported back with more data type hints, such that exporting again would not change them.
In other words: for very long spreadsheets, the output is already somewhat optimized to be corrected in an external program. (I think this is much more likely to happen for data flags than for data types; in fact, we may need to create some way to allow more than one target variable.)
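The round trip described above can be sketched with a tab-separated file that carries names, types and flags in three header rows, which is how I understand the Orange .tab layout. The function names are hypothetical, the type and flag values are examples, and this has not been tested against Orange itself.

```python
import csv
import io
from typing import List, Sequence, Tuple


def to_tab(names: Sequence[str], types: Sequence[str], flags: Sequence[str],
           rows: Sequence[Sequence[str]]) -> str:
    """Write names, types and flags as three header rows, then the data."""
    if not (len(names) == len(types) == len(flags)):
        raise ValueError("names, types and flags must have the same length")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(names)
    writer.writerow(types)  # e.g. continuous, discrete, string
    writer.writerow(flags)  # e.g. class, meta, ignore, or empty
    writer.writerows(rows)
    return out.getvalue()


def from_tab(text: str) -> Tuple[List[str], List[str], List[str], List[List[str]]]:
    """Read back the three header rows and the data rows."""
    names, types, flags, *rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return names, types, flags, rows


if __name__ == "__main__":
    exported = to_tab(
        ["#org+name+impl", "#value+funding+total+usd", "#status+text"],
        ["string", "continuous", "discrete"],
        ["meta", "", "class"],
        [["IOM", "0", "commitment"], ["ILO", "225990", "paid"]],
    )
    # If the user corrected types or flags in the external program,
    # importing and exporting again should reproduce the same file.
    print(to_tab(*from_tab(exported)) == exported)
```

The point of the sketch is the stability check at the end: once the hints have been corrected externally, a further export should not change them.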
How to warn about suggestions outside what is already strictly defined in the HXL Standard
Also, since the concept of debug logs already exists, I think that when our inferences on the tags are below 90% (or maybe when an analysis of 100 (or up to 10,000) rows shows that the user literally did poor tagging and this is 98% likely to fail in external data mining tools), we should still warn the user. (This type of feature would be needed when trying to brute force with hxlquickimport, so at least some quick checks could already exist.)
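A minimal sketch of such a warning, using Python's standard logging module. The logger name, the function `warn_on_poor_tagging` and its defaults (a sample of 100 rows, a 90% threshold) are assumptions for illustration, not existing hxlquickmeta behavior.

```python
import logging
from typing import Callable, Iterable, List

logger = logging.getLogger("hxlquickmeta.sketch")  # hypothetical logger name


def warn_on_poor_tagging(hashtag: str, cells: Iterable[str],
                         check: Callable[[str], bool],
                         sample_size: int = 100,
                         threshold: float = 0.9) -> float:
    """Check a sample of non-empty cells and warn if too few match the type."""
    sample: List[str] = []
    for cell in cells:
        if cell and cell.strip():
            sample.append(cell.strip())
        if len(sample) >= sample_size:
            break  # stop early so very long columns are not fully scanned
    if not sample:
        logger.warning("%s: no non-empty values in the sample", hashtag)
        return 0.0
    share = sum(1 for c in sample if check(c)) / len(sample)
    if share < threshold:
        logger.warning(
            "%s: only %.0f%% of %d sampled values match the expected type",
            hashtag, share * 100, len(sample),
        )
    return share


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    warn_on_poor_tagging("#value", ["1", "x", "3", "y"], str.isdigit)
```

Since only a sample is read, this kind of check could run before the full conversion, which is the "quick check" use case mentioned for hxlquickimport.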