fititnt opened 3 years ago
Maybe this:
hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
and a file like this:
- hsilo: "test1"
  hrecipe:
    - id: recipe1
      source:
        - iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0
      filters:
        - filter: with_columns
          with_columns: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          without_rows: "#vocab+code+v_6391="
could be a good starting point. But from my experience with Ansible (and very, very large Ansible playbooks), we could, from the start, allow parsing several YAML files at once and just output all the JSON specs one after another.
But to get to that point, hdpcli needs to implement some way to at least concatenate more than one YAML file (the part about include_file options may be something for later).
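As a rough illustration of that concatenation idea (this is a sketch only, not hdpcli's actual code; the function names are made up, and the inputs stand in for what yaml.safe_load() would return for each file):

```python
import json

def concat_hdp_documents(parsed_files):
    """Flatten the top-level lists of several already-parsed HDP YAML files.

    `parsed_files` is a list of Python lists, e.g. what yaml.safe_load()
    returns for each file (names here are illustrative, not hdpcli's API).
    """
    merged = []
    for document in parsed_files:
        merged.extend(document)
    return merged

def dump_specs_line_by_line(merged):
    """Emit one JSON document per line, matching the idea of outputting
    all the JSON specs one after another."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in merged)

# Usage with two stand-in "files":
file_a = [{"hsilo": "test1"}]
file_b = [{"hsilo": "test2"}]
print(dump_specs_line_by_line(concat_hdp_documents([file_a, file_b])))
```

One JSON document per line keeps the output streamable, so downstream tools (jq, hxlspec) can consume each spec independently.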
This is the YAML file (there is some extra markup, but ignore it for now).
---
- hsilo:
    name: "test1"
    desc: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      iri_example:
        - iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
          sheet_index: 1
      recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
This is the JSON spec result:
{
  "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
  "recipe": [
    {
      "aggregators": [
        "sum(population) as Population#population"
      ],
      "filter": "count",
      "patterns": "adm1+name,adm1+code"
    },
    {
      "filter": "clean_data",
      "number": "population",
      "number_format": ".0f"
    }
  ],
  "sheet_index": 1
}
ERROR (hxl.io): Skipping column(s) with malformed hashtag specs: #
Gov,Gov Pcode,Population
#adm1+name,#adm1+code,#population
Abyan,YE12,618892
Ad Dali',YE30,818507
Aden,YE24,1053455
Al Bayda,YE14,795107
Al Hodeidah,YE18,2996334
Al Jawf,YE16,633596
Al Maharah,YE28,175606
Al Mahwit,YE27,770920
Amran,YE29,1221908
Dhamar,YE20,2194159
Hadramawt,YE19,1551347
Hajjah,YE17,2630678
Ibb,YE11,3143818
Lahj,YE25,1076296
Ma'rib,YE26,1086663
Raymah,YE31,562930
Sa'dah,YE22,934201
Sana'a,YE23,1370798
Sana'a City,YE13,3296342
Shabwah,YE21,676408
Socotra,YE32,69004
Ta'iz,YE15,3104579
And redirecting to the hxlspec command line actually worked. Just a quick warning, but it worked!
Version: 4.22
We need some way to create 'inline' data tables, so there is a way to test whether an HXL data processing spec is working (and it needs to work offline).
This implies adding some new attributes, in particular the concepts of inline data and expected result data. Or maybe the concept of an 'example'.
Current example
# yaml-language-server: $schema=https://raw.githubusercontent.com/EticaAI/HXL-Data-Science-file-formats/main/hxlm/core/schema/hdp.json-schema.json
# How to run this file? Version tested: v0.7.4
# @see https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/14#issuecomment-798454298
# To inspect the result (pretty print)
# hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
# To pipe the result direct to hxlspec (first item of array, use jq '.[0]')
# hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[0]' | hxlspec
# To pipe the result direct to hxlspec (second item of array, use jq '.[1]')
# hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[1]' | hxlspec
---
# See also https://proxy.hxlstandard.org/api/from-spec.html
# http://json-schema.org/understanding-json-schema/
# Test schema online https://www.jsonschemavalidator.net/
# Validate schema here: https://www.json-schema-linter.com/
# TODO: better validate HERE https://jsonschemalint.com/#!/version/draft-07/markup/json
- hsilo: "test1"
  hrecipe:
    - id: recipe1
      _recipe:
        - filter: with_columns
          includes: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          queries: "#vocab+code+v_6391="
      exemplum:
        - fontem:
            iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0
- hsilo:
    nomen: "test1"
    descriptionem: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      exemplum:
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
          _sheet_index: 1
[
  {
    "input": "https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0",
    "recipe": [
      {
        "filter": "with_columns",
        "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
      },
      {
        "filter": "without_rows",
        "queries": "#vocab+code+v_6391="
      }
    ]
  },
  {
    "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
    "recipe": [
      {
        "aggregators": [
          "sum(population) as Population#population"
        ],
        "filter": "count",
        "patterns": "adm1+name,adm1+code"
      },
      {
        "filter": "clean_data",
        "number": "population",
        "number_format": ".0f"
      }
    ],
    "sheet_index": 1
  }
]
Now, for hdpcli --export-to-hxl-json-processing-specs to generate the input parameter specified by the HXL data processing specs, the source should be the first item of an array. In the internal language, this means putting it in something like hrecipe.[0].exemplum.[0].fontem.iri instead of hrecipe.[0].iri_example.[0].iri.
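A sketch of that export behavior (illustrative Python only, not hdpcli's real implementation; the field names follow the YAML above, and the function name is made up): each hrecipe entry becomes one spec without any example input, plus one spec per exemplum with "input" taken from exemplum.[N].fontem.iri.

```python
def export_recipe(hrecipe_item):
    """Export one hrecipe entry as a list of HXL JSON processing specs.

    The first spec has no example input; each exemplum then yields a
    second spec with "input" (and optionally "sheet_index") filled in.
    Hypothetical sketch, not hdpcli's actual code.
    """
    base_spec = {"recipe": hrecipe_item["_recipe"]}
    specs = [base_spec]
    for exemplum in hrecipe_item.get("exemplum", []):
        spec = dict(base_spec)  # shallow copy; the recipe list is shared
        fontem = exemplum.get("fontem", {})
        if "iri" in fontem:
            spec["input"] = fontem["iri"]
        if "_sheet_index" in exemplum:
            spec["sheet_index"] = exemplum["_sheet_index"]
        specs.append(spec)
    return specs
```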
The idea of using 'exemplum' is that, if one goal of recipes is reusability, any input data there would be... just an example/reference.
The impact of this is that now the first item when exporting will always be without example inputs, while the second one will be like before.
'input_data' is a way to express input data inline (so no external link is needed), and 'output_data' is for the case where we eventually implement some way for a recipe to be tested on different proxies.
Also, 'input_data' / 'output_data', even if ignored by the HXL data processing specs, can help someone get an idea of what a recipe would do just by looking at one YAML file. (OK, the real idea is to test whether it actually works, but at least for documentation it already serves!)
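The unit-test idea could look something like this (a sketch only, assuming the inline-table structure shown in the YAML below; `run_recipe` is a hypothetical callable standing in for whatever actually executes the HXL processing spec):

```python
def check_exemplum(exemplum, run_recipe):
    """Return True when running the recipe over the inline input table
    (fontem.datum) produces exactly the inline expected table
    (objectivum.datum). `run_recipe` is a hypothetical stand-in for the
    real spec executor; nothing here touches the network, so it works
    offline.
    """
    input_table = exemplum["fontem"]["datum"]
    expected_table = exemplum["objectivum"]["datum"]
    return run_recipe(input_table) == expected_table
```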
# cd tests/hrecipe
# hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml
# hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml | jq '.[1]' | hxlspec
---
- hsilo:
    nomen: hello-world.hrecipe.hdp.yml
    linguam: mul # https://iso639-3.sil.org/code/mul
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      # iri_example:
      #   - iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
      #     sheet_index: 1
      exemplum:
        # Example one
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
          _sheet_index: 1
        # Example two also includes inline data
        - fontem:
            # Note: fontem.datum is not fully implemented. The idea here is
            # to be able to create an ad-hoc table instead of using an
            # external input, either to show a quick example or to act
            # as some sort of unit test for an HXL data processing spec!
            datum:
              - ["header 1", "header 2", "header 3"]
              - ["#item +id", "#item +name", "#item +value"]
              - ["ACME1", "ACME Inc.", "123"]
              - ["XPTO1", "XPTO org", "456"]
          objectivum:
            # Note: objectivum is not fully implemented either. The idea here
            # is (like fontem.datum) to work as an ad-hoc table, but really
            # to allow creating some sort of unit test for an HXL
            # data processing spec!
            datum:
              - ["header 1", "header 2", "header 3"]
              - ["#item +id", "#item +name", "#item +value"]
              - ["ACME1", "ACME Inc.", "123"]
              - ["XPTO1", "XPTO org", "456"]
[
  {
    "recipe": [
      {
        "aggregators": [
          "sum(population) as Population#population"
        ],
        "filter": "count",
        "patterns": "adm1+name,adm1+code"
      },
      {
        "filter": "clean_data",
        "number": "population",
        "number_format": ".0f"
      }
    ]
  },
  {
    "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
    "recipe": [
      {
        "aggregators": [
          "sum(population) as Population#population"
        ],
        "filter": "count",
        "patterns": "adm1+name,adm1+code"
      },
      {
        "filter": "clean_data",
        "number": "population",
        "number_format": ".0f"
      }
    ],
    "sheet_index": 1
  },
  {
    "input_data": [
      [
        "header 1",
        "header 2",
        "header 3"
      ],
      [
        "#item +id",
        "#item +name",
        "#item +value"
      ],
      [
        "ACME1",
        "ACME Inc.",
        "123"
      ],
      [
        "XPTO1",
        "XPTO org",
        "456"
      ]
    ],
    "output_data": [
      [
        "header 1",
        "header 2",
        "header 3"
      ],
      [
        "#item +id",
        "#item +name",
        "#item +value"
      ],
      [
        "ACME1",
        "ACME Inc.",
        "123"
      ],
      [
        "XPTO1",
        "XPTO org",
        "456"
      ]
    ],
    "recipe": [
      {
        "aggregators": [
          "sum(population) as Population#population"
        ],
        "filter": "count",
        "patterns": "adm1+name,adm1+code"
      },
      {
        "filter": "clean_data",
        "number": "population",
        "number_format": ".0f"
      }
    ]
  }
]
# yaml-language-server: $schema=https://raw.githubusercontent.com/EticaAI/HXL-Data-Science-file-formats/main/hxlm/core/schema/hdp.json-schema.json
# How to run this file? Version tested: v0.7.4
# @see https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/14#issuecomment-798454298
# To inspect the result (pretty print)
# hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
# To pipe the result direct to hxlspec (second item of array, use jq '.[1]')
# hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[1]' | hxlspec
# To pipe the result direct to hxlspec (fourth item of array, use jq '.[3]')
# hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[3]' | hxlspec
---
# See also https://proxy.hxlstandard.org/api/from-spec.html
# http://json-schema.org/understanding-json-schema/
# Test schema online https://www.jsonschemavalidator.net/
# Validate schema here: https://www.json-schema-linter.com/
# TODO: better validate HERE https://jsonschemalint.com/#!/version/draft-07/markup/json
- hsilo: "test1"
  hrecipe:
    - id: recipe1
      _recipe:
        - filter: with_columns
          includes: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          queries: "#vocab+code+v_6391="
      exemplum:
        - fontem:
            iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0
- hsilo:
    nomen: "test1"
    descriptionem: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      exemplum:
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
          _sheet_index: 1
[
  {
    "recipe": [
      {
        "filter": "with_columns",
        "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
      },
      {
        "filter": "without_rows",
        "queries": "#vocab+code+v_6391="
      }
    ]
  },
  {
    "input": "https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0",
    "recipe": [
      {
        "filter": "with_columns",
        "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
      },
      {
        "filter": "without_rows",
        "queries": "#vocab+code+v_6391="
      }
    ]
  },
  {
    "recipe": [
      {
        "aggregators": [
          "sum(population) as Population#population"
        ],
        "filter": "count",
        "patterns": "adm1+name,adm1+code"
      },
      {
        "filter": "clean_data",
        "number": "population",
        "number_format": ".0f"
      }
    ]
  },
  {
    "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
    "recipe": [
      {
        "aggregators": [
          "sum(population) as Population#population"
        ],
        "filter": "count",
        "patterns": "adm1+name,adm1+code"
      },
      {
        "filter": "clean_data",
        "number": "population",
        "number_format": ".0f"
      }
    ],
    "sheet_index": 1
  }
]
Quick links:
Let's do a proof of concept of the thing!