EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

`hxl-yml-spec-to-hxl-json-spec`: HXL Data processing specs exporter #14

Open · fititnt opened this issue 3 years ago

fititnt commented 3 years ago

Quick links:


Let's do a proof of concept of the thing!

fititnt commented 3 years ago

Maybe this:

hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

and a file like this:

- hsilo: "test1"
  hrecipe:
    - id: recipe1
      source:
        - iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0
          filters:
            - filter: with_columns
              with_columns: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            - filter: without_rows
              without_rows: "#vocab+code+v_6391="

could be a good starting point. But from my experience with Ansible (and very, very large Ansible playbooks), we could allow parsing several YAML files at once from the start and just output all the JSON specs one after the other.

But to get to that point, hdpcli needs to implement some way to at least concatenate more than one YAML file (the part about include_file options may be something for later).
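
As a starting point, here is a minimal sketch (not the actual hdpcli implementation) of what such an exporter could look like, assuming PyYAML and the hypothetical hsilo / hrecipe / source / filters layout above; how each filter's keys map to the keys hxlspec expects is still open, so the filter entries are simply passed through.

#!/usr/bin/env python3
# Sketch only, not the actual hdpcli implementation: read one or more
# *.hdp.yml files, concatenate their documents (the Ansible-like behaviour
# described above) and print one HXL JSON processing spec per source.
# Assumes PyYAML and the hypothetical hsilo/hrecipe/source/filters layout
# proposed above; each filter entry is passed through as-is.
import json
import sys

import yaml


def yml_files_to_specs(paths):
    specs = []
    for path in paths:
        with open(path, encoding='utf-8') as handle:
            # safe_load_all also copes with several YAML documents per file
            for document in yaml.safe_load_all(handle):
                for silo in document or []:
                    for recipe in silo.get('hrecipe', []):
                        for source in recipe.get('source', []):
                            specs.append({
                                'input': source.get('iri'),
                                'recipe': source.get('filters', []),
                            })
    return specs


if __name__ == '__main__':
    print(json.dumps(yml_files_to_specs(sys.argv[1:]), indent=4, sort_keys=True))

Calling a script like this with several files at once (e.g. `python3 sketch.py a.hdp.yml b.hdp.yml`) would already give the "parse several YAML files at once" behaviour without needing include_file yet.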

fititnt commented 3 years ago

This is the YAML file (there is some extra markup, but ignore it for now).

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ cat tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | grep "^[^#;]"

---
- hsilo:
    name: "test1"
    desc: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      iri_example:
        - iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
          sheet_index: 1
      recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f

This is the JSON spec result:

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

{
    "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
    "recipe": [
        {
            "aggregators": [
                "sum(population) as Population#population"
            ],
            "filter": "count",
            "patterns": "adm1+name,adm1+code"
        },
        {
            "filter": "clean_data",
            "number": "population",
            "number_format": ".0f"
        }
    ],
    "sheet_index": 1
}

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | hxlspec

ERROR (hxl.io): Skipping column(s) with malformed hashtag specs: #
Gov,Gov Pcode,Population
#adm1+name,#adm1+code,#population
Abyan,YE12,618892
Ad Dali',YE30,818507
Aden,YE24,1053455
Al Bayda,YE14,795107
Al Hodeidah,YE18,2996334
Al Jawf,YE16,633596
Al Maharah,YE28,175606
Al Mahwit,YE27,770920
Amran,YE29,1221908
Dhamar,YE20,2194159
Hadramawt,YE19,1551347
Hajjah,YE17,2630678
Ibb,YE11,3143818
Lahj,YE25,1076296
Ma'rib,YE26,1086663
Raymah,YE31,562930
Sa'dah,YE22,934201
Sana'a,YE23,1370798
Sana'a City,YE13,3296342
Shabwah,YE21,676408
Socotra,YE32,69004
Ta'iz,YE15,3104579

And piping the result to the command-line hxlspec actually worked. Just a quick warning, but it worked!

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ pip3 show libhxl | grep Version

Version: 4.22
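
For comparison, the same count + clean_data steps can be chained directly with libhxl-python (the library behind hxlspec) instead of going through the JSON spec. This is only a sketch: the exact keyword names of count() / clean_data() and how sheet_index is passed may differ between libhxl versions, so treat the calls below as assumptions to check against the libhxl documentation.

import sys

import hxl

source = hxl.data(
    'https://data.humdata.org/dataset/yemen-humanitarian-needs-overview',
    sheet_index=1,  # assumption: libhxl 4.x accepted this keyword here
)

# Chain the same filters the JSON spec above describes
dataset = source.count(
    patterns='adm1+name,adm1+code',
    aggregators=['sum(population) as Population#population'],
).clean_data(
    number='population',
    number_format='.0f',
)

# Stream the filtered dataset as HXLated CSV, like the hxlspec output above
for line in dataset.gen_csv():
    sys.stdout.write(line)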

fititnt commented 3 years ago

We need some way to define 'inline' data tables that could serve as a way to test whether an HXL data processing spec is working (and this needs to work offline).

This implies adding some new attributes, in particular the concepts of inline data and expected result data. Or maybe the concept of an 'example'.

fititnt commented 3 years ago

Current example:

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ cat tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

# yaml-language-server: $schema=https://raw.githubusercontent.com/EticaAI/HXL-Data-Science-file-formats/main/hxlm/core/schema/hdp.json-schema.json

# How to run this file? Version tested: v0.7.4
# @see https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/14#issuecomment-798454298

# To inspect the result (pretty print)
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
# To pipe the result direct to hxlspec (first item of array, use jq '.[0]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[0]' | hxlspec
# To pipe the result direct to hxlspec (second item of array, use jq '.[1]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[1]' | hxlspec

---

# See also https://proxy.hxlstandard.org/api/from-spec.html
# http://json-schema.org/understanding-json-schema/
# Test schema online https://www.jsonschemavalidator.net/
# Validate schema here: https://www.json-schema-linter.com/
# TODO: better validate HERE https://jsonschemalint.com/#!/version/draft-07/markup/json

- hsilo: "test1"
  hrecipe:
    - id: recipe1
      _recipe:
        - filter: with_columns
          includes: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          queries: "#vocab+code+v_6391="
      exemplum:
        - fontem:
            iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0

- hsilo: 
    nomen: "test1"
    descriptionem: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      exemplum:
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
            _sheet_index: 1

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

[
    {
        "input": "https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0",
        "recipe": [
            {
                "filter": "with_columns",
                "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            },
            {
                "filter": "without_rows",
                "queries": "#vocab+code+v_6391="
            }
        ]
    },
    {
        "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    }
]

fititnt commented 3 years ago

Now, for hdpcli --export-to-hxl-json-processing-specs, the input parameter specified by the HXL data processing specs should come from the first item of an array. Using the internal naming, this means putting it in something like hrecipe.[0].exemplum.[0].fontem.iri instead of hrecipe.[0].iri_example.[0].iri.

The idea of using 'exemplum' is that, if one goal of recipes is reusability, then any input data there would be... just an example/reference.

The impact of this is that now the first item when exporting will always be without example inputs, while the second one will be like before.

Both 'input_data' and 'output_data' are a way to express data inline: 'input_data' is the input data itself (so no external link is needed), and 'output_data' is for when we eventually implement some way for a recipe to be tested on different proxies.

Also, the idea of 'input_data' / 'output_data', even if ignored by the HXL data processing specs, can be used to get an idea of what a recipe would do just by looking at one YAML file. (OK, the real idea is to actually test whether it works, but at least for documentation it already serves!)
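
Not the actual hdpcli code, but a minimal sketch in Python (assuming the YAML has already been parsed, e.g. with PyYAML) of the export behaviour described above: one bare, reusable spec per recipe, followed by one spec per exemplum with the example input filled in from fontem.

def export_recipes(hdp_documents):
    """Sketch of the traversal: hdp_documents is the parsed list of hsilo items."""
    specs = []
    for silo in hdp_documents:
        for item in silo.get('hrecipe', []):
            recipe = item.get('_recipe', [])
            # First the reusable form, without any example input
            specs.append({'recipe': recipe})
            # Then one spec per exemplum, with the example input filled in
            for exemplum in item.get('exemplum', []):
                fontem = exemplum.get('fontem', {})
                spec = {'recipe': recipe}
                if 'iri' in fontem:
                    spec['input'] = fontem['iri']
                if '_sheet_index' in fontem:
                    spec['sheet_index'] = fontem['_sheet_index']
                if 'datum' in fontem:
                    spec['input_data'] = fontem['datum']
                if 'datum' in exemplum.get('objectivum', {}):
                    spec['output_data'] = exemplum['objectivum']['datum']
                specs.append(spec)
    return specs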


fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats/tests/hrecipe$ cat hello-world.hrecipe.hdp.yml

# cd tests/hrecipe
# hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml
# hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml | jq '.[1]' | hxlspec
---
- hsilo:
    nomen: hello-world.hrecipe.hdp.yml
    linguam: mul # https://iso639-3.sil.org/code/mul
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      # iri_example:
      #   - iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
      #     sheet_index: 1
      exemplum:
        # Example one
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
            _sheet_index: 1

        # Example two includes both inline input data and expected output data
        - fontem:
            # Note: fontem.datum is not fully implemented. The idea here is
            #       to be able to create an ad-hoc table instead of using an
            #       external input, so it can serve as a quick example or...
            #       as some sort of unit test for an HXL data processing
            #       spec!
            datum:
              - ["header 1", "header 2", "header 3"]
              - ["#item +id", "#item +name", "#item +value"]
              - ["ACME1", "ACME Inc.", "123"]
              - ["XPTO1", "XPTO org", "456"]
          objectivum:
            # Note: objectivum.datum (like fontem.datum) is not fully
            #       implemented. It also works as an ad-hoc table, but its
            #       real purpose is to allow some sort of unit test for an
            #       HXL data processing spec!
            datum:
              - ["header 1", "header 2", "header 3"]
              - ["#item +id", "#item +name", "#item +value"]
              - ["ACME1", "ACME Inc.", "123"]
              - ["XPTO1", "XPTO org", "456"]

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats/tests/hrecipe$ hdpcli --export-to-hxl-json-processing-specs hello-world.hrecipe.hdp.yml

[
    {
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    },
    {
        "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ],
        "sheet_index": 1
    },
    {
        "input_data": [
            [
                "header 1",
                "header 2",
                "header 3"
            ],
            [
                "#item +id",
                "#item +name",
                "#item +value"
            ],
            [
                "ACME1",
                "ACME Inc.",
                "123"
            ],
            [
                "XPTO1",
                "XPTO org",
                "456"
            ]
        ],
        "output_data": [
            [
                "header 1",
                "header 2",
                "header 3"
            ],
            [
                "#item +id",
                "#item +name",
                "#item +value"
            ],
            [
                "ACME1",
                "ACME Inc.",
                "123"
            ],
            [
                "XPTO1",
                "XPTO org",
                "456"
            ]
        ],
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    }
]

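Even though fontem.datum / objectivum.datum are not fully implemented, the exported input_data / output_data pair already carries everything an offline check would need. Below is a minimal sketch of that check, assuming the export above was saved as specs.json (hypothetical file name); apply_recipe() is only a stub standing in for the real HXL filtering, which would be delegated to libhxl or the HXL Proxy.

import json


def apply_recipe(recipe, rows):
    """Stub: apply an HXL data processing recipe to inline rows."""
    if not recipe:
        return [list(row) for row in rows]  # identity when there are no filters
    raise NotImplementedError('filtering of inline data is not implemented yet')


def check_spec(spec):
    """True if the inline example input produces the expected inline output."""
    actual = apply_recipe(spec.get('recipe', []), spec['input_data'])
    return actual == spec['output_data']


with open('specs.json', encoding='utf-8') as handle:
    specs = json.load(handle)

# Only the entries carrying both inline tables can be checked offline
for spec in (s for s in specs if 'input_data' in s and 'output_data' in s):
    try:
        print('recipe check passed:', check_spec(spec))
    except NotImplementedError as gap:
        print('recipe check skipped:', gap)
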
fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ cat tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

# yaml-language-server: $schema=https://raw.githubusercontent.com/EticaAI/HXL-Data-Science-file-formats/main/hxlm/core/schema/hdp.json-schema.json

# How to run this file? Version tested: v0.7.4
# @see https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/14#issuecomment-798454298

# To inspect the result (pretty print)
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml
# To pipe the result direct to hxlspec (second item of array, use jq '.[1]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[1]' | hxlspec
# To pipe the result direct to hxlspec (4th item of array, use jq '.[3]')
#     hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml | jq '.[3]' | hxlspec

---

# See also https://proxy.hxlstandard.org/api/from-spec.html
# http://json-schema.org/understanding-json-schema/
# Test schema online https://www.jsonschemavalidator.net/
# Validate schema here: https://www.json-schema-linter.com/
# TODO: better validate HERE https://jsonschemalint.com/#!/version/draft-07/markup/json

- hsilo: "test1"
  hrecipe:
    - id: recipe1
      _recipe:
        - filter: with_columns
          includes: "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
        - filter: without_rows
          queries: "#vocab+code+v_6391="
      exemplum:
        - fontem:
            iri: https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0

- hsilo: 
    nomen: "test1"
    descriptionem: from https://docs.google.com/presentation/d/17vXOnq2atIDnrODGLs36P1EaUvT-vXPjsc2I1q1Qc50/
  hrecipe:
    - id: example-processing-with-a-JSON-spec
      _recipe:
        - filter: count
          patterns: "adm1+name,adm1+code"
          aggregators:
            - "sum(population) as Population#population"
        - filter: clean_data
          number: "population"
          number_format: .0f
      exemplum:
        - fontem:
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
            _sheet_index: 1

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli --export-to-hxl-json-processing-specs tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml

[
    {
        "recipe": [
            {
                "filter": "with_columns",
                "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            },
            {
                "filter": "without_rows",
                "queries": "#vocab+code+v_6391="
            }
        ]
    },
    {
        "input": "https://docs.google.com/spreadsheets/d/12k4BWqq5c3mV9ihQscPIwtuDa_QRB-iFohO7dXSSptI/edit#gid=0",
        "recipe": [
            {
                "filter": "with_columns",
                "includes": "#vocab+id+v_iso6393_3letter,#vocab+code+v_6391,#vocab+name"
            },
            {
                "filter": "without_rows",
                "queries": "#vocab+code+v_6391="
            }
        ]
    },
    {
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ]
    },
    {
        "input": "https://data.humdata.org/dataset/yemen-humanitarian-needs-overview",
        "recipe": [
            {
                "aggregators": [
                    "sum(population) as Population#population"
                ],
                "filter": "count",
                "patterns": "adm1+name,adm1+code"
            },
            {
                "filter": "clean_data",
                "number": "population",
                "number_format": ".0f"
            }
        ],
        "sheet_index": 1
    }
]
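
Since the export now always includes a bare, input-less spec per recipe, a small helper (not part of hdpcli, just a sketch) could keep only the directly runnable entries, i.e. the exported items that define "input", before each one is sent to hxlspec.

# only_runnable.py (hypothetical file name), a sketch, not part of hdpcli:
# keep only the exported items that define "input". Usage:
#   hdpcli --export-to-hxl-json-processing-specs \
#       tests/hxl-processing-specs/hxl-processing-specs-test-01.hdp.yml \
#       | python3 only_runnable.py
import json
import sys

specs = json.load(sys.stdin)
runnable = [spec for spec in specs if 'input' in spec]
print(json.dumps(runnable, indent=4, sort_keys=True))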