frictionlessdata / forum

🗣 Frictionless Data Forum esp for "How do I" type questions
https://frictionlessdata.io/

HXL/HDX to datapackage conversion #22

Closed mcarans closed 4 years ago

mcarans commented 6 years ago

I have started looking at how to convert HXL to a datapackage and made a very simple first example here: https://github.com/mcarans/hxl-frictionless/blob/master/run.py

I would like to extend datapackage to support and specifically identify HXL resources. Is there any issue with adding two new attributes, "is_hxl" on the resource and "hxl_tag" on each field? e.g.:

    "resources": [
        {
            "url": "xxx",
            "is_hxl": true,
            "schema": {
                "fields": [
                    {
                        "name": "year",
                        "type": "date",
                        "hxl_tag": "#date+year"
                    },
                    {
                        "name": "revisedRequirements",
                        "type": "number",
                        "hxl_tag": "#value+funding+required+usd"
                    },

With those attributes in place, I would then consider how to extend datapackage-py to support HXL.
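
As a rough illustration of the proposal above, the two attributes can be added to an existing descriptor with plain dictionaries. This is only a sketch: the helper name `annotate_with_hxl` and the `hxl_tags` mapping are hypothetical and not part of datapackage-py, and only the property names "is_hxl" and "hxl_tag" come from the example above.

```python
import copy


def annotate_with_hxl(resource, hxl_tags):
    """Return a copy of a resource descriptor with HXL custom properties.

    hxl_tags maps field names to HXL tag strings (hypothetical input);
    fields without an entry are left untouched.
    """
    annotated = copy.deepcopy(resource)
    fields = annotated.get("schema", {}).get("fields", [])
    tagged = False
    for field in fields:
        tag = hxl_tags.get(field["name"])
        if tag:
            field["hxl_tag"] = tag
            tagged = True
    # Mark the resource as HXL only if at least one field received a tag.
    annotated["is_hxl"] = tagged
    return annotated


resource = {
    "name": "requirements",
    "schema": {
        "fields": [
            {"name": "year", "type": "date"},
            {"name": "revisedRequirements", "type": "number"},
        ]
    },
}
tags = {
    "year": "#date+year",
    "revisedRequirements": "#value+funding+required+usd",
}
annotated = annotate_with_hxl(resource, tags)
```

The original descriptor is copied rather than modified, so the untagged version stays available for tools that do not know about the extra properties.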

roll commented 6 years ago

@mcarans Hi. It's perfectly OK under the specs (and guaranteed to stay OK in all future versions) to extend any part of a descriptor. For example, a similar solution exists for SPSS (some format) - https://github.com/frictionlessdata/tableschema-spss-py#creating-sav-files

We also have a Storage architecture for tableschema - https://github.com/frictionlessdata/tableschema-py#storage. It allows exporting/importing data resources and packages to and from different backends such as SQL or Pandas. I'm not sure whether it's applicable to anything in the HXL/HDX ecosystem, but you could take a look.

mcarans commented 6 years ago

I have updated the code to use datapackage-py. It reads an HDX dataset and produces a datapackage, using HXL type information if available (falling back on datapackage-py's inference).

    converter = Converter()
    converter.convert_hdx_dataset('fts-requirements-and-funding-data-for-afghanistan', 'datapackage.json')

gives:

{
    "profile": "tabular-data-package",
    "description": "FTS publishes data on humanitarian funding flows as reported by donors and recipient organizations. It presents all humanitarian funding to a country and funding that is specifically reported or that can be specifically mapped against funding requirements stated in humanitarian response plans. The data comes from OCHA's [Financial Tracking Service](https://fts.unocha.org/), is encoded as utf-8 and the second row of the CSV contains [HXL](http://hxlstandard.org) tags.",
    "resources": [
        {
            "profile": "tabular-data-resource",
            "format": "csv",
            "path": "http://data.humdata.org/dataset/6a60da4e-253f-474f-8683-7c9ed9a20bf9/resource/892affae-9a91-4f27-8640-4630412e663f/download/fts_funding_afg.csv",
            "encoding": "utf-8",
            "name": "fts_funding_afg.csv",
            "is_hxl": true,
            "schema": {
                "fields": [
                    {
                        "name": "date",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date"
                    },
                    {
                        "name": "budgetYear",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+year"
                    },
                    {
                        "name": "description",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#description+notes"
                    },
                    {
                        "name": "amountUSD",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+total+usd"
                    },
                    {
                        "name": "organizationName",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#org+name"
                    },
                    {
                        "name": "organizationTypes",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#org+type"
                    },
                    {
                        "name": "organizationId",
                        "format": "default",
                        "type": "integer",
                        "hxl_tag": "#org+id"
                    },
                    {
                        "name": "contributionType",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+contribution+type"
                    },
                    {
                        "name": "flowType",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+contribution+type"
                    },
                    {
                        "name": "method",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+method"
                    },
                    {
                        "name": "boundary",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+direction"
                    },
                    {
                        "name": "status",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#status+text"
                    },
                    {
                        "name": "firstReportedDate",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+reported"
                    },
                    {
                        "name": "decisionDate",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+decision"
                    },
                    {
                        "name": "keywords",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#description+keywords"
                    },
                    {
                        "name": "originalAmount",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+total"
                    },
                    {
                        "name": "originalCurrency",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+total+currency"
                    },
                    {
                        "name": "exchangeRate",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+fx"
                    },
                    {
                        "name": "id",
                        "format": "default",
                        "type": "integer",
                        "hxl_tag": "#activity+id+fts_internal"
                    },
                    {
                        "name": "refCode",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#activity+code"
                    },
                    {
                        "name": "createdAt",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+created"
                    },
                    {
                        "name": "updatedAt",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+updated"
                    }
                ],
                "missingValues": [
                    ""
                ]
            },
            "title": "FTS Funding Data for Afghanistan for 2017",
            "mediatype": "text/csv"
        },
        {
            "profile": "tabular-data-resource",
            "format": "csv",
            "path": "http://data.humdata.org/dataset/6a60da4e-253f-474f-8683-7c9ed9a20bf9/resource/96cc8cda-120f-483a-a7ce-5f89af4d99ee/download/fts_funding_requirements_afg.csv",
            "encoding": "utf-8",
            "name": "fts_funding_requirements_afg.csv",
            "is_hxl": true,
            "schema": {
                "fields": [
                    {
                        "name": "country",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#country+name"
                    },
                    {
                        "name": "id",
                        "format": "default",
                        "type": "integer",
                        "hxl_tag": "#activity+appeal+id+fts_internal"
                    },
                    {
                        "name": "name",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#activity+appeal+name"
                    },
                    {
                        "name": "code",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#activity+appeal+id+external"
                    },
                    {
                        "name": "startDate",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+start"
                    },
                    {
                        "name": "endDate",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+end"
                    },
                    {
                        "name": "year",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+year"
                    },
                    {
                        "name": "revisedRequirements",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+required+usd"
                    },
                    {
                        "name": "totalFunding",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+total+usd"
                    },
                    {
                        "name": "percentFunded",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+pct"
                    }
                ],
                "missingValues": [
                    ""
                ]
            },
            "title": "FTS Requirements and Funding Data for Afghanistan",
            "mediatype": "text/csv"
        },
        {
            "profile": "tabular-data-resource",
            "format": "csv",
            "path": "http://data.humdata.org/dataset/6a60da4e-253f-474f-8683-7c9ed9a20bf9/resource/45dc4269-405a-433d-9011-d1ae23d624a5/download/fts_funding_cluster_afg.csv",
            "encoding": "utf-8",
            "name": "fts_funding_cluster_afg.csv",
            "is_hxl": true,
            "schema": {
                "fields": [
                    {
                        "name": "country",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#country+name"
                    },
                    {
                        "name": "id",
                        "format": "default",
                        "type": "integer",
                        "hxl_tag": "#activity+appeal+id+fts_internal"
                    },
                    {
                        "name": "name",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#activity+appeal+name"
                    },
                    {
                        "name": "code",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#activity+appeal+id+external"
                    },
                    {
                        "name": "startDate",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+start"
                    },
                    {
                        "name": "endDate",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+end"
                    },
                    {
                        "name": "year",
                        "format": "default",
                        "type": "date",
                        "hxl_tag": "#date+year"
                    },
                    {
                        "name": "totalFunding",
                        "format": "default",
                        "type": "number",
                        "hxl_tag": "#value+funding+total+usd"
                    },
                    {
                        "name": "clusterCode",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#sector+code"
                    },
                    {
                        "name": "clusterName",
                        "format": "default",
                        "type": "string",
                        "hxl_tag": "#sector+name"
                    }
                ],
                "missingValues": [
                    ""
                ]
            },
            "title": "FTS Funding Data by Cluster for Afghanistan",
            "mediatype": "text/csv"
        }
    ],
    "name": "fts-requirements-and-funding-data-for-afghanistan",
    "title": "Afghanistan - Requirements and Funding Data",
    "id": "6a60da4e-253f-474f-8683-7c9ed9a20bf9"
}

mcarans commented 6 years ago

One thing I just realised is that I need to be able to tell infer to ignore the HXL tags (on the row after the headers), although it seemed to do a reasonable job in any case.
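
One way to keep the tag row out of inference is to split it off before the rows reach any type-guessing code. The sketch below uses only the standard library and does not touch datapackage-py's infer. The function names are hypothetical, and the detection rule (every non-empty cell in the row after the headers starts with "#") is a simplifying assumption, not the full HXL detection logic.

```python
import csv
import io


def is_hxl_tag_row(row):
    """Heuristic (assumption): all non-empty cells start with '#'."""
    cells = [cell.strip() for cell in row if cell.strip()]
    return bool(cells) and all(cell.startswith("#") for cell in cells)


def split_hxl_rows(csv_text):
    """Split CSV text into (headers, hxl_tags or None, data_rows)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return [], None, []
    headers, rest = rows[0], rows[1:]
    if rest and is_hxl_tag_row(rest[0]):
        # The tag row is returned separately and excluded from the data.
        return headers, rest[0], rest[1:]
    return headers, None, rest


sample = (
    "year,revisedRequirements\n"
    "#date+year,#value+funding+required+usd\n"
    "2017,550000000\n"
)
headers, hxl_tags, data_rows = split_hxl_rows(sample)
```

Only `data_rows` would then be handed to type inference, while `hxl_tags` can be used to fill in the "hxl_tag" property of each field.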

roll commented 6 years ago

Wow. That's great

mcarans commented 6 years ago

Thanks. I am trying to think how we could make Frictionless tools HXL aware. What I mean is simply handling the second header row containing the HXL tags, i.e. not treating the tags as data when, e.g., validating a column or inferring a column type. This would enable us to use the Frictionless tools on datasets converted to datapackages broadly as I have done above. What do you think?

roll commented 6 years ago

@mcarans At the table level we support a skip_rows argument, so we could use it like this to skip the second row of a data source:

    from tableschema import Table
    table = Table(data, schema=schema, skip_rows=[2])

It's not part of the Data Package standard for now, but as a concrete implementation datapackage-py could still add support for it:

    package = Package({'resources': [{'name': 'name', 'path': 'path', 'skipRows': [2]}]})

I suppose that would solve the HXL tags issue?

mcarans commented 6 years ago

@roll That would be a good place to start.

Looking further ahead, I am pondering whether it's possible or worthwhile to make a tighter integration between Package (and Resource in particular) and HXL, at least in the concrete implementation, but I'm not sure how easily it would plug into the existing architecture, e.g. 'hxlTags': [2] instead of skipRows. But let's just go with what you had in mind for now.
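
To make the 'hxlTags': [2] idea a bit more concrete, here is a minimal sketch of how such a property might be interpreted. Everything in it is hypothetical: `hxlTags` is not part of the Data Package standard or of datapackage-py, the function name is made up, and the sketch assumes 1-based row numbers with the header on row 1, in line with the skip_rows=[2] example above.

```python
def read_with_hxl_tags(resource, rows):
    """Interpret a hypothetical 'hxlTags' property on a resource.

    rows is every row of the source, header included (row 1).
    Rows listed in 'hxlTags' are read as HXL tags and attached to copies
    of the schema fields; they are not returned as data.
    """
    tag_rows = set(resource.get("hxlTags", []))
    fields = [dict(field) for field in resource["schema"]["fields"]]
    data = []
    for number, row in enumerate(rows, start=1):
        if number == 1:
            continue  # header row, already described by the schema
        if number in tag_rows:
            for field, tag in zip(fields, row):
                if tag.strip():
                    field["hxl_tag"] = tag.strip()
        else:
            data.append(row)
    return fields, data


resource = {
    "name": "requirements",
    "hxlTags": [2],
    "schema": {
        "fields": [
            {"name": "year", "type": "date"},
            {"name": "revisedRequirements", "type": "number"},
        ]
    },
}
rows = [
    ["year", "revisedRequirements"],
    ["#date+year", "#value+funding+required+usd"],
    ["2017", "550000000"],
]
fields, data = read_with_hxl_tags(resource, rows)
```

Compared with skipRows, the tag row is not just dropped: its contents end up in the field descriptors, which is the tighter integration being considered here.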

rufuspollock commented 4 years ago

FIXED. Looks like this largely got resolved 😄