frictionlessdata / datapackage-pipelines

Framework for processing data packages in pipelines of modular components.
https://frictionlessdata.io/
MIT License

Ability to preserve missingValues in dump #177

Closed cschloer closed 4 years ago

cschloer commented 4 years ago

Passing this on to you @roll as I think it will be a pretty big change across multiple repos that will require your codebase knowledge. See previous discussion here: https://github.com/BCODMO/frictionless-usecases/issues/32 as well as previous PRs here: https://github.com/frictionlessdata/tableschema-py/pull/260, https://github.com/frictionlessdata/datapackage-pipelines/pull/175, https://github.com/datahq/dataflows/pull/119.

Sometimes missingValues have meaning beyond "no measurement taken", and sometimes one field has multiple different kinds of missingValues (no measurement taken, measurement below threshold, etc.). It's very important that we be able to preserve that data, and up until now we have usually just kept the field as a string type even though it is clearly a number. It would be great if there were an option to dump_to_path, or a different kind of missingValue parameter, that would show the missingValues properly after the dump.
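
For a small illustration of what gets lost today (the field name and the 'nd' / 'bdl' markers are just examples, not anything from our schemas), casting with tableschema collapses every missing-value marker to None:

from tableschema import Field

field = Field({'name': 'depth', 'type': 'number'}, missing_values=['', 'nd', 'bdl'])

print(field.cast_value('12.5'))  # Decimal('12.5')
print(field.cast_value('nd'))    # None - the "no data" marker is gone
print(field.cast_value('bdl'))   # None - the "below detection limit" marker is gone too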

Solution thoughts:

(this is the highest priority for us right now, I believe)

roll commented 4 years ago

@cschloer Yeah, something to think about... Not sure, for now, what the best option is.

roll commented 4 years ago

I'll try to propose something

roll commented 4 years ago

@cschloer TBH, is it 100% needed to have this data in the output files? Because it breaks the idea of the data package concept (I agree with Adam here).

Can we collect this data and export it differently:

?

cschloer commented 4 years ago

I'll check in with the rest of the team and get back to you!

roll commented 4 years ago

@cschloer Also, an option I think was already mentioned by @akariv:

Raw data:

measurements
1
2
problem1
4
problem2
5

Cleaned data:

measurement (number), measurement_note (string)
1, -
2, -
null, problem1
4, -
null, problem2
5, -

It's also kinda conceptually correct because output data will be strongly typed (e.g. it's possible to load into a SQL database)
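
A minimal, framework-agnostic sketch of that split (the column names and the rule "anything that doesn't parse as a number is a note" are assumptions for illustration):

from decimal import Decimal, InvalidOperation

def split_measurement(row):
    raw = row.pop('measurements')
    try:
        row['measurement'] = Decimal(raw)
        row['measurement_note'] = None
    except InvalidOperation:
        # anything that doesn't parse as a number becomes the note
        row['measurement'] = None
        row['measurement_note'] = raw
    return row

rows = [{'measurements': v} for v in ['1', '2', 'problem1', '4', 'problem2', '5']]
print([split_measurement(row) for row in rows])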

So at a high level, to summarize, I think the best options are:

cschloer commented 4 years ago

We're still discussing on our end and probably won't have a response on a preferred solution until Monday, so best to keep this in "waiting for response"

roll commented 4 years ago

@cschloer So regarding another option proposed by @akariv:

Consider we have a processor called handle_commented_numbers which parses a column containing numbers and comments (marked as number/commented or something like that) into this type:

from decimal import Decimal

class CommentedDecimal(Decimal):

    def __new__(cls, value=0):
        try:
            # normal case: the value parses as a number
            self = Decimal.__new__(cls, value)
            self.comment = None
        except Exception:
            # check if it's in the missing_values
            if not isinstance(value, str):
                raise
            # fall back to 0 and keep the raw string as a comment
            self = Decimal.__new__(cls, 0)
            self.comment = value
        return self

    def __str__(self):
        # dumpers that stringify cells get the original comment back when present
        if self.comment is None:
            return super().__str__()
        return self.comment

print(CommentedDecimal('3'))
# 3
print(CommentedDecimal('problem1'))
# problem1
print(CommentedDecimal('3') + CommentedDecimal('problem1') + 1) # 3 + 0 + 1
# 4

Then in other processors you can use this column as a number (missing values will be 0), and at the dump stage we don't need to change anything because a dumper will use str(value), which returns a number OR a comment.

PS. This approach will require moving parsing out of the load processor.
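
For illustration only, here is roughly what happens when a dumper stringifies cells (as described above), using the CommentedDecimal class from this comment:

import csv
import io

values = [CommentedDecimal('1'), CommentedDecimal('problem1'), CommentedDecimal('4')]

out = io.StringIO()
writer = csv.writer(out)
for value in values:
    writer.writerow([str(value)])

print(out.getvalue())
# 1
# problem1
# 4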

roll commented 4 years ago

I think the last option is exactly what was described in the initial issue. On the other hand, proper processing (converting missing values into metadata or other columns/resources) aligns better with ETL concepts in my eyes.

So a lot of options to consider. Take your time definitely :smiley:

cschloer commented 4 years ago

So that looks great - but is there any reason that couldn't happen in the load step with a new commentedValue or ignoreValue parameter? And then there would be a datatype like this for every type (number, integer, datetime, etc.)?

Also I think it would be important that it not resolve to 0 if it's a number but rather to None, like it currently does with missingValues. That matters when comparing to a value that is actually 0, or in other operations that break when the cell actually has a value.
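
For example, with the 0-backed sketch above, a marker silently takes part in numeric logic as if it were a real measurement of 0:

threshold = 1
values = [CommentedDecimal('2'), CommentedDecimal('problem1')]

print([value < threshold for value in values])
# [False, True] - 'problem1' behaves like 0 and passes the check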

roll commented 4 years ago

@cschloer I've investigated all these options more closely and a lot of them have failed some of the checks:

TBH, at the moment I don't see a better option than a processor that splits a data+comments column into two columns: data + comments. It was described by this example:

Raw data

measurements
1
2
problem1
4
problem2
5

Cleaned data

measurement (number), measurement_note (string)
1, -
2, -
null, problem1
4, -
null, problem2
5, -

Can it work for BCO-DMO?

Actually, I feel now that it's the only proper solution for this problem from the data processing point of view.

roll commented 4 years ago

@cschloer Also, it's easy to modify this solution to a one-column version:

Raw data

measurements
1
2
problem1
4
problem2
5

Cleaned data

measurement (object)
{value: 1}
{value: 2}
{value: null, error: problem1}
{value: 4}
{value: null, error: problem2}
{value: 5}

And we can even have a pre-dump processor that converts it back to strings, as it was in the source.
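
A rough sketch of what such a pre-dump step could look like (the column name and the {value, error} keys just follow the example above; they aren't a fixed schema):

def restore_source_string(row, column='measurement'):
    cell = row[column]
    # emit the original marker if there is one, otherwise the typed value
    row[column] = cell['error'] if cell.get('error') is not None else cell['value']
    return row

print(restore_source_string({'measurement': {'value': None, 'error': 'problem1'}}))
# {'measurement': 'problem1'}
print(restore_source_string({'measurement': {'value': 4}}))
# {'measurement': 4}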

cschloer commented 4 years ago

With the one column solution, how do intermediate processors handle the values? Do they all need to be rewritten to handle a column of type { value: '', error: '' }, or am I misunderstanding?

roll commented 4 years ago

@cschloer (cc @akariv) I've created the first prototype - https://github.com/datahq/dataflows/pull/124

raw data

col1,col2
1,1
err1,2
3,3
4,err2
5,5
mis1,mis2
7,7

pipeline

extractmv:
  title: extractmv
  pipeline:

  - run: load
    parameters:
      format: csv
      from: extractmv.csv
      override_schema:
        missingValues: ['err1', 'err2', 'mis1', 'mis2']
      extract_missing_values:
        target: notes

  - run: dump_to_path
    parameters:
      out-path: 'output'
      pretty-descriptor: true

cleaned data

col1,col2,notes
1,1,{}
,2,"{""col1"": ""err1""}"
3,3,{}
4,,"{""col2"": ""err2""}"
5,5,{}
,,"{""col1"": ""mis1"", ""col2"": ""mis2""}"
7,7,{}

cleaned metadata

{
  "bytes": 149,
  "count_of_rows": 7,
  "hash": "3ca24e4cb672bda65cadbc3f1bee0950",
  "name": "_",
  "profile": "data-package",
  "resources": [
    {
      "bytes": 149,
      "dialect": {
        "caseSensitiveHeader": false,
        "delimiter": ",",
        "doubleQuote": true,
        "header": true,
        "lineTerminator": "\r\n",
        "quoteChar": "\"",
        "skipInitialSpace": false
      },
      "dpp:streamedFrom": "extractmv.csv",
      "dpp:streaming": true,
      "encoding": "utf-8",
      "format": "csv",
      "hash": "6d430994867785f148e9d022995a2ad1",
      "name": "extractmv",
      "path": "extractmv.csv",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "format": "default",
            "name": "col1",
            "type": "string"
          },
          {
            "format": "default",
            "name": "col2",
            "type": "string"
          },
          {
            "format": "default",
            "name": "notes",
            "type": "object",
            "values": [
              "err1",
              "err2",
              "mis1",
              "mis2"
            ]
          }
        ],
        "missingValues": [
          "err1",
          "err2",
          "mis1",
          "mis2"
        ]
      }
    }
  ]
}

I think a dumper can merge missing values back into their original columns, but keeping them as metadata linked to a row could be more useful for further processing, in my opinion. For example, a UI can use this field to show some additional information on top of an empty cell (while knowing that the value is missing).
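
A sketch of the merge mentioned above (just the row-level idea; the real dumper integration would look different):

def merge_notes_back(row, notes_field='notes'):
    # push the preserved source strings back into their original columns
    notes = row.pop(notes_field, None) or {}
    for column, original in notes.items():
        row[column] = original
    return row

print(merge_notes_back({'col1': None, 'col2': 2, 'notes': {'col1': 'err1'}}))
# {'col1': 'err1', 'col2': 2}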

cschloer commented 4 years ago

This looks cool! My main concern is the interaction with processors that rename fields. I know there are none in the standard processor library, but in our custom processor library we have a rename fields processor. I suppose it's possible for the rename_fields processor itself to find the "notes" field and update the key names to the new fields but that also seems like a rabbit hole... Furthermore, what happens when you join two resources? Or concatenate two resources with different missingValues (I guess the schema would need to change the notes field to be the union of the previous two note fields)? I feel like there are some complications here that will rapidly make it more and more difficult to implement.

Can you clarify again the problem of subclassing native types? That feels like the only solution that is robust enough to survive a whole pipeline of processing, and also allows the user to choose at the end if they want to dump the dataset with the missingValues or not.

Just had a talk with Amber and Adam about this. Adam is going to reach out to some people in the Ocean sciences community to figure out what exactly they want for datasets like this in order for us to come up with the solution that works best for their needs. I think it's best to hold off on a complete implementation until we hear back from them.

roll commented 4 years ago

@cschloer Hi, I somehow missed your answer. I'll investigate this week regarding:

roll commented 4 years ago

Continuing to explore options - https://github.com/datahq/dataflows/pull/125

It extracts missing values into the corresponding column's metadata. It doesn't have problems with column renaming but can have problems with data filtering etc., because for now the POC relies on row numbers. Though we could use primary keys instead to make it more reliable.

def test_extract_missing_values():
    from dataflows import Flow, load
    schema = {
        'missingValues': ['err1', 'err2', 'mis1', 'mis2'],
        'fields': [
            {'name': 'col1', 'type': 'number', 'format': 'default'},
            {'name': 'col2', 'type': 'number', 'format': 'default'},
        ]
    }
    flow = Flow(
        load('data/missing_values.csv', override_schema=schema, extract_missing_values=True),
    )
    data, package, stats = flow.results()
    # missing values are moved into per-field metadata keyed by row number
    fields = package.descriptor['resources'][0]['schema']['fields']
    assert fields[0]['extractedMissingValues'] == {2: 'err1', 6: 'mis1'}
    assert fields[1]['extractedMissingValues'] == {4: 'err2', 6: 'mis2'}
    assert data == [[
        {'col1': 1, 'col2': 1},
        {'col1': None, 'col2': 2},
        {'col1': 3, 'col2': 3},
        {'col1': 4, 'col2': None},
        {'col1': 5, 'col2': 5},
        {'col1': None, 'col2': None},
        {'col1': 7, 'col2': 7},
    ]]

roll commented 4 years ago

Regarding the subclassing option, I still can't even think of a proper implementation to try. The problem is that:

roll commented 4 years ago

After checking out all these options, the idea of having just an option to preserve missing values - https://github.com/datahq/dataflows/pull/119 - doesn't sound as wrong as it felt initially, TBH.

Though I'm not sure how data with missing values will be used by the client (the UI), I think it's OK for us to reconsider it. It's already available in tableschema.

roll commented 4 years ago

To summarise the options:

roll commented 4 years ago

Hey @cschloer,

TBH, for now I've failed to find a simple way to integrate preserve_missing_values or cast_options into dataflows. It seems it would need to be passed in many places, and it's not quite clear to me how casting works there.

What I can propose at the moment is the TABLESCHEMA_PRESERVE_MISSING_VALUES environment variable, available in tableschema@1.15 as an experimental API. It globally affects how tableschema casts data, e.g.:

TABLESCHEMA_PRESERVE_MISSING_VALUES=1 dpp run ./tmp/preserve_mv

pipeline


preservemv:
  title: preservemv
  pipeline:

output

col1,col2
1,1
err1,2
3,3
4,err2
5,5
mis1,mis2
7,7

PS. Probably we can propose an env prop in datapackage_pipelines for setting it in a pipeline.yml spec to make it transferable.

cschloer commented 4 years ago

Hey,

that environment variable seems to work great. What are the downsides to it?

I think it would be rather important to propose an ENV prop to the pipeline-spec.yaml. Should we make a new issue or do you want to use this one still?

roll commented 4 years ago

@cschloer Regarding hidden obstacles - TBH, I don't know. Preserving missing values is not a tested approach, so I guess it will be explored during the work on Laminar. From the tableschema side it seems to be fine, because every cast just ignores these missing values as if they were nulls.

I've created the env vars issue - https://github.com/frictionlessdata/datapackage-pipelines/issues/181

cschloer commented 4 years ago

This new environment variable fails when used in combination with a datetime and dump_to_path. The serializer in this function (https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/file_formats.py#L62) tries to serialize the value unless it is None, which runs into problems when it's passed a missingValue that isn't of type datetime. I think adding a "value in missing_values" check here would make it work. @roll
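
A sketch of the suggested check (not the actual dataflows code; serializer stands in for whatever per-type serializer the dumper looks up for the field):

def serialize_cell(value, serializer, missing_values):
    if value is None:
        return ''
    if value in missing_values:
        # a preserved missing value is written through as-is instead of
        # being handed to (for example) the datetime serializer
        return value
    return serializer(value)

print(serialize_cell('nd', lambda v: v.isoformat(), ['', 'nd']))
# nd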

cschloer commented 4 years ago

Something like this should recreate it:

test missingValue date:
  title: test missingValue date
  description: ''
  pipeline:
    - run: load
      cache: false
      parameters:
        format: csv
        from: /path/to/test.csv
        headers: 1
        infer_strategy: strings
        cast_strategy: strings
        override_schema:
          missingValues:
            - ''
            - nd
        name: res
    - run: set_types
      parameters:
        regex: false
        resources:
          - res
        types:
          col4:
            type: datetime
            format: '%m/%d/%y'
    - run: dump_to_path
      parameters:
        out-path: /laminar/test3
        force-format: true
        format: csv

with a test.csv where col4 has a string that looks like %m/%d/%y

cschloer commented 4 years ago

Moving this to top priority

roll commented 4 years ago

Proposed solution - https://github.com/datahq/dataflows/pull/129

roll commented 4 years ago

Hey @akariv @cschloer,

We've been working on this issue for a while:

As far as I understand, the only outstanding thing is this PR - https://github.com/frictionlessdata/datapackage-pipelines/issues/181

So asking both of you :smiley:

cschloer commented 4 years ago

I think the PR is important for reproducible specs, as you said, but if we want to implement it as Adam specified (global parameters) and then set the environment variable within the relevant processor, I think that should also work without a problem.

roll commented 4 years ago

I have to confess I didn't understand this thing with global parameters :smiley: How will it set the env var for tableschema?

cschloer commented 4 years ago

I assume it will go along with an update to the load processor (or dump processor?) that sets os.environ properly if it gets that parameter.

Actually this could just as easily be implemented by simply adding a parameter to load that determines whether the environment variable is set in the code itself.
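
A minimal sketch of that in-process idea, independent of DPP itself (the field descriptor and markers are made up; the expected output follows the behaviour described earlier in this thread):

import os

# must be set before tableschema does any casting in this process
os.environ['TABLESCHEMA_PRESERVE_MISSING_VALUES'] = '1'

from tableschema import Field

field = Field({'name': 'col1', 'type': 'number'}, missing_values=['', 'err1'])

print(field.cast_value('3'))     # Decimal('3')
print(field.cast_value('err1'))  # the 'err1' marker comes through instead of None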

roll commented 4 years ago

@cschloer Have you thought of having a thin DPP wrapper like laminar-cli to handle things like setting env vars? It's probably not worth it just to solve this one problem, but it could be useful in many other areas (because you would be able to customize whatever makes sense for your clients, e.g. adding a goodtables validation command). That way you would also be able to document everything in one place.
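
A toy sketch of what such a wrapper could look like (laminar-cli is a hypothetical name; the wrapper just exports the variable and hands everything else to dpp):

#!/usr/bin/env python
import os
import subprocess
import sys

def main():
    # project-wide defaults live in one documented place
    env = dict(os.environ, TABLESCHEMA_PRESERVE_MISSING_VALUES='1')
    # pass all remaining arguments straight through to the real dpp CLI
    return subprocess.call(['dpp'] + sys.argv[1:], env=env)

if __name__ == '__main__':
    sys.exit(main())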

roll commented 4 years ago

Thanks @akariv! I'll follow up on your comments on the PRs.

I'll re-open this issue to include also this one - https://github.com/frictionlessdata/datapackage-pipelines/pull/182

roll commented 4 years ago

DONE :tada: (ready to merge in #182)