cschloer closed this issue 4 years ago.
@cschloer Yea.. something to think about... Not sure, for now, what's the best option
I'll try to propose something
@cschloer TBH is it 100% needed to have this data in the output files? Because it breaks the idea of the data package concept (I agree with Adam here).
Can we collect this data and export it differently, e.g. as `collected_missing_values: [value, row, column]`?
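A minimal sketch of that export shape (all names here are hypothetical, not an existing dataflows API):

```python
# Hypothetical helper: strip missing values out of the rows and collect
# them as (value, row, column) records for a separate export.
MISSING_VALUES = {'problem1', 'problem2'}

def collect_missing_values(rows, column):
    cleaned, collected = [], []
    for i, row in enumerate(rows):
        value = row[column]
        if value in MISSING_VALUES:
            collected.append({'value': value, 'row': i, 'column': column})
            row = {**row, column: None}
        cleaned.append(row)
    return cleaned, collected

rows = [{'measurements': '1'}, {'measurements': 'problem1'}, {'measurements': '4'}]
cleaned, collected = collect_missing_values(rows, 'measurements')
```

The output files would then contain only clean (typed) data, with the collected records exported alongside them.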
I'll check in with the rest of the team and get back to you!
@cschloer Also the option, I think already mentioned by @akariv:
Raw data:

```
measurements
1
2
problem1
4
problem2
5
```

Cleaned data:

```
measurement (number), measurement_note (string)
1, -
2, -
null, problem1
4, -
null, problem2
5, -
```
It's also kind of conceptually correct, because the output data will be strongly typed (e.g. it's possible to load it into a SQL database)
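A minimal sketch of that split (a hypothetical row processor, not part of any library):

```python
# Hypothetical processor: split a mixed number/comment column into a
# typed number column plus a free-text note column.
MISSING_VALUES = {'problem1', 'problem2'}

def split_notes(rows, source='measurement', note='measurement_note'):
    for row in rows:
        value = row[source]
        if value in MISSING_VALUES:
            row[source], row[note] = None, value
        else:
            row[source], row[note] = float(value), None
        yield row

rows = [{'measurement': '1'}, {'measurement': 'problem1'}, {'measurement': '4'}]
result = list(split_notes(rows))
```

After the split, the `measurement` column is strictly numeric (or null), so downstream typing and SQL loading work as usual.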
So in high-level, to summarize, I think the best options are:
We're still discussing on our end and probably won't have a response on a preferred solution until Monday, so best to keep this in "waiting for response"
@cschloer So regarding another option proposed by @akariv:
Consider we have a processor called `handle_commented_numbers` which parses a column containing numbers and comments (marked as `number/commented` or similar) into this type:
```python
from decimal import Decimal

class CommentedDecimal(Decimal):
    def __new__(cls, value=0):
        try:
            self = Decimal.__new__(cls, value)
            self.comment = None
        except Exception:
            # check if it's in the missing_values
            if not isinstance(value, str):
                raise
            self = Decimal.__new__(cls, 0)
            self.comment = value
        return self

    def __str__(self):
        if self.comment is None:
            return super().__str__()
        return self.comment

print(CommentedDecimal('3'))
# 3
print(CommentedDecimal('problem1'))
# problem1
print(CommentedDecimal('3') + CommentedDecimal('problem1') + 1)  # 3 + 0 + 1
# 4
```
Then in other processors you can use this column as a number (missing values will be 0), and at the dump stage we don't need to change anything, because a dumper will use `str(value)`, which returns either a number or a comment.
PR.
This approach will require moving parsing out of the `load` processor.
I think the last option is exactly what was described in the initial issue. Another question is whether proper processing (converting missing values into metadata or other columns/resources) aligns better with ETL concepts; in my eyes it does.
So a lot of options to consider. Take your time definitely :smiley:
So that looks great, but is there any reason that couldn't happen in the load step with a new `commentedValue` or `ignoreValue` parameter? And then there'd be a datatype like this for every type (number, integer, datetime, etc.)?
Also I think it would be important that it not resolve to 0 if it's a number, but rather to None, like it currently does with missingValues. That matters when comparing against a value that is actually 0, or in other places that misbehave when the cell actually has a value.
@cschloer I've investigated all these options more closely and a lot of them have failed some of the checks:
TBH at the moment, I don't see a better option than to have a processor that splits a data+comments column into two columns: `data` + `comments`. It was described by this example:
Raw data

```
measurements
1
2
problem1
4
problem2
5
```

Cleaned data

```
measurement (number), measurement_note (string)
1, -
2, -
null, problem1
4, -
null, problem2
5, -
```
Can it work for BCO-DMO?
Actually, I feel now that it's the only proper solution for this problem from the data processing point of view.
@cschloer Also, it's easy to modify this solution to a one-column version:
Raw data

```
measurements
1
2
problem1
4
problem2
5
```

Cleaned data

```
measurement (object)
{value: 1}
{value: 2}
{value: null, error: problem1}
{value: 4}
{value: null, error: problem2}
{value: 5}
```
And we can even have a pre-dump processor converting it back to strings as it was in the source
With the one-column solution, how do intermediate processors handle the values? Do they all need to be rewritten to handle a column of type `{value: '', error: ''}`, or am I misunderstanding?
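To make the concern concrete, a hypothetical downstream processor would have to unwrap and rewrap every cell:

```python
# Hypothetical processor on the one-column representation: every value
# access goes through the {value, error} wrapper.
def double_measurement(rows):
    for row in rows:
        cell = row['measurement']
        if cell.get('value') is not None:
            cell = {**cell, 'value': cell['value'] * 2}
        row['measurement'] = cell
        yield row

rows = [
    {'measurement': {'value': 2}},
    {'measurement': {'value': None, 'error': 'problem1'}},
]
out = list(double_measurement(rows))
```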
@cschloer (cc @akariv) I've created the first prototype - https://github.com/datahq/dataflows/pull/124
raw data

```
col1,col2
1,1
err1,2
3,3
4,err2
5,5
mis1,mis2
7,7
```

pipeline

```yaml
extractmv:
  title: extractmv
  pipeline:
    - run: load
      parameters:
        format: csv
        from: extractmv.csv
        override_schema:
          missingValues: ['err1', 'err2', 'mis1', 'mis2']
        extract_missing_values:
          target: notes
    - run: dump_to_path
      parameters:
        out-path: 'output'
        pretty-descriptor: true
```
cleaned data

```
col1,col2,notes
1,1,{}
,2,"{""col1"": ""err1""}"
3,3,{}
4,,"{""col2"": ""err2""}"
5,5,{}
,,"{""col1"": ""mis1"", ""col2"": ""mis2""}"
7,7,{}
```
cleaned metadata

```json
{
  "bytes": 149,
  "count_of_rows": 7,
  "hash": "3ca24e4cb672bda65cadbc3f1bee0950",
  "name": "_",
  "profile": "data-package",
  "resources": [
    {
      "bytes": 149,
      "dialect": {
        "caseSensitiveHeader": false,
        "delimiter": ",",
        "doubleQuote": true,
        "header": true,
        "lineTerminator": "\r\n",
        "quoteChar": "\"",
        "skipInitialSpace": false
      },
      "dpp:streamedFrom": "extractmv.csv",
      "dpp:streaming": true,
      "encoding": "utf-8",
      "format": "csv",
      "hash": "6d430994867785f148e9d022995a2ad1",
      "name": "extractmv",
      "path": "extractmv.csv",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "format": "default",
            "name": "col1",
            "type": "string"
          },
          {
            "format": "default",
            "name": "col2",
            "type": "string"
          },
          {
            "format": "default",
            "name": "notes",
            "type": "object",
            "values": [
              "err1",
              "err2",
              "mis1",
              "mis2"
            ]
          }
        ],
        "missingValues": [
          "err1",
          "err2",
          "mis1",
          "mis2"
        ]
      }
    }
  ]
}
```
I think a dumper can merge missing values back into their original columns, but keeping them as metadata linked to a row could be more useful for further processing, in my opinion. For example, a UI can use this field to show additional information on top of an empty cell (while knowing that the value is missing)
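For reference, a stripped-down sketch of what the prototype's extraction step does (an illustration only, not the PR's actual code):

```python
# Sketch: move raw missing-value strings into a per-row `notes` object
# and blank the original cells, as in the "cleaned data" above.
MISSING_VALUES = {'err1', 'err2', 'mis1', 'mis2'}

def extract_missing_values(rows, target='notes'):
    for row in rows:
        notes = {}
        for name in list(row):
            if row[name] in MISSING_VALUES:
                notes[name] = row[name]
                row[name] = None
        row[target] = notes
        yield row

rows = [{'col1': 'err1', 'col2': '2'}, {'col1': '3', 'col2': '3'}]
out = list(extract_missing_values(rows))
```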
This looks cool! My main concern is the interaction with processors that rename fields. I know there are none in the standard processor library, but in our custom processor library we have a rename fields processor. I suppose it's possible for the rename_fields processor itself to find the "notes" field and update the key names to the new fields but that also seems like a rabbit hole... Furthermore, what happens when you join two resources? Or concatenate two resources with different missingValues (I guess the schema would need to change the notes field to be the union of the previous two note fields)? I feel like there are some complications here that will rapidly make it more and more difficult to implement.
Can you clarify again the problem of subclassing native types? That feels like the only solution that is robust enough to survive a whole pipeline of processing, and also allows the user to choose at the end if they want to dump the dataset with the missingValues or not.
Just had a talk with Amber and Adam about this. Adam is going to reach out to some people in the Ocean sciences community to figure out what exactly they want for datasets like this in order for us to come up with the solution that works best for their needs. I think it's best to hold off on a complete implementation until we hear back from them.
@cschloer Hi, somehow missed your answer. I'll investigate this week regarding:
Continuing exploring options - https://github.com/datahq/dataflows/pull/125
It extracts missing values into the corresponding column's metadata. It doesn't have problems with column renaming, but it can have problems with data filtering etc., because for now the POC relies on row numbers. Though we could use primary keys instead to make it more reliable
```python
def test_extract_missing_values():
    from dataflows import Flow, load
    schema = {
        'missingValues': ['err1', 'err2', 'mis1', 'mis2'],
        'fields': [
            {'name': 'col1', 'type': 'number', 'format': 'default'},
            {'name': 'col2', 'type': 'number', 'format': 'default'},
        ]
    }
    flow = Flow(
        load('data/missing_values.csv', override_schema=schema, extract_missing_values=True),
    )
    data, package, stats = flow.results()
    fields = package.descriptor['resources'][0]['schema']['fields']
    assert fields[0]['extractedMissingValues'] == {2: 'err1', 6: 'mis1'}
    assert fields[1]['extractedMissingValues'] == {4: 'err2', 6: 'mis2'}
    assert data == [[
        {'col1': 1, 'col2': 1},
        {'col1': None, 'col2': 2},
        {'col1': 3, 'col2': 3},
        {'col1': 4, 'col2': None},
        {'col1': 5, 'col2': 5},
        {'col1': None, 'col2': None},
        {'col1': 7, 'col2': 7},
    ]]
```
Regarding the subclassing option, I still can't even think of any proper implementation to try. The problem is that we need both:

- a type system that includes `None` values as part of it, AND
- a value that acts like `None`, which is impossible because it fails the `value is None` checks and breaks Python semantics

After checking out all these options, the idea of having just an option to preserve missing values - https://github.com/datahq/dataflows/pull/119 - doesn't sound as wrong as it felt initially, TBH.

Though I'm not sure how data with missing values will be used by the client (the UI), it's OK if we find we need to reconsider it. It's already available in `tableschema`.
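A quick illustration of the `value is None` problem, assuming we tried a `Decimal` subclass standing in for a missing value (the class here is hypothetical):

```python
from decimal import Decimal

class MissingDecimal(Decimal):
    """Hypothetical 'None-like' number that remembers the raw missing value."""
    def __new__(cls, raw):
        self = Decimal.__new__(cls, 0)
        self.raw = raw
        return self

cell = MissingDecimal('err1')
# Any downstream guard written as `value is None` will not see this cell
# as missing -- the `is` identity check cannot be overridden in Python.
print(cell is None)  # False
```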
To summarise the options:
Hey @cschloer,
TBH, for now, I've failed to find a simple way to integrate `preserve_missing_values` or `cast_options` into `dataflows`. It seems it would need to be passed in many places, and it's not quite clear to me how casting works there.

What I can propose at the moment is the `TABLESCHEMA_PRESERVE_MISSING_VALUES` environment variable, available in `tableschema@1.15` as an experimental API. It globally affects how `tableschema` casts data, e.g.:

```
TABLESCHEMA_PRESERVE_MISSING_VALUES=1 dpp run ./tmp/preserve_mv
```
pipeline

```yaml
preservemv:
  title: preservemv
  pipeline:
    - run: load
      parameters:
        format: csv
        from: preservemv.csv
        override_schema:
          missingValues: ['err1', 'err2', 'mis1', 'mis2']
          fields:
    - run: dump_to_path
      parameters:
        out-path: 'output'
        pretty-descriptor: true
```

output

```
col1,col2
1,1
err1,2
3,3
4,err2
5,5
mis1,mis2
7,7
```
PS. Probably we can propose an `env` prop in `datapackage_pipelines` for setting it in a `pipeline.yml` spec, to make it transferable.
Hey,
that environment variable seems to work great. What are the downsides to it?
I think it would be rather important to propose an ENV prop to the pipeline-spec.yaml. Should we make a new issue or do you want to use this one still?
@cschloer
Regarding hidden obstacles - TBH, I don't know. Preserving missing values is not a tested approach, so I guess it will be explored during the work on Laminar. From the `tableschema` side it seems to be fine, because every cast just ignores these missing values as if they were nulls.
I've created the env vars issue - https://github.com/frictionlessdata/datapackage-pipelines/issues/181
This new environment variable fails when used in combination with a datetime and dump_to_path. The serializer in this function (https://github.com/datahq/dataflows/blob/master/dataflows/processors/dumpers/file_formats.py#L62 ) tries to serialize the value unless it is None, which runs into problems when it's passed a missingValue that isn't of type datetime. I think adding a `value in missing_values` check here would make it work. @roll
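A sketch of the suggested guard (the function name and signature here are illustrative, not the actual file_formats.py code):

```python
from datetime import datetime

def serialize_value(value, field_type, missing_values):
    # Proposed fix, sketched: preserved missing values arrive as raw
    # strings, so skip type-specific serialization for them.
    if value is None:
        return ''
    if value in missing_values:
        return value
    if field_type == 'datetime':
        return value.strftime('%Y-%m-%dT%H:%M:%SZ')
    return str(value)

print(serialize_value('nd', 'datetime', ['', 'nd']))           # nd
print(serialize_value(datetime(2020, 1, 2), 'datetime', []))   # 2020-01-02T00:00:00Z
```

Without the `value in missing_values` check, the first call would try `'nd'.strftime(...)` and crash.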
Something like this should recreate it:
```yaml
test missingValue date:
  title: test missingValue date
  description: ''
  pipeline:
    - run: load
      cache: false
      parameters:
        format: csv
        from: /path/to/test.csv
        headers: 1
        infer_strategy: strings
        cast_strategy: strings
        override_schema:
          missingValues:
            - ''
            - nd
        name: res
    - run: set_types
      parameters:
        regex: false
        resources:
          - res
        types:
          col4:
            type: datetime
            format: '%m/%d/%y'
    - run: dump_to_path
      parameters:
        out-path: /laminar/test3
        force-format: true
        format: csv
```
with a test.csv where col4 has a string that looks like `%m/%d/%y`
Moving this to top priority
Proposed solution - https://github.com/datahq/dataflows/pull/129
Hey @akariv @cschloer,
It's been a while that we've been working on this issue (preserving missing values landed in `tableschema`; extraction was explored at the `dataflows` level).

As far as I understand, the only outstanding thing is this one - https://github.com/frictionlessdata/datapackage-pipelines/issues/181
So asking both of you :smiley:
I think the PR is important for reproducible specs as you said, but if we want to implement it as Adam specified (global parameters) and then set the environment variable within the relevant processor I think that should also work no problem.
I have to confess I didn't understand this thing with global parameters :smiley: How will it set the env var for `tableschema`?
I assume it will go along with an update to the load processor (or dump processor?) that sets os.environ properly if it gets that parameter.
Actually this could just as easily be implemented by simply adding a parameter to load that determines whether the environment variable is set in the code itself.
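Roughly like this (the parameter name and wrapper are hypothetical):

```python
import os

TS_FLAG = 'TABLESCHEMA_PRESERVE_MISSING_VALUES'

def load(source, preserve_missing_values=False, **params):
    """Hypothetical load wrapper: flip the tableschema switch from a
    processor parameter instead of the shell environment."""
    if preserve_missing_values:
        os.environ[TS_FLAG] = '1'
    else:
        os.environ.pop(TS_FLAG, None)
    # ...delegate to the real load processor here...
    return source

load('preservemv.csv', preserve_missing_values=True)
```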
@cschloer
Have you thought of having a thin DPP wrapper like `laminar-cli` to handle things like setting env vars etc.? It's probably not worth it to solve only this problem, but it could be useful in many other areas (because you'd be able to customize whatever makes sense for your clients, e.g. adding a goodtables validation command). In that case, you'd also be able to document everything in one place.
Thanks @akariv! - I'll follow up on your comments in all the PRs
I'll re-open this issue to include also this one - https://github.com/frictionlessdata/datapackage-pipelines/pull/182
DONE :tada: (ready to merge in #182)
Passing this on to you @roll as I think it will be a pretty big change across multiple repos that will require your codebase knowledge. See previous discussion here: https://github.com/BCODMO/frictionless-usecases/issues/32 as well as previous PRs here: https://github.com/frictionlessdata/tableschema-py/pull/260, https://github.com/frictionlessdata/datapackage-pipelines/pull/175, https://github.com/datahq/dataflows/pull/119.
Sometimes missingValues have meaning beyond "no measurement taken", and sometimes one field will have multiple different kinds of missingValues (such as no measurement taken, measurement below threshold, etc.). It's very important that we be able to preserve that data; up until now we have usually just kept the field as a string type even though it is clearly a number. It would be great if there were an option to dump_to_path, or a different kind of missingValue parameter, that would show the missingValues properly after the dump.
Solution thoughts:
(this is highest priority for us right now I believe)