Closed: adyork closed this issue 4 years ago
Hey @lwinfree @roll, is there any update on this issue in particular? I think this is a rather important one for us. Specifically, the issue is that `dump_to_path` forces a particular datetime format. A quick fix for us has been to fork dataflows and change that constant to what we need (`%Y-%m-%dT%H:%M:%SZ`), but that isn't a great long-term solution.
Is there any way the dumpers can output the datetime fields using the format specified within the datapackage?
Hi @cschloer, I was on holiday. Now I'm back and I will prioritize this one.
@cschloer So this one is basically a subset of #19. Could we please return to #19 for a sec? I actually wanted you to take a look at, and brainstorm, a possible API solution on your end for custom date/time formats.
To be re-checked after #19 is done
It's done in dataflows@0.0.60
(please try https://github.com/datahq/dataflows/pull/100#issuecomment-534428744)
Excited to try this out, thanks!
Hey, I'm still running into problems with this update @roll. Maybe I just misinterpreted how you were going to implement the solution.
Basically, say I have a datetime field with `output_format=%Y-%m-%dT%H:%M:%SZtest` (`output_format` is a parameter in the custom processor `convert_date`). The `convert_date` processor also sets the `outputFormat` key in the field's datapackage entry to `%Y-%m-%dT%H:%M:%SZtest`. The `test` part at the end is just there to make it different from the normal datetime format `%Y-%m-%dT%H:%M:%S`. The custom `convert_date` processor first uses this `output_format` with the Python datetime package to set the value of the field to the output of `date_obj.strftime(output_format)`. Thus the value in the resource might look something like `2019-07-16T13:29:25Ztest`.
At this point, when I try to use `dump_to_path`, even with `temporal_format_property=outputFormat`, I get the error `Field "ISO_DateTime_UTC" can't cast value "2019-07-16T13:29:25Ztest" for type "datetime" with format "%Y-%m-%dT%H:%M:%S"`. It seems as if the `dump_to_path` caster/validator is still trying to convert the datetime field to the default format (`%Y-%m-%dT%H:%M:%S`) before it then changes the format value to `%Y-%m-%dT%H:%M:%SZtest`. If, in the `convert_date` step, I instead set the value to the result of calling `strftime` with the default format (`%Y-%m-%dT%H:%M:%S`), it works and properly dumps the value as `2019-07-16T13:29:25Ztest`.
This could theoretically work, since the user rarely sees the value inside the row outside of a `dump_to_path` step. But it won't always work. For example, this is dumb but possible: what if the user tries to use `set_type` to set the `datetime` field to `string`? It looks like the value in the row is `2019-07-16T13:29:25Ztest`, but it is actually `2019-07-16T13:29:25`. So when it gets converted to a string, the value is again `2019-07-16T13:29:25`.
More realistic is the possibility of using the custom `convert_date` processor on the same field if, for example, multiple date fields in different formats are expected. The user might expect, because they see the value `2019-07-16T13:29:25Ztest`, to write `%Y-%m-%dT%H:%M:%SZtest` for the `input_format` parameter, when in fact they should be writing `%Y-%m-%dT%H:%M:%S`.
Furthermore, when using a field-combining processor (such as `add_computed_field`), the user will get the value in the row (which is in the default format) rather than what they see from a `dump_to_path`.
Do you understand the issue I am having here @roll? Is there any workaround to ensure that the `dump_to_path` processor does not try to cast with the default format before it switches to using the `temporal_format_property`?
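The mismatch can be reproduced in plain Python, without dataflows (a minimal sketch; the format strings are the ones from above, and the default-format constant just mirrors what the thread describes):

```python
from datetime import datetime

DEFAULT_FORMAT = '%Y-%m-%dT%H:%M:%S'       # the default datetime format described above
OUTPUT_FORMAT = '%Y-%m-%dT%H:%M:%SZtest'   # the custom outputFormat set by convert_date

# convert_date stringifies the value with the custom format
value = datetime(2019, 7, 16, 13, 29, 25).strftime(OUTPUT_FORMAT)
print(value)  # 2019-07-16T13:29:25Ztest

# the caster still parses with the default format, so it raises ValueError,
# matching the "can't cast value" pipeline error
try:
    datetime.strptime(value, DEFAULT_FORMAT)
except ValueError:
    print('cast fails')

# parsing with the custom format would succeed
parsed = datetime.strptime(value, OUTPUT_FORMAT)
```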
@cschloer Could you please share some testing pipeline so I can play with it?
OK I think this should be enough:
date-test:
  title: date-test
  description: ''
  version: v1.0.4
  pipeline:
    - run: bcodmo_pipeline_processors.load
      cache: false
      parameters:
        format: xlsx
        from: https://github.com/BCODMO/frictionless-usecases/blob/master/usecases/754644_Carpenter2018_physical_data/orig/Carpenter%20et%20al.%202018_data.xlsx?raw=true
        headers: 1
        infer_strategy: strings
        cast_strategy: strings
        remove_empty_rows: true
        skip_rows: []
        override_schema:
          missingValues:
            - ''
            - nd
        input_separator: ','
        sheet: physical data
        preserve_formatting: true
    - run: bcodmo_pipeline_processors.convert_date
      parameters:
        resources:
          - res1
        fields:
          - input_type: python
            inputs:
              - field: Date
                format: '%d-%b'
            input_timezone: UTC
            output_field: ISO_DateTime_UTC
            output_format: '%Y-%m-%dT%H:%M:%S%Ztest'
            output_timezone: UTC
            year: 2016
    - run: printer
      parameters:
        num_rows: 10
    - run: bcodmo_pipeline_processors.dump_to_path
      parameters:
        temporal_format_property: outputFormat
        out-path: /home/conrad/test
        force-format: true
        format: csv
        save_pipeline_spec: true
You'll need to use the bcodmo_processors repo and set `DPP_PROCESSOR_PATH` to `$PROCESSOR_REPO/bcodmo_processors`.
I created a branch called `fu-22-roll-examples` that you can use to see the issues I'm talking about.
The first commit on the branch ("set outputFormat to output_format") shows the first issue I described, where the value is validated with the old DATETIME_FORMAT even when `outputFormat` has been updated. The second commit ("Use DATETIME_FORMAT for row value before dump") sets the value in the row to the result of using DATETIME_FORMAT. The resulting csv correctly shows an output format with the `test` string at the end, but you can see from the printer processor that the values in the rows do not have `test` while the pipeline is running.
Let me know if you need something else @roll !
Thanks!
@cschloer I ran the example and I think I understand.
So, the reason why I used a custom format property (e.g. `outputFormat`) instead of the normal `format` is that internally `dpp` uses `format` for parsing. So I was only able to change the format for the last step, as it saves data to a file. The current behavior is expected and is exactly what I thought was discussed here (allowing a custom format for the dump step): https://github.com/BCODMO/frictionless-usecases/issues/19
BTW, have you tried changing `format` and the data values accordingly on the `convert_date` step (instead of setting `outputFormat`)?
OK - I didn't realize that the actual value would need to stay as `%Y-%m-%dT%H:%M:%SZ`. I still don't understand why: is the schema being validated while the pipeline is running? It seems like not; the validation error is happening in the `dump_to_path` step. But wasn't the change going to make `dump_to_path` only use the `temporal_format_property` to validate the format? Right now it seems to only use the `temporal_format_property` to determine what gets dumped out. What would be most useful is if it also validated the value with the `temporal_format_property` field rather than the `format` field.
I just tried updating the format key rather than the datetime key and it does not seem to end up in the datapackage. I have the following code:
def modify_datapackage(datapackage_):
    # `resources` (a ResourceMatcher) and `fields` come from the enclosing
    # processor's parameters
    dp_resources = datapackage_.get('resources', [])
    for resource_ in dp_resources:
        if resources.match(resource_['name']):
            datapackage_fields = resource_['schema']['fields']
            new_field_names = [f['output_field'] for f in fields]
            datapackage_fields = [
                f for f in datapackage_fields if f['name'] not in new_field_names
            ]
            new_fields = [{
                'name': f['output_field'],
                'type': 'datetime',
                'format': f'{f["output_format"]} CHANGING THE FORMAT HERE',
            } for f in fields]
            datapackage_fields += new_fields
            resource_['schema']['fields'] = datapackage_fields
    return datapackage_
This is an example where I set the output format to the default format of `%Y-%m-%dT%H:%M:%SZ` so that the pipeline wouldn't crash, but set the actual format in the datapackage in my code to that string plus the string `CHANGING THE FORMAT HERE`. When I looked at the datapackage.json after the dump, the `CHANGING THE FORMAT HERE` string did not end up in the final datapackage. I'm guessing this is related to what you previously said?
"format of date/times cannot be changed because it breaks parsing" (https://github.com/BCODMO/frictionless-usecases/issues/19#issuecomment-508993522)
@cschloer
I'm not sure how it works internally (cc @akariv) but it seems to stringify and parse datetime values on different steps, so `format` has to match.
BTW, consider this example:
temporal:
  title: temporal
  description: "temporal format"
  pipeline:
    - run: load
      parameters:
        from: 'temporal.csv'
        override_fields:
          date:
            outputFormat: '%m/%d/%Y'
    - ... steps where data values will be Python's datetime.datetime objects
    - run: dump_to_path
      parameters:
        out-path: 'output'
        pretty-descriptor: true
        temporal_format_property: outputFormat
After the parsing (e.g. during the load step), you can treat the `date` field's values as native Python date objects in all custom processors. I don't fully understand the reason to talk about textual date formats after parsing and before unparsing (dumping to the file).
E.g. in the `convert_date` processor you stringify values like `row_value = str(row_value)`. I'm curious why; isn't it possible to use the `datetime.datetime` API to e.g. update the timezone and update `outputFormat` accordingly to be used on the `dump_to_path` step?
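This suggestion can be sketched in plain stdlib Python (the three-hour source offset is just an illustration, matching the timedelta used in the test pipeline below; it is not from any BCO-DMO data):

```python
from datetime import datetime, timezone, timedelta

# A naive datetime, as produced by parsing during the load step
value = datetime(2000, 1, 1, 15, 0, 0)

# Attach an assumed source timezone and convert to UTC without ever
# round-tripping through strings
source_tz = timezone(timedelta(hours=3))
utc_value = value.replace(tzinfo=source_tz).astimezone(timezone.utc)

# Stringifying happens only once, at dump time, with the desired format
print(utc_value.strftime('%Y-%m-%dT%H:%M:%SZ'))  # 2000-01-01T12:00:00Z
```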
I created a test pipeline. It converts timezone-unaware datetime strings into UTC strings:
datapackage-pipeline.yml
datetime:
  title: datetime
  pipeline:
    - run: load
      parameters:
        from: 'datetime.csv'
        override_fields:
          datetime:
            type: 'datetime'
            outputFormat: '%Y-%m-%dT%H:%M:%SZ'
    - run: processors.process_datetime
    - run: dump_to_path
      parameters:
        out-path: 'output'
        pretty-descriptor: true
        temporal_format_property: outputFormat
datetime.csv
datetime,event
2000-01-01T15:00:00,start
2000-01-01T18:00:00,finish
output/datetime.csv
datetime,event
2000-01-01T12:00:00Z,start
2000-01-01T15:00:00Z,finish
output/datapackage.json
{
  "bytes": 73,
  "count_of_rows": 2,
  "hash": "7359ccbc02f2416dc344fbc8d24b5fc3",
  "name": "_",
  "profile": "data-package",
  "resources": [
    {
      "dialect": {
        "caseSensitiveHeader": false,
        "delimiter": ",",
        "doubleQuote": true,
        "header": true,
        "lineTerminator": "\r\n",
        "quoteChar": "\"",
        "skipInitialSpace": false
      },
      "dpp:streamedFrom": "datetime.csv",
      "dpp:streaming": true,
      "encoding": "utf-8",
      "format": "csv",
      "name": "datetime",
      "path": "datetime.csv",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "format": "%Y-%m-%dT%H:%M:%SZ",
            "name": "datetime",
            "type": "datetime"
          },
          {
            "format": "default",
            "name": "event",
            "type": "string"
          }
        ],
        "missingValues": [
          ""
        ]
      }
    }
  ]
}
processors.process_datetime
import logging
from dateutil import tz
from datetime import datetime, timedelta
from dataflows import Flow, update_resource
from dataflows.helpers import ResourceMatcher
from datapackage_pipelines.wrapper import ingest
from datapackage_pipelines.utilities.flow_utils import spew_flow


def process_resource(rows):
    for row in rows:
        # Parse datetime
        row['datetime'] = datetime.strptime(row['datetime'], '%Y-%m-%dT%H:%M:%S')
        # Convert to UTC
        row['datetime'] = row['datetime'] - timedelta(hours=3)
        # Return (we don't need to stringify it)
        yield row


def test_datetime(resources=None):
    def func(package):
        matcher = ResourceMatcher(resources, package.pkg)
        yield package.pkg
        for resource in package:
            if matcher.match(resource.res.name):
                yield process_resource(resource)
            else:
                yield resource
    return func


def flow(parameters):
    return Flow(
        test_datetime(
            resources=parameters.get('resources'),
        ),
    )


if __name__ == '__main__':
    with ingest() as ctx:
        spew_flow(flow(ctx.parameters), ctx)
Sorry for not following up here - the notification of your reply slipped through the cracks.
I think the key feature that the `convert_date` processor has is the ability to combine multiple fields into one datetime field. Each of those fields individually might not be a datetime and might in fact be arbitrarily difficult for a computer to interpret as a datetime, hence the need for a data manager's human input. (Imagine a field named `CRUISE` that is structured like `SHIP #5329 19`; it's not easy to figure out that the 19 is actually the year, but it's not crazy to imagine getting data like that.) So these can't actually be interpreted as datetimes during the load step, because the override_fields input is not sufficient.
I convert the row_value to str first in case the type has already been set to number (if it's a field with only Year, for example), so that it works properly when combining into a larger string for strptime purposes.
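The combining step can be sketched roughly like this (a hypothetical row, not the actual convert_date implementation; the field names and formats echo the pipeline example above):

```python
from datetime import datetime

# Hypothetical row mid-pipeline: 'Date' is a '%d-%b' string while 'Year'
# has already been cast to a number by an earlier step
row = {'Date': '16-Jul', 'Year': 2016}

# str() normalizes both fields so they can be joined into one parseable string
combined = f"{str(row['Date'])} {str(row['Year'])}"
parsed = datetime.strptime(combined, '%d-%b %Y')
print(parsed.isoformat())  # 2016-07-16T00:00:00
```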
OK, I will look into keeping the row value as a datetime throughout the pipeline. I think this will still be a problem in other processors that expect string types? Because when you do str(value) it will return the default string representation of a datetime. I need to think more about our usage of interacting with datetime fields within other processors; it might be okay to assume you will never run a find_replace, split_column, etc. processor on that field.
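The concern about str() is easy to see in isolation: it ignores any schema-specified format.

```python
from datetime import datetime

value = datetime(2019, 7, 16, 13, 29, 25)
# str() uses the default representation (ISO with a space separator),
# not whatever format the datapackage field declares
print(str(value))  # 2019-07-16 13:29:25
```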
Keeping the field as a datetime type seems to be working well. I still need to talk with Amber about what this means for data manager behavior in terms of not using temporal fields inside of string processors. I will close this issue when that is done.
Can you clarify the question? Are you talking about whether or not to do things like find-and-replace on a date column (vs a string type)?
Yes, exactly. I was planning to schedule a call with you but got caught up in submission stuff. Thoughts on that @adyork?
edit: to clarify the question further: shall I implement the change that allows us to dump dates in any format we want, assuming we don't need to use any processor that interacts with the data as a string?
For our ISO datetime UTC columns, after a type set from string to datetime, `2011-04-26T14:07:12Z` becomes `2011-04-26T14:07:12`. It loses the timezone component of our datetime-with-timezone. We have been keeping it as a string, not a datetime type, so we can preserve the timezone part.
It would be best for us (BCO-DMO) if you could support timezones within datetime, specifically the format I showed above, yyyy-mm-ddTHH:MM(:SS)Z, like 2011-04-26T14:07:12Z.
We have this format as part of our SOP in our office and use it for import into our ERDDAP data server.
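For reference, plain Python can keep the timezone if the value is parsed as timezone-aware rather than naive (a sketch only; `%z` accepts a literal `Z` as UTC in Python 3.7+, which is separate from what the pipeline's caster currently does):

```python
from datetime import datetime

raw = '2011-04-26T14:07:12Z'

# %z accepts 'Z' as UTC (Python 3.7+), producing an aware datetime
aware = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%S%z')
print(aware.tzinfo)  # UTC

# A naive cast (the behavior described above) drops the offset, so the
# 'Z' has to be re-added at dump time via the output format string
print(aware.strftime('%Y-%m-%dT%H:%M:%SZ'))  # 2011-04-26T14:07:12Z
```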
In conclusion, for us, supporting ISO datetime with Z would do the job, but you should consider being more flexible in supporting date, time, and datetime formats in general. I know a lot of scientists and systems that use a variety of formats for various reasons. Even just counting valid ISO formats, there are a TON of them, so I understand that it is a challenge. The absolute ideal case would be to maintain a specified format along with the types.
@roll I know you have been looking at different formats during the Excel preserve-format issues, so this seems a bit related. I haven't tested a ton of Excel imports with various formats yet, so I am still figuring out what can be preserved with preserve_format. It's on my todo list.