BCODMO / frictionless-usecases

Example datasets submitted to BCO-DMO
MIT License

Supporting time zone in datetime type #22

Closed adyork closed 4 years ago

adyork commented 5 years ago

For our ISO datetime UTC columns, a value like 2011-04-26T14:07:12Z becomes 2011-04-26T14:07:12 after the type is set from string to datetime.

It loses the timezone component of our timezone-aware datetimes. We have been keeping these columns as strings rather than the datetime type so we can preserve the timezone part.

It would be best for us (BCO-DMO) if you could support timezones within datetime, specifically the format shown above, yyyy-mm-ddTHH:MM(:SS)Z, e.g. 2011-04-26T14:07:12Z.

We have this format as part of our office SOP and use it for import into our ERDDAP data server.

In conclusion, for us, supporting ISO datetime with Z would do the job, but you should consider being more flexible in supporting date, time, and datetime formats in general. I know a lot of scientists and systems that use a variety of formats for various reasons. Even among valid ISO formats alone there are a ton of variants, so I understand that it is a challenge. The ideal case would be to maintain a specified format along with the types, as sketched below.
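
To illustrate what I mean by maintaining a specified format along with the types: a field descriptor could carry the pattern itself (the Table Schema spec already allows strptime-style patterns for datetime fields, though how the tooling handles them is exactly what is at issue here). A sketch, written as a Python dict:

field = {
    'name': 'ISO_DateTime_UTC',
    'type': 'datetime',
    'format': '%Y-%m-%dT%H:%M:%SZ',  # timezone-aware ISO pattern kept alongside the type
}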

@roll I know you have been looking at different formats during the Excel preserve-format issues, so this seems related. I haven't tested many Excel imports with various formats yet, so I am still figuring out what can be preserved with preserve_format. It's on my to-do list.

cschloer commented 5 years ago

Hey @lwinfree @roll, is there any update on this issue in particular? I think this is a rather important one for us. Specifically, the issue is that dump_to_path forces a particular datetime format. A quick fix for us has been to fork dataflows and change that constant to what we need (%Y-%m-%dT%H:%M:%SZ), but that isn't a great long-term solution.

Is there any way the dumpers can output the datetime fields using the format specified within the datapackage?
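
To be concrete, here is the behavior we are hoping for, sketched in plain Python rather than the actual dataflows API (the field dict is just an illustration of a schema field descriptor):

from datetime import datetime

field = {'name': 'ISO_DateTime_UTC', 'type': 'datetime', 'format': '%Y-%m-%dT%H:%M:%SZ'}
value = datetime(2011, 4, 26, 14, 7, 12)

# what we'd like the dumper to do: render the value using the field's own format
rendered = value.strftime(field['format'])  # '2011-04-26T14:07:12Z'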

roll commented 5 years ago

Hi @cschloer, I was on holiday. Now I'm back and I will prioritize this one.

roll commented 5 years ago

@cschloer So this one is basically a subset of #19. Could we please return to #19 for a second? I actually wanted you to take a look and brainstorm a possible API solution on your end for custom date/time formats.

roll commented 5 years ago

To be re-checked after #19 is done

roll commented 5 years ago

It's done in dataflows@0.0.60 (please try https://github.com/datahq/dataflows/pull/100#issuecomment-534428744)

adyork commented 5 years ago

Excited to try this out, thanks!

cschloer commented 4 years ago

Hey @roll, I'm still running into problems with this update. Maybe I just misinterpreted how you were going to implement the solution.

Basically, say I have a datetime field with output_format=%Y-%m-%dT%H:%M:%SZtest (output_format is a parameter in our custom convert_date processor). The convert_date processor also sets the outputFormat key of the field in the datapackage to %Y-%m-%dT%H:%M:%SZtest. The test part at the end is just to make it different from the normal datetime format %Y-%m-%dT%H:%M:%S. The custom convert_date processor first uses this output_format with the Python datetime package to set the value of the field to the output of date_obj.strftime(output_format). Thus the value in the resource might look something like 2019-07-16T13:29:25Ztest.

At this point, when I try to use dump_to_path, even with temporal_format_property=outputFormat, I get the error: Field "ISO_DateTime_UTC" can't cast value "2019-07-16T13:29:25Ztest" for type "datetime" with format "%Y-%m-%dT%H:%M:%S". It seems the dump_to_path caster/validator is still trying to convert the datetime field to the default format (%Y-%m-%dT%H:%M:%S) before it changes the format value to %Y-%m-%dT%H:%M:%SZtest. If, in the convert_date step, I instead set the value using strftime with the default format (%Y-%m-%dT%H:%M:%S), it works and properly dumps the value as 2019-07-16T13:29:25Ztest.

This could theoretically work, since the user rarely sees the value inside the row outside of a dump_to_path step. But it won't always work. For example, this is contrived but possible: what if the user uses set_type to set the datetime field to string? It looks like the value in the row is 2019-07-16T13:29:25Ztest, but it is actually 2019-07-16T13:29:25, so when it gets converted to a string, the value is again 2019-07-16T13:29:25. More realistic is the possibility of running the custom convert_date processor on the same field again if, for example, multiple date fields in different formats are expected. The user might expect, because they see the value 2019-07-16T13:29:25Ztest, to write %Y-%m-%dT%H:%M:%SZtest for the input_format parameter, when in fact they should be writing %Y-%m-%dT%H:%M:%S. Furthermore, when using a field-combining processor (such as add_computed_field), the user will get the value in the row (which is in the default format) rather than what they see from a dump_to_path.
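
Here is a minimal standalone reproduction of the mismatch as I understand it (my assumption being that the dump step re-parses row values with the default format before applying outputFormat):

from datetime import datetime

DEFAULT_FORMAT = '%Y-%m-%dT%H:%M:%S'      # the default datetime format mentioned above
output_format = '%Y-%m-%dT%H:%M:%SZtest'  # the custom outputFormat set by convert_date

value = datetime(2019, 7, 16, 13, 29, 25)

# what convert_date currently writes into the row:
as_string = value.strftime(output_format)  # '2019-07-16T13:29:25Ztest'

# what the dump step appears to try first - re-parsing with the default format:
datetime.strptime(as_string, DEFAULT_FORMAT)
# ValueError: unconverted data remains: Ztest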

Do you understand the issue I am having here, @roll? Is there any workaround to ensure that the dump_to_path processor does not try to cast with the default format before it switches to using the temporal_format_property?

roll commented 4 years ago

@cschloer Could you please share some testing pipeline so I can play with it?

cschloer commented 4 years ago

OK I think this should be enough:

date-test:
  title: date-test
  description: ''
  version: v1.0.4
  pipeline:
  - run: bcodmo_pipeline_processors.load
    cache: false
    parameters:
      format: xlsx
      from: https://github.com/BCODMO/frictionless-usecases/blob/master/usecases/754644_Carpenter2018_physical_data/orig/Carpenter%20et%20al.%202018_data.xlsx?raw=true
      headers: 1
      infer_strategy: strings
      cast_strategy: strings
      remove_empty_rows: true
      skip_rows: []
      override_schema:
        missingValues:
        - ''
        - nd
      input_separator: ','
      sheet: physical data
      preserve_formatting: true
  - run: bcodmo_pipeline_processors.convert_date
    parameters:
      resources:
      - res1
      fields:
      - input_type: python
        inputs:
        - field: Date
          format: '%d-%b'
        input_timezone: UTC
        output_field: ISO_DateTime_UTC
        output_format: '%Y-%m-%dT%H:%M:%S%Ztest'
        output_timezone: UTC
        year: 2016
  - run: printer
    parameters:
      num_rows: 10
  - run: bcodmo_pipeline_processors.dump_to_path
    parameters:
      temporal_format_property: outputFormat
      out-path: /home/conrad/test
      force-format: true
      format: csv
      save_pipeline_spec: true

You'll need to use the bcodmo_processors repo and set DPP_PROCESSOR_PATH to $PROCESSOR_REPO/bcodmo_processors.

I created a branch called fu-22-roll-examples that you can use to see the issues I'm talking about.

The first commit on the branch (set outputFormat to output_format) shows the first issue I described, where the value is validated with the old DATETIME_FORMAT even when outputFormat has been updated. The second commit (Use DATETIME_FORMAT for row value before dump) sets the value in the row using DATETIME_FORMAT. The resulting CSV correctly shows the output format with the test string at the end, but you can see from the printer processor that the values in the rows do not have test while the pipeline is running.

Let me know if you need something else @roll !

roll commented 4 years ago

Thanks!

roll commented 4 years ago

@cschloer I ran the example and I think I understand.

So, the reason I used a custom format property (e.g. outputFormat) instead of the normal format is that internally dpp uses format for parsing. So I was able to change format only for the last step, as it saves data to a file. The current behavior is expected and is exactly what I thought was discussed here - https://github.com/BCODMO/frictionless-usecases/issues/19 (allowing a custom format for the dump step).

BTW, have you tried changing format and the data values accordingly in the convert_date step (instead of setting outputFormat)?

cschloer commented 4 years ago

OK - I didn't realize that the actual value would need to stay as %Y-%m-%dT%H:%M:%SZ. I still don't understand why: is the schema being validated while the pipeline is running? It seems not - the validation error happens in the dump_to_path step. But wasn't the change going to make dump_to_path use only the temporal_format_property to validate the format? Right now it seems to use the temporal_format_property only to determine what gets dumped out. What would be most useful is if it also validated the value against the temporal_format_property field rather than the format field.

I just tried updating the format key rather than the outputFormat key and it does not seem to end up in the datapackage. I have the following code:

# `resources` (a ResourceMatcher) and `fields` (the processor's fields parameter)
# are defined elsewhere in the convert_date processor.
def modify_datapackage(datapackage_):
    dp_resources = datapackage_.get('resources', [])
    for resource_ in dp_resources:
        if resources.match(resource_['name']):
            datapackage_fields = resource_['schema']['fields']
            # drop any existing descriptors for the output fields we are about to add
            new_field_names = [f['output_field'] for f in fields]
            datapackage_fields = [
                f for f in datapackage_fields if f['name'] not in new_field_names
            ]
            # re-add the output fields as datetime, with a deliberately modified format
            new_fields = [{
                'name': f['output_field'],
                'type': 'datetime',
                'format': f'{f["output_format"]} CHANGING THE FORMAT HERE',
            } for f in fields]
            datapackage_fields += new_fields
            resource_['schema']['fields'] = datapackage_fields
    return datapackage_

This is an example where I set the output format to the default format of %Y-%m-%dT%H:%M:%SZ so that the pipeline wouldn't crash, but set the actual format in the datapackage in my code to that string plus the string CHANGING THE FORMAT HERE. When I looked at the datapackage.json after the dump, the CHANGING THE FORMAT HERE string did not end up in the final datapackage. I'm guessing this is related to what you previously said?

format of date/times cannot be changed because it breaks parsing https://github.com/BCODMO/frictionless-usecases/issues/19#issuecomment-508993522

roll commented 4 years ago

@cschloer I'm not sure how it works internally (cc @akariv), but it seems to stringify and parse datetime values at different steps, so the format has to match.

BTW, consider this example:

temporal:
  title: temporal
  description: "temporal format"
  pipeline:

  - run: load
    parameters:
      from: 'temporal.csv'
      override_fields:
        date:
          outputFormat: '%m/%d/%Y'

  - ... steps where data values will be Python's datetime.datetime objects

  - run: dump_to_path
    parameters:
      out-path: 'output'
      pretty-descriptor: true
      temporal_format_property: outputFormat

After the parsing (e.g. during the load step), you can treat the date field's values as native Python datetime objects in all custom processors. I don't fully understand the reason to work with textual date formats after parsing and before unparsing (dumping to the file).

E.g. in the convert_date processor you stringify values like row_value = str(row_value). I'm curious why - isn't it possible to use the datetime.datetime API to, e.g., update the timezone, and then update outputFormat accordingly so it is used in the dump_to_path step?
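
Something along these lines, as a rough sketch (the field value and input timezone are just placeholders):

from datetime import datetime
from dateutil import tz

value = datetime(2019, 7, 16, 13, 29, 25)             # naive value as parsed during load
value = value.replace(tzinfo=tz.gettz('US/Eastern'))  # attach the known input timezone
value = value.astimezone(tz.gettz('UTC'))             # convert to UTC - still a datetime object
# no stringifying needed; dump_to_path applies outputFormat when writing the file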

roll commented 4 years ago

I created a test pipeline. It converts timezone-unaware datetime strings into UTC strings:

datapackage-pipeline.yml

datetime:
  title: datetime
  pipeline:

  - run: load
    parameters:
      from: 'datetime.csv'
      override_fields:
        datetime:
          type: 'datetime'
          outputFormat: '%Y-%m-%dT%H:%M:%SZ'

  - run: processors.process_datetime

  - run: dump_to_path
    parameters:
      out-path: 'output'
      pretty-descriptor: true
      temporal_format_property: outputFormat

datetime.csv

datetime,event
2000-01-01T15:00:00,start
2000-01-01T18:00:00,finish

output/datetime.csv

datetime,event
2000-01-01T12:00:00Z,start
2000-01-01T15:00:00Z,finish

output/datapackage.json

{
  "bytes": 73,
  "count_of_rows": 2,
  "hash": "7359ccbc02f2416dc344fbc8d24b5fc3",
  "name": "_",
  "profile": "data-package",
  "resources": [
    {
      "dialect": {
        "caseSensitiveHeader": false,
        "delimiter": ",",
        "doubleQuote": true,
        "header": true,
        "lineTerminator": "\r\n",
        "quoteChar": "\"",
        "skipInitialSpace": false
      },
      "dpp:streamedFrom": "datetime.csv",
      "dpp:streaming": true,
      "encoding": "utf-8",
      "format": "csv",
      "name": "datetime",
      "path": "datetime.csv",
      "profile": "tabular-data-resource",
      "schema": {
        "fields": [
          {
            "format": "%Y-%m-%dT%H:%M:%SZ",
            "name": "datetime",
            "type": "datetime"
          },
          {
            "format": "default",
            "name": "event",
            "type": "string"
          }
        ],
        "missingValues": [
          ""
        ]
      }
    }
  ]
}

processors.process_datetime

from datetime import datetime, timedelta
from dataflows import Flow
from dataflows.helpers import ResourceMatcher
from datapackage_pipelines.wrapper import ingest
from datapackage_pipelines.utilities.flow_utils import spew_flow

def process_resource(rows):
    for row in rows:
        # Parse datetime
        row['datetime'] = datetime.strptime(row['datetime'], '%Y-%m-%dT%H:%M:%S')
        # Convert to UTC
        row['datetime'] = row['datetime'] - timedelta(hours=3)
        # Return (we don't need to stringify it)
        yield row

def test_datetime(resources=None):
    def func(package):
        matcher = ResourceMatcher(resources, package.pkg)
        yield package.pkg
        for resource in package:
            if matcher.match(resource.res.name):
                yield process_resource(resource)
            else:
                yield resource
    return func

def flow(parameters):
    return Flow(
        test_datetime(
            resources=parameters.get('resources'),
        ),
    )

if __name__ == '__main__':
    with ingest() as ctx:
        spew_flow(flow(ctx.parameters), ctx)

cschloer commented 4 years ago

Sorry for not following up here - the notification of your reply slipped through the cracks.

I think the key feature of the convert_date processor is the ability to combine multiple fields into one datetime field. Each of those fields individually might not be a datetime and might in fact be arbitrarily difficult for a computer to interpret as a datetime, hence the need for a data manager's human input. (Imagine a field named CRUISE that is structured like SHIP #5329 19 - it's not easy for a machine to figure out that the 19 is actually the year, but it's not crazy to imagine getting data like that; see the sketch below.) So these fields can't actually be interpreted as datetimes during the load step, because the override_fields input is not sufficient.
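
A hypothetical sketch of that kind of combination (the field names, values, and formats are made up, but this is the sort of thing convert_date does with the data manager's input):

from datetime import datetime

row = {'CRUISE': 'SHIP #5329 19', 'Month': '7', 'Day': '16'}

# the data manager knows the year is the trailing token of CRUISE
year = row['CRUISE'].split()[-1]                  # '19'
combined = f"{year} {row['Month']} {row['Day']}"  # '19 7 16'
value = datetime.strptime(combined, '%y %m %d')   # datetime(2019, 7, 16, 0, 0)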

I convert the row_value to str first in case the type has already been set to number (if it's a field with only a year, for example), so that it works properly when combining into a larger string for strptime purposes.

OK, I will look into keeping the row value as a datetime throughout the pipeline. I think this will still be a problem in other processors that expect string types? Because when you do str(value) it will return the default string representation of a datetime (see below). I need to think more about our usage of interacting with datetime fields within other processors - it might be okay to assume you will never run a find_replace, split_column, etc. processor on that field.
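
A quick illustration of the str() concern:

from datetime import datetime

value = datetime(2019, 7, 16, 13, 29, 25)
str(value)  # '2019-07-16 13:29:25' - Python's default representation,
            # not whatever outputFormat says the dumped value should look like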

cschloer commented 4 years ago

Keeping the field as a datetime type seems to be working well. I still need to talk with Amber about what this means for data manager behavior in terms of not using temporal fields inside of string processors. I will close this issue when that is done.

adyork commented 4 years ago

Can you clarify the question? Are you talking about whether or not to do things like find-and-replace on a date column (vs. a string column)?

cschloer commented 4 years ago

Yes, exactly. I was planning to schedule a call with you but got caught up in submission stuff. Thoughts on that, @adyork?

edit: to clarify the question further: shall I implement the change that allows us to dump dates in any format we want, assuming we don't need to use any processors that interact with the data as a string?