USF-IMARS / imars-etl

:cloud: Tools for `extract` and `load` for IMaRS ETL (Extract, Transform, Load) operations

filepath datetime parse not working well... #14

Closed 7yl4r closed 5 years ago

7yl4r commented 6 years ago

I am trying to rework the metadata merging so it all happens at once, and I have identified several cases of metadata being improperly parsed from filepaths. In some cases I might be able to fix this by simply modifying _STRFTIME_MAP, but in others (see the first example below) it looks like the parse package is ignoring the width part of the format string, i.e., it is reading 6 digits when explicitly told to look for 4.

ERROR: Parse fancy filepath reads args & date from path
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tylar/imars-etl/imars_etl/filepath/parse_filepath_test.py", line 93, in test_parse_args_and_date_from_filename
...
imars_etl.exceptions.MetadataValidationError.MetadataValidationError: datetime from strptime does not match parsed dt_* vars
    datetime: 2022-05-03 07:00:11
    dt_* containing dict :{'test_arg': 'testyTestArg', 'dt_j': 3, 'dt_Y': 202212, 'dt_S': 1, 'dt_H': 71}
======================================================================
ERROR: parse_filepath on shx_wv2_p1bs *_PIXEL_SHAPE.shx
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/tylar/imars-etl/imars_etl/filepath/parse_filepath_test.py", line 50, in test_parse_filename_shx_wv2_p1bs
...
imars_etl.exceptions.MetadataValidationError.MetadataValidationError: datetime from strptime does not match parsed dt_* vars
    datetime: 2016-02-12 16:25:18
    dt_* containing dict :{'dt_H': 5, 'dt_b': 'FEB1216', 'idNumber': '057488585010_01', 'passNumber': '003', 'dt_d': 2, 'dt_S': 8, 'dt_M': 1, 'dt_y': 16}
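
For reference, the kind of consistency check that raises this error can be sketched roughly as below. This is only an illustration: `mismatched_dt_fields` is a hypothetical helper, not the actual imars_etl validation code, and the mapping from `dt_*` keys to datetime attributes is my assumption based on the strftime directive names.

```python
from datetime import datetime

# Hypothetical helper, NOT part of imars_etl: maps each dt_* key to the
# value the datetime implies, then reports keys whose parsed value differs.
def mismatched_dt_fields(dt, parsed):
    expected = {
        "dt_Y": dt.year,
        "dt_y": dt.year % 100,
        "dt_m": dt.month,
        "dt_d": dt.day,
        "dt_j": dt.timetuple().tm_yday,  # day of year
        "dt_H": dt.hour,
        "dt_M": dt.minute,
        "dt_S": dt.second,
    }
    # Non-datetime keys such as 'test_arg' are skipped.
    return {
        key: (value, expected[key])
        for key, value in parsed.items()
        if key in expected and value != expected[key]
    }

# Values copied from the first traceback above.
dt = datetime(2022, 5, 3, 7, 0, 11)
parsed = {"test_arg": "testyTestArg", "dt_j": 3, "dt_Y": 202212,
          "dt_S": 1, "dt_H": 71}
mismatches = mismatched_dt_fields(dt, parsed)
print(mismatches)  # each entry is (parsed value, value implied by datetime)
```

Run on the first traceback, every parsed `dt_*` key disagrees with the strptime result, which is consistent with digits being assigned to the wrong fields rather than with a single bad mapping entry.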
7yl4r commented 6 years ago

Another example:

'filepath': '/srv/imars-objects/airflow_tmp/processing_modis_aqua_pass_gom_20180803T190000_l2_file'

'load_format': '{dag_id}_%Y%m%dT%H%M%S_{tag}'

parsed from filename:

{'dt_d': 3, 'dt_H': 1900, 'dt_Y': 201808, 'dt_m': 0, 'dag_id': 'processing_modis_aqua_pass_gom', 'dt_S': 0, 'tag': 'l2_file', 'dt_M': 0}
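
For comparison, here is a rough sketch of what a correct parse of this basename should produce. It does not use the parse package at all: the regex with fixed-width digit groups and the `parse_basename` helper are hypothetical stand-ins for the `'{dag_id}_%Y%m%dT%H%M%S_{tag}'` load_format, assuming the timestamp is always 8 digits, a `T`, then 6 digits.

```python
import re
from datetime import datetime

# Hypothetical fixed-width equivalent of '{dag_id}_%Y%m%dT%H%M%S_{tag}'.
# \d{8} and \d{6} pin the timestamp width, so no field can steal digits.
PATTERN = re.compile(r"(?P<dag_id>.+?)_(?P<stamp>\d{8}T\d{6})_(?P<tag>.+)$")

def parse_basename(basename):
    match = PATTERN.match(basename)
    if match is None:
        raise ValueError("basename does not match pattern: " + basename)
    fields = match.groupdict()
    # Let strptime split the fixed-width timestamp into its components.
    fields["datetime"] = datetime.strptime(fields.pop("stamp"), "%Y%m%dT%H%M%S")
    return fields

result = parse_basename("processing_modis_aqua_pass_gom_20180803T190000_l2_file")
print(result)
```

With the widths pinned, the year stays at 4 digits and the hour at 2, unlike the `dt_Y: 201808` and `dt_H: 1900` values above.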
7yl4r commented 6 years ago

I think this is the issue behind the bug mentioned here.

7yl4r commented 6 years ago

Debug output that looks suspicious:

parse: DEBUG: format 
'wv2_{dt_Y:4d}_{dt_m:2d}_{dt_d:2d}T{dt_H:2d}{dt_M:2d}{dt_S:2d}_{area_short_name}_{order_id:9d}_10_0.zip' 
->
'wv2_ *(?P<dt_Y>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+)_ *(?P<dt_m>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+)_ *(?P<dt_d>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+)T *(?P<dt_H>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+) *(?P<dt_M>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+) *(?P<dt_S>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+)_(?P<area_short_name>.+?)_ *(?P<order_id>[-+ ]?\\d+|0[xX][0-9a-fA-F]+|\\d+|0[bB][01]+|0[oO][0-7]+)_10_0\\.zip'
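
What looks wrong here is that every `:4d` / `:2d` field has turned into an unbounded `\d+`, so the width is gone from the regex. A small sketch with plain `re` (simplified patterns and a made-up `T162518` sample string, not the exact regex above) shows how adjacent unbounded digit groups split a run of digits:

```python
import re

sample = "T162518"  # made-up sample: HHMMSS with no separators

# Width dropped: each field is an unbounded \d+, as in the debug output.
greedy = re.fullmatch(r"T(?P<dt_H>\d+)(?P<dt_M>\d+)(?P<dt_S>\d+)", sample)

# Width kept: each field is exactly two digits.
fixed = re.fullmatch(r"T(?P<dt_H>\d{2})(?P<dt_M>\d{2})(?P<dt_S>\d{2})", sample)

print(greedy.groupdict())  # first group takes as many digits as it can
print(fixed.groupdict())
```

The first group is greedy and only gives back the single digit each later group needs, which is the same shape as the bad values in the tracebacks (an oversized first field followed by single-digit fields).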
7yl4r commented 5 years ago

Here's a minimal, more easily reproducible example:

from parse import parse
fmt_str = "W_{var_Y:4d}{var_m:2d}_000.xml"
in_str = "W_201301_000.xml"
parse(fmt_str, in_str)

<Result () {'var_Y': 20130, 'var_m': 1}>

Fixed by updating the parse package to 1.9+.