enram / vptstools

Python library to transfer and convert vertical profile time series data
https://enram.github.io/vptstools/
MIT License

Add support for source_file and other schema updates #33

Closed stijnvanhoey closed 1 year ago

stijnvanhoey commented 1 year ago

See https://github.com/enram/vpts-csv/pull/42. Apart from removing the unit test checks for the no-longer-required fields and the lat/lon switch, the main focus is the addition of the 'source_file' field.

The changes make it possible to provide a (custom) source_file for the vp and vpts functions (the main user API functions):

vp

When using the vp function without the 'source_file' parameter, the file name itself is used as source_file:

path_h5 = "../data/raw/baltrad/silis_vp_20221114T190500Z_0xb.h5"
df_vp = vp(path_h5, "v1")
     radar  datetime              height  source_file
0    silis  2022-11-14T19:05:00Z  0       silis_vp_20221114T190500Z_0xb.h5
1    silis  2022-11-14T19:05:00Z  200     silis_vp_20221114T190500Z_0xb.h5
...  ...    ...                   ...     ...

It is however possible to overwrite the source_file default using either a str or a callable:

path_h5 = "../data/raw/baltrad/silis_vp_20221114T190500Z_0xb.h5"
df_vp = vp(path_h5, "v1", source_file="DUMMY")
     radar  datetime              height  source_file
0    silis  2022-11-14T19:05:00Z  0       DUMMY
1    silis  2022-11-14T19:05:00Z  200     DUMMY
...  ...    ...                   ...     ...

from pathlib import Path

path_h5 = "../data/raw/baltrad/silis_vp_20221114T190500Z_0xb.h5"
df_vp = vp(path_h5, "v1", source_file=lambda x: "s3://custom/path/" + Path(x).name)
     radar  datetime              height  source_file
0    silis  2022-11-14T19:05:00Z  0       s3://custom/path/silis_vp_20221114T190500Z_0xb.h5
1    silis  2022-11-14T19:05:00Z  200     s3://custom/path/silis_vp_20221114T190500Z_0xb.h5
...  ...    ...                   ...     ...

The callable option is mainly there to support custom 'source_file' values when using vpts, which takes multiple file paths as input.
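The three cases shown above (default, str, callable) could be resolved along the following lines. This is only a sketch: the helper name `resolve_source_file` is hypothetical and is not part of vptstools, and treating a non-callable value as a fixed string is an assumption.

```python
from pathlib import Path


def resolve_source_file(file_path, source_file=None):
    """Return the source_file value for a single file path.

    Hypothetical helper for illustration only, not the vptstools implementation.
    """
    if source_file is None:
        # Default: use the file name itself
        return Path(file_path).name
    if callable(source_file):
        # Callable: derive the value from the file path
        return source_file(file_path)
    # Anything else is treated as a fixed value (assumption)
    return str(source_file)


path_h5 = "../data/raw/baltrad/silis_vp_20221114T190500Z_0xb.h5"
print(resolve_source_file(path_h5))
print(resolve_source_file(path_h5, "DUMMY"))
print(resolve_source_file(path_h5, lambda x: "s3://custom/path/" + Path(x).name))
```

Accepting a callable keeps the parameter usable when many file paths are processed at once, since a single fixed string cannot distinguish between files.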

vpts

By default, vpts uses the file name itself as source_file (see the vp section) by applying a default callable to each of the file paths. Passing a custom callable overrides this and makes custom source_file constructions possible, e.g. for the S3 service that runs daily:

from vptstools.vpts import OdimFilePath

def convert_to_source(file_path):
    """Translate a file_path into the full s3 url representation for baltrad data stored in the aloft bucket"""
    return OdimFilePath.from_file_name(file_path, source="baltrad").s3_url_h5("aloft")

df_vpts = vpts(file_paths, "v1", source_file=convert_to_source)
     radar  datetime              height  source_file
0    bejab  2022-11-12T22:45:00Z  0       s3://aloft/baltrad/hdf5/bejab/2022/11/12/bejab_vp_20221112T225000Z_0x9.h5
1    bejab  2022-11-12T22:45:00Z  200     s3://aloft/baltrad/hdf5/bejab/2022/11/12/bejab_vp_20221112T225000Z_0x9.h5
...  ...    ...                   ...     ...
0    bewid  2022-11-13T02:30:00Z  0       s3://aloft/baltrad/hdf5/bewid/2022/11/13/bewid_vp_20221113T023500Z_0xb.h5
1    bewid  2022-11-13T02:30:00Z  200     s3://aloft/baltrad/hdf5/bewid/2022/11/13/bewid_vp_20221113T023500Z_0xb.h5
...  ...    ...                   ...     ...

In the daily service, the source and the bucket name are also abstracted (they come from the inventory file), but the example above shows the implementation, which can be used both within and outside the scope of the daily service.
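As a rough illustration of abstracting the source and bucket name, a small factory could return the callable. The sketch below does not use OdimFilePath: it is a plain-Python stand-in, the name `make_source_converter` is hypothetical, and both the file name layout and the S3 URL layout are assumptions inferred from the example output above.

```python
import re
from datetime import datetime
from pathlib import Path


def make_source_converter(source, bucket):
    """Build a callable mapping a file path to an S3 URL (hypothetical helper)."""

    def convert_to_source(file_path):
        file_name = Path(file_path).name
        # Assumed file name layout: <radar>_vp_<YYYYMMDD>T<time>Z_<suffix>.h5
        match = re.match(r"([^_]+)_vp_(\d{8})T", file_name)
        if match is None:
            raise ValueError(f"Unexpected file name: {file_name}")
        radar = match.group(1)
        date = datetime.strptime(match.group(2), "%Y%m%d")
        # Assumed URL layout: s3://<bucket>/<source>/hdf5/<radar>/<yyyy>/<mm>/<dd>/<file name>
        return f"s3://{bucket}/{source}/hdf5/{radar}/{date:%Y/%m/%d}/{file_name}"

    return convert_to_source


convert = make_source_converter(source="baltrad", bucket="aloft")
print(convert("bejab_vp_20221112T225000Z_0x9.h5"))
```

The returned function takes a single file path, so it can be passed directly as the source_file argument.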

stijnvanhoey commented 1 year ago

convert_to_source and not the file name itself cf. vp()

The default _convert_to_source callback effectively does this for each of the file names passed to vpts: because vpts takes a list/generator/... of file names as input, it needs to do the translation for each of these files. Hence, when the user does not provide a callable, _convert_to_source is used by default:

https://github.com/enram/vptstools/blob/5e1919a71d3d9c27ee1aa2e0b900c85a44777a6d/src/vptstools/vpts.py#L534-L536

The user can overwrite this default callable (e.g. the service for the aloft S3 bucket does this to include the S3 URL).
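The per-file translation can be pictured roughly as follows. This is a sketch under assumptions, not the linked vptstools code: `_file_name_only` stands in for the default callback and `source_files_for` is a hypothetical helper.

```python
from pathlib import Path


def _file_name_only(file_path):
    """Assumed default behaviour: reduce a file path to its file name."""
    return Path(file_path).name


def source_files_for(file_paths, source_file=None):
    """Apply the callable to each file path (hypothetical helper, not vptstools code)."""
    convert = _file_name_only if source_file is None else source_file
    # file_paths may be a list, a generator or any other iterable
    return [convert(file_path) for file_path in file_paths]


paths = [
    "data/bejab_vp_20221112T225000Z_0x9.h5",
    "data/bewid_vp_20221113T023500Z_0xb.h5",
]
print(source_files_for(iter(paths)))
print(source_files_for(paths, source_file=lambda p: "s3://custom/path/" + Path(p).name))
```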