Closed aulemahal closed 1 year ago
If you could use this PR to also address #152, that would be great! I wrote the issue, but it originally came from @mccrayc, so you can check with him for the details.
Looks good to me! The new patterns seem much more intuitive and clean.
Last commit made a few changes. I tested the new code with ouranos_data_catalogs and it made a few bugs appear:
- `_parse_from_zarr` can now read the time coordinate. Before, xarray was needed, which was slowing down the process a lot.
- [...] `split_dataset`, this is to be implemented in another PR.
- [...] `parse_directory` to remove entries where `date_start` is a date, but `frequency` is "fx".

And I guess I now need to add tests!
+12% coverage :muscle:
Pull Request Checklist:
- [...] `number`) and pull request (:pull:`number`) has been added

What kind of change does this PR introduce?
- [...] `parse_directory`) to a new `catutils.py` module
- [...] `os.walk` instead of subprocess calling `find`. The function should now be platform independent and not Linux-specific.
- [...] `parse` instead of `intake.source.utils.reverse_format`. It allows some interesting features, like format specifiers.
- [...] `cvs`
- [...] `build_path`. Schema is `data/file_schema.yml`.
- `date_parser` has moved to `utils.py`.

Does this PR introduce a breaking change?
Yes.
- [...] `parse_directory` are now specified differently. See doc and example below.
- [...] `globpattern`. The extension is parsed from the patterns. Calls can now mix different extensions. A new `dirglob` arg takes care of the "folder filtering" feature.
- `parse_directory` does not use `dask` anymore. I believe the code is now easier to understand and speed is not affected too much. Parsing a local filesystem is not supposed to be faster in parallel, but we often use NFS-mounted disks, where this is less true. Testing on doris, parsing the full `/datasets/simulation/raw` folder takes the same time with the new and the old versions. See note below on "code complexity".

Other information:
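To illustrate the `os.walk`-based traversal and the "folder filtering" idea mentioned above, here is a minimal standard-library sketch. It is not the code of this PR: the function name `walk_files` and its `extensions` argument are hypothetical, and the way `dirglob` is applied (as an `fnmatch` pattern on each folder path) is an assumption made for the example.

```python
import fnmatch
import os


def walk_files(root, extensions, dirglob=None):
    """Yield files under `root` whose extension is in `extensions`.

    Hypothetical helper. If `dirglob` is given, only files located in
    folders whose path matches that glob are yielded (assumed semantics).
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        # Folder filtering: skip the files of folders that do not match,
        # but keep descending, since a subfolder may still match.
        if dirglob is not None and not fnmatch.fnmatch(dirpath, dirglob):
            continue
        for name in sorted(filenames):
            # Several extensions can be mixed in a single call.
            if os.path.splitext(name)[1] in extensions:
                yield os.path.join(dirpath, name)
```

Because it relies only on `os.walk`, `os.path` and `fnmatch`, a traversal like this does not depend on an external `find` binary.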
Example for the new patterns:
The "variable" field accepts underscores, so it uses the "_" format specifier. No more need to specify each part of the filename if only the last one is needed. Here the "DATES" special field can catch single dates or bounds.
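The behaviour described above can be sketched with the standard library alone. This is a stand-in built on `re`, not the `parse` package and not the PR's implementation: the helper names, the regular expressions chosen for each field kind, the pattern string and the file names are all assumptions made for illustration.

```python
import re

# Hypothetical field rules, for illustration only.
DEFAULT_FIELD = r"[^_/]+"               # plain field: no underscore, no "/"
UNDERSCORE_FIELD = r"[^/]+"             # "{name:_}": underscores allowed
DATES_FIELD = r"\d{4,8}(?:-\d{4,8})?"   # one date, or "start-end" bounds

_FIELD = re.compile(r"\{(\w+)(?::(\w+))?\}")


def compile_pattern(pattern):
    """Translate a "{field}" / "{field:_}" pattern into a compiled regex."""
    parts, pos = [], 0
    for m in _FIELD.finditer(pattern):
        # Literal text between two fields is matched verbatim.
        parts.append(re.escape(pattern[pos:m.start()]))
        name, spec = m.group(1), m.group(2)
        if name == "DATES":
            body = DATES_FIELD
        elif spec == "_":
            body = UNDERSCORE_FIELD
        elif spec is None:
            body = DEFAULT_FIELD
        else:
            raise ValueError(f"unknown format specifier: {spec!r}")
        parts.append(f"(?P<{name}>{body})")
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


def parse_path(pattern, path):
    """Return the parsed fields as a dict, or None if the path does not match."""
    m = compile_pattern(pattern).fullmatch(path)
    if m is None:
        return None
    fields = m.groupdict()
    dates = fields.pop("DATES", None)
    if dates is not None:
        # A single date is used as both bounds.
        start, _, end = dates.partition("-")
        fields["date_start"] = start
        fields["date_end"] = end or start
    return fields


# Made-up pattern and file name:
fields = parse_path(
    "{source}/{variable:_}_{DATES}.nc",
    "MRCC5/tas_max_19900101-19901231.nc",
)
```

In this sketch the `_` specifier lets `variable` absorb `tas_max` in one field, and `DATES` yields `date_start` and `date_end` whether the file name carries one date or two. With the `parse` package itself, custom specifiers of this kind are registered through its `extra_types` argument.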
TODO:
Code complexity
Lol. I thought this PR would simplify the `parse_directory` system.
The problem is the MRCC5. Or at least, it is the enormous size of this database and its scattering on slow-to-read disks. It is the only reason for all these complexities:
I wanted to make clear that the complexity of this code is NOT only because I like to optimize things. Last time I ran the MRCC5 catalog creation (with this code), it took 7 hours. Just imagine without the optimizations. (And this while missing a full disk).