NOAA-PMEL / Ferret

The Ferret program from NOAA/PMEL
https://ferret.pmel.noaa.gov/Ferret/
The Unlicense

Make defining of aggregations faster and easier -- support larger aggregations #1523

Open karlmsmith opened 6 years ago

karlmsmith commented 6 years ago

Reported by steven.c.hankin on 26 Mar 2015 18:27 UTC

In the current Ferret we have great flexibility in defining aggregations. Each new file to be aggregated is a separate Ferret dataset. Variable definitions can be customized dataset by dataset in order to achieve conformability and make aggregation possible. This is very powerful and general.

What Ferret lacks is an easy way to define routine aggregations quickly. This Trac ticket offers a few thoughts on how to achieve that goal.

The new functionality we need is:

  1. the ability to initialize lots of datasets at once quickly and easily
  2. hiding of datasets that have been pulled into aggregations

To initialize datasets quickly how about:

SET DATA/LIKE=/TDELTA=/LDELTA=/MANIFEST=  file1, file2, file3, ...
DEFINE AGGREGATION/<axis>/D=lo:hi aggname

o "file1, file2, file3, ..." may be omitted if /MANIFEST=filename.txt is provided, pointing to a file containing a list of names. Note that the files must be listed in time sequence.

o /LIKE (no argument) means that name no. 1 provides the pattern and the other names should be initialized as clones in everything except their time axis. If an argument is provided, LIKE=foobar, then foobar is the pattern dataset. Ferret should normally not even open the subsequent datasets to initialize them. It need only copy the internal dataset COMMON variables, so it will be super fast. (/LIKE can only be used on datasets with regularly spaced time axes.)

o If /TDELTA and /LDELTA are absent, then Ferret will assume that the time axes of successive files stack one after the next with a regular time spacing. Use /TDELTA or /LDELTA to tell Ferret the delta T of initialization between forecast datasets.

o It might be nice also to provide a /DIAGNOSTIC option (or borrow SET MODE DIAG), in which case Ferret does open each dataset and confirms that it truly is a clone of the LIKE= dataset.

o SET DATA could define symbols ($Dset_Like_lo) and ($Dset_Like_hi) that could be used as arguments in the subsequent DEFINE AGGREGATION/D=lo:hi

o The DEFINE AGGREGATION command would set all of the component datasets to a "hidden" state. Dataset hiding could be kept very simple. It need not affect the numbering of datasets, or even their accessibility when, say, "D=5" is encountered and dataset 5 is hidden. It need affect only the SHOW DATA command. SHOW DATA/HIDDEN could temporarily override dataset hiding. SET DATA/HIDE and SET DATA/VISIBLE could provide individual control.
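The cloning, time-axis stacking, and hiding behavior proposed above can be illustrated with a minimal Python sketch. This is not Ferret code: the `Dataset` record and the functions `init_like`, `define_aggregation`, and `show_data` are hypothetical names invented for illustration, and the sketch assumes a regularly spaced time axis described by a start, a step, and a count.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class Dataset:
    """Hypothetical stand-in for Ferret's internal per-dataset metadata."""
    name: str
    t_start: float        # first time coordinate
    t_step: float         # regular time spacing
    n_steps: int          # number of time steps in the file
    hidden: bool = False


def init_like(pattern: Dataset, names: List[str],
              tdelta: Optional[float] = None) -> List[Dataset]:
    """Clone `pattern` for each name without opening any file.

    With no tdelta, each clone starts where the previous file ends, so the
    time axes stack end to end. With tdelta, successive clones start tdelta
    apart, as for forecast runs initialized at regular intervals.
    """
    stride = tdelta if tdelta is not None else pattern.t_step * pattern.n_steps
    return [replace(pattern, name=name, t_start=pattern.t_start + i * stride)
            for i, name in enumerate(names)]


def define_aggregation(datasets: List[Dataset], lo: int, hi: int) -> None:
    """Mark members lo..hi (1-based, inclusive) hidden; numbering is unchanged."""
    for dset in datasets[lo - 1:hi]:
        dset.hidden = True


def show_data(datasets: List[Dataset], include_hidden: bool = False) -> List[str]:
    """List dataset numbers and names, skipping hidden ones unless asked."""
    return [f"{num}> {dset.name}"
            for num, dset in enumerate(datasets, start=1)
            if include_hidden or not dset.hidden]
```

Note that hiding here only filters the listing: a hidden dataset keeps its number and stays reachable by index, which matches the "keep it very simple" intent of the proposal.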

Migrated-From: http://dunkel.pmel.noaa.gov/trac/ferret/ticket/2251

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 28 Mar 2015 15:31 UTC

A perhaps superior variation on the above syntax:

DEFINE AGGREGATION/<axis>/LIKE=/TDELTA=/LDELTA= aggname = my_file_list

where "my_file_list" is a Ferret string variable that contains a sorted list of the desired input files. This would reduce the setup to a single command, and make the hidden datasets more truly hidden. "/LIKE" would rarely be used (maybe drop it), though it would offer a way to support special dataset set-ups, such as order-permuted datasets. (Potentially even LET/D definitions could be copied by /LIKE, if there is a use case where this is worth the effort.) A user could still utilize an external file ("manifest") of sorted filenames by using

LET my_file_list = {spawn:cat manifest.txt}

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 8 Apr 2015 17:48 UTC

Note that if the ability to define aggregations quickly/easily is successful as we hope, then a scaling problem will arise in the variable uvar_grid. It is 2-dimensional: max_uvar X max_gfdl_dsets, which will make it grow awfully large as the number of (hidden) datasets grows. A fairly easy way around this will be to define a pool of grids, instead of the 2d array --

    uv_grids(pool_size)
    uv_grid_dset(pool_size)

Then use linked lists -- a free list and a base_pointer for each individual uvar. Each base pointer points to the list of the grids that uvar owns. The code changes can be kept a little smaller by creating a new INTEGER FUNCTION UVAR_GRID(uvar,dset) that returns the linked list member. It will behave just like the current direct references to the 2d uvar_grid array.
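To make the pooled layout concrete, here is a minimal Python sketch of the idea, not the Fortran that Ferret would actually use. The array names `uv_grids` and `uv_grid_dset` and the accessor name `uvar_grid` come from the comment above; everything else (`UvarGridPool`, `next_slot`, `free_head`, `set_grid`, `release_uvar`, the `NIL` marker, 0-based indexing) is an assumption made for illustration.

```python
from typing import List, Optional

NIL = -1  # end-of-list marker (an assumption; Fortran code might use 0)


class UvarGridPool:
    """Pool of (grid, dataset) slots shared by all user variables."""

    def __init__(self, pool_size: int, max_uvar: int) -> None:
        # Assumes pool_size >= 1.
        self.uv_grids: List[Optional[int]] = [None] * pool_size  # grid per slot
        self.uv_grid_dset: List[int] = [NIL] * pool_size         # dataset per slot
        # next_slot chains slots; initially every slot is on the free list.
        self.next_slot: List[int] = list(range(1, pool_size)) + [NIL]
        self.free_head: int = 0
        # One list head per uvar, pointing at the slots that uvar owns.
        self.base_pointer: List[int] = [NIL] * max_uvar

    def uvar_grid(self, uvar: int, dset: int) -> Optional[int]:
        """Return the grid of `uvar` in `dset`, or None if none is stored."""
        slot = self.base_pointer[uvar]
        while slot != NIL:
            if self.uv_grid_dset[slot] == dset:
                return self.uv_grids[slot]
            slot = self.next_slot[slot]
        return None

    def set_grid(self, uvar: int, dset: int, grid: int) -> None:
        """Store or update the grid of `uvar` in `dset`."""
        slot = self.base_pointer[uvar]
        while slot != NIL:
            if self.uv_grid_dset[slot] == dset:
                self.uv_grids[slot] = grid
                return
            slot = self.next_slot[slot]
        if self.free_head == NIL:
            raise MemoryError("grid pool exhausted")
        # Pop a slot off the free list and push it onto this uvar's list.
        slot = self.free_head
        self.free_head = self.next_slot[slot]
        self.uv_grids[slot] = grid
        self.uv_grid_dset[slot] = dset
        self.next_slot[slot] = self.base_pointer[uvar]
        self.base_pointer[uvar] = slot

    def release_uvar(self, uvar: int) -> None:
        """Return every slot owned by `uvar` to the free list."""
        slot = self.base_pointer[uvar]
        while slot != NIL:
            following = self.next_slot[slot]
            self.uv_grids[slot] = None
            self.uv_grid_dset[slot] = NIL
            self.next_slot[slot] = self.free_head
            self.free_head = slot
            slot = following
        self.base_pointer[uvar] = NIL
```

Storage in this layout scales with the number of (uvar, dataset) pairs actually in use rather than with max_uvar X max_gfdl_dsets, at the cost of a short list walk on each lookup.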

karlmsmith commented 6 years ago

Modified by @AndrewWittenberg on 10 Apr 2015 22:13 UTC

karlmsmith commented 6 years ago

Comment by @AnsleyManke on 19 May 2015 23:30 UTC

All or parts of this to be included in the next PyFerret release?