pyFerret scalability: loops and performance

karlmsmith commented 6 years ago

Reported by steven.c.hankin on 16 May 2016 17:01 UTC

Internally, the current Ferret code has a number of places where brute-force loops over the ranges 1:maxdsets and 1:maxvars are used to locate a dataset. As the number of datasets grows much larger, these loops will become a (modest) performance drag.

The code would be improved by managing datasets as a LIST and better utilizing the existing LISTs of variables within a dataset, so that the brute force looping is no longer done.
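As a rough illustration of the idea, here is a minimal sketch in C of a list that holds only the datasets that are actually open, so a lookup walks the open entries instead of every slot in 1:maxdsets. The names `dset_node`, `dset_list`, `dset_list_add`, `dset_list_find` and `dset_list_free` are hypothetical and do not come from the Ferret source; Ferret's existing LIST library would presumably be used in practice.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical node: one entry per OPEN dataset (not one per possible slot). */
typedef struct dset_node {
    int dset_num;              /* dataset number as seen by the rest of the code */
    char name[64];             /* short dataset name used for lookups */
    struct dset_node *next;
} dset_node;

/* Hypothetical list header. */
typedef struct {
    dset_node *head;
    int count;
} dset_list;

/* Add a dataset to the front of the list. Returns 0 on success, -1 on failure. */
int dset_list_add(dset_list *list, int dset_num, const char *name)
{
    assert(list != NULL && name != NULL);
    dset_node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    node->dset_num = dset_num;
    strncpy(node->name, name, sizeof node->name - 1);
    node->name[sizeof node->name - 1] = '\0';
    node->next = list->head;
    list->head = node;
    list->count++;
    return 0;
}

/* Find a dataset by name. Only open datasets are visited.
   Returns the dataset number, or -1 if no such dataset is open. */
int dset_list_find(const dset_list *list, const char *name)
{
    assert(list != NULL && name != NULL);
    for (const dset_node *n = list->head; n != NULL; n = n->next) {
        if (strcmp(n->name, name) == 0)
            return n->dset_num;
    }
    return -1;
}

/* Release every node and reset the list to empty. */
void dset_list_free(dset_list *list)
{
    assert(list != NULL);
    dset_node *n = list->head;
    while (n != NULL) {
        dset_node *next = n->next;
        free(n);
        n = next;
    }
    list->head = NULL;
    list->count = 0;
}
```

With this shape the cost of a lookup depends on how many datasets are open rather than on the compiled-in maximum. If lookups by name become frequent with thousands of datasets, a hash table keyed on the name would be a natural next step.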

Migrated-From: http://dunkel.pmel.noaa.gov/trac/ferret/ticket/2424

karlmsmith commented 6 years ago

Comment by steven.c.hankin on 16 May 2016 17:57 UTC

Since this ticket is mostly a scratch pad, I'll add a related thought:

There is a fair amount of information that is duplicated between the old XDSET_INFO common area and the much newer ncdset structures found in NCF_Util.h. The name and path of the dataset are the largest of these, taking 4k bytes of COMMON per dataset. The code could be modernized by pushing all of the dataset metadata into the ncdset structure and eliminating most of the variables from XDSET_INFO. By creating a collection of GET_ and SET_ subroutines that move metadata into and out of the ncdset structure, the Ferret code would be largely unchanged as we adapt to this switch -- i.e. there would be lots of small edits, but the logic would be unchanged.

This change would be a natural part of scaling pyFerret up to handle thousands of datasets.
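The GET_/SET_ idea might look something like the following sketch. It is written in C on the assumption that the accessors would live next to the ncdset code; `dset_meta` is a hypothetical stand-in for the real ncdset structure, and the function names are invented for illustration. The Fortran-callable wrappers that the existing code would actually call are not shown.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the metadata part of an ncdset structure.
   Strings are allocated to fit, rather than reserved at a fixed size. */
typedef struct {
    char *name;
    char *path;
} dset_meta;

/* Replace the string held in *slot with a copy of value.
   Returns 0 on success, -1 if memory could not be allocated. */
static int replace_string(char **slot, const char *value)
{
    size_t len = strlen(value) + 1;
    char *copy = malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, value, len);
    free(*slot);               /* free(NULL) is a no-op */
    *slot = copy;
    return 0;
}

/* SET_ accessors: the only places that write the metadata. */
int set_dset_name(dset_meta *d, const char *name)
{
    assert(d != NULL && name != NULL);
    return replace_string(&d->name, name);
}

int set_dset_path(dset_meta *d, const char *path)
{
    assert(d != NULL && path != NULL);
    return replace_string(&d->path, path);
}

/* GET_ accessors: return an empty string when nothing has been set. */
const char *get_dset_name(const dset_meta *d)
{
    assert(d != NULL);
    return d->name != NULL ? d->name : "";
}

const char *get_dset_path(const dset_meta *d)
{
    assert(d != NULL);
    return d->path != NULL ? d->path : "";
}

/* Release the strings owned by the structure. */
void free_dset_meta(dset_meta *d)
{
    assert(d != NULL);
    free(d->name);
    free(d->path);
    d->name = NULL;
    d->path = NULL;
}
```

Because callers only go through the accessors, each reference to a common-block variable becomes a call to the matching GET_ or SET_ routine, which is the "lots of small edits" case described above. Storing strings at their actual length also avoids reserving the full name and path space for every possible dataset.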

karlmsmith commented 6 years ago

Adding @AndrewWittenberg

NOAA-PMEL / Ferret

pyFerret scalability: loops and performance #1696