barronh / pseudonetcdf

PseudoNetCDF like NetCDF except for many scientific format backends
GNU Lesser General Public License v3.0

pnc subset function eliminating variables with spaces #140

Closed SfarrellCMAQ closed 4 months ago

SfarrellCMAQ commented 7 months ago

Hi Barron,

I've noticed that when the pnc subset function is run on a list of variables in the shp2cmaq tool, variables with spaces in their names are dropped from the subset. So far I've only seen this with spaces, but it may apply to other characters.

Thank you! Sara

barronh commented 7 months ago

This turned out to be complicated because it does not always fail. The example below works just fine. It does, however, fail under specific conditions.

import PseudoNetCDF as pnc

gf = pnc.pncopen(
    '/home/bhenders/GRIDDESC', GDNAM='12US1', format='griddesc', withcf=False,
    var_kwds=['TEST1', 'TEST 2']
)
newf = gf.subset(['TEST1', 'TEST 2'])
print(list(newf.variables))
# ['TEST1', 'TEST 2', 'TFLAG']

It fails when at least one variable name is longer than 16 characters, which is against the IOAPI conventions. The example below loses 'TEST 2' when a long variable name is added.

import PseudoNetCDF as pnc

gf = pnc.pncopen(
    '/home/bhenders/GRIDDESC', GDNAM='12US1', format='griddesc', withcf=False,
    var_kwds=['TEST1', 'TEST 2', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT']
)
newf = gf.subset(['TEST1', 'TEST 2'])
print(list(newf.variables))
# ['TEST1', 'TFLAG']

I'll describe the problem in detail. The subsetVariables method (alias subset)[1] checks whether the IOAPI file has a variable before subsetting, using the IOAPI attribute VAR-LIST. VAR-LIST is a single string of variable names, each stored as a 16-character substring. This works because IOAPI requires variable names to be no more than 16 characters long. The VAR-LIST parser getVarlist[2] can also handle long (17+ character) names, but only if they contain no spaces. If every variable name is 16 characters or fewer, then VAR-LIST meets the IOAPI conventions, getVarlist chunks VAR-LIST into 16-character substrings (each a name), and everything works as expected, even with spaces in names. If any variable name is longer than 16 characters, VAR-LIST cannot be reliably split at fixed widths. Instead, getVarlist splits on spaces, because IOAPI files usually don't have spaces in names. As a result, a name that contains a space is broken into pieces that match no variable, so it is never found.
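To make the two parsing strategies concrete, here is a minimal standalone sketch. It is not the actual getVarlist code: build_var_list and parse_var_list are hypothetical helpers, and the rule for choosing between fixed-width chunking and whitespace splitting is an assumption based only on the description above.

```python
WIDTH = 16  # IOAPI variable-name field width


def build_var_list(names, width=WIDTH):
    """Hypothetical helper: pad each name to `width` and concatenate.

    Names longer than `width` are not truncated, so they break the
    fixed-width layout, as in the failing example.
    """
    return ''.join(name.ljust(width) for name in names)


def parse_var_list(var_list, variable_names, width=WIDTH):
    """Hypothetical stand-in for the parsing behavior described above.

    Assumption: fixed-width chunking is used only when every known
    variable name fits in `width` characters; otherwise the string
    is split on whitespace.
    """
    if all(len(name) <= width for name in variable_names):
        # Conventional case: each 16-character chunk is one name,
        # so embedded spaces survive.
        chunks = [var_list[i:i + width] for i in range(0, len(var_list), width)]
        return [chunk.strip() for chunk in chunks]
    # Fallback case: long names make chunking unreliable, so split on
    # whitespace. A name like 'TEST 2' becomes 'TEST' and '2'.
    return var_list.split()


ok_names = ['TEST1', 'TEST 2']
print(parse_var_list(build_var_list(ok_names), ok_names))
# ['TEST1', 'TEST 2']

long_names = ['TEST1', 'TEST 2', 'T' * 31]
print(parse_var_list(build_var_list(long_names), long_names))
# ['TEST1', 'TEST', '2', 'TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT']
```

In the second case neither 'TEST' nor '2' matches a real variable, which is consistent with 'TEST 2' being dropped from the subset.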

Essentially, this only happens when two of my expectations of IOAPI files are simultaneously violated. I think the best way to handle this is to only violate one. ;) I'll recommend a change to the shp2cmaq PR, so that this won't come up.

  1. https://github.com/barronh/pseudonetcdf/blob/master/src/PseudoNetCDF/cmaqfiles/_ioapi.py#L435
  2. https://github.com/barronh/pseudonetcdf/blob/master/src/PseudoNetCDF/cmaqfiles/_ioapi.py#L663