Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

"Name contains illegal characters" in main branch, but not 4.8.1 #2158

Status: Closed (kgerheiser closed this issue 2 years ago)

kgerheiser commented 2 years ago

Changing from NetCDF 4.7.4 or 4.8.1 to the latest head of main, I get an error about invalid characters.

FATAL from PE  0: NetCDF: Name contains illegal characters: 

netcdf_add_variable:  file: RESTART/fv_core.res.nc variable: xaxis_1

This is the message printed by the application I'm running (ufs-weather-model), from within the FMS library (https://github.com/NOAA-GFDL/FMS).

Using GCC 9.4.0 and macOS 11.6.

I'm guessing it's related to macOS since it doesn't show up on Linux, but I don't know what has changed since NetCDF 4.7.4. I'm going to dig a little further, but maybe someone knows of a change within NetCDF that would cause this.

Could it be the slash in RESTART/?

Code also works with NetCDF/4.8.1, so it's something recent.

edwardhartnett commented 2 years ago

Is that a netCDF error message? Or an fv3 error message?

If the variable name is "xaxis_1" that should be quite legal...

kgerheiser commented 2 years ago

It's an error message from NetCDF (nf90_def_var) within FMS. Interestingly, I ran into a similar issue last night on Orion with Intel where the output failed in the same file, but it actually crashed with a stack trace.

DennisHeimbigner commented 2 years ago

The slash character is disallowed in any identifier.
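
For reference, the naming rules behind this can be approximated in a few lines of C. The sketch below is a simplification written for illustration, not netCDF's actual validation code: `check_name` and the `name_problem` codes are hypothetical names, and it assumes only the commonly documented rules (first character alphanumeric, underscore, or the start of a multi-byte UTF-8 sequence; no slash, control character, or DEL anywhere; no trailing space). It does not validate UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not part of the netCDF API. */
enum name_problem {
    NAME_OK = 0,
    NAME_EMPTY,
    NAME_BAD_FIRST_CHAR,
    NAME_HAS_SLASH,
    NAME_HAS_CONTROL_CHAR,
    NAME_TRAILING_SPACE
};

/* Simplified approximation of the documented identifier rules.
 * Returns the first problem found, or NAME_OK. */
static enum name_problem check_name(const char *name)
{
    assert(name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return NAME_EMPTY;

    /* First character: ASCII letter, digit, underscore, or a byte >= 0x80
     * (taken here as the start of a multi-byte UTF-8 sequence). */
    unsigned char first = (unsigned char)name[0];
    int first_ok = (first >= 'A' && first <= 'Z') ||
                   (first >= 'a' && first <= 'z') ||
                   (first >= '0' && first <= '9') ||
                   first == '_' || first >= 0x80;
    if (!first_ok)
        return NAME_BAD_FIRST_CHAR;

    /* No slash, control character, or DEL anywhere in the name. */
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)name[i];
        if (c == '/')
            return NAME_HAS_SLASH;
        if (c < 0x20 || c == 0x7F)
            return NAME_HAS_CONTROL_CHAR;
    }

    /* Trailing blanks are rejected as well. */
    if (name[len - 1] == ' ')
        return NAME_TRAILING_SPACE;

    return NAME_OK;
}
```

Under these assumptions, `check_name("xaxis_1")` returns `NAME_OK`, while a path such as `RESTART/fv_core.res.nc` would be flagged for the slash only if it were passed as an identifier rather than as a filename.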

edwardhartnett commented 2 years ago

That's in the filename, not the variable name, right?

Also, this works in 4.7.4. @DennisHeimbigner have you changed any name rules since then?

kgerheiser commented 2 years ago

Yes, that's the filename. I don't think it's related to the slash because the file exists when the failure happens. And it works in 4.8.1 too.

I was going to do a binary search with a few more builds to narrow down which commit caused the issue. It's somewhere between 4.8.1 and HEAD, so it happened pretty recently.

kgerheiser commented 2 years ago

Narrowed it down to somewhere between Aug-24 and Sep-2.

This commit doesn't work: https://github.com/Unidata/netcdf-c/commit/09defc5c72d6df64514fb7c348f2d2dfe0502f44

This one does: https://github.com/Unidata/netcdf-c/commit/00cabb9486ac36173c44ba102c3a13a4d31ddf69

DennisHeimbigner commented 2 years ago

Can you send the stack trace?

kgerheiser commented 2 years ago

Unfortunately, gfortran just gives me a hexdump instead of a backtrace. Going to try again with Intel, but have to re-compile a lot of things.

But I actually think it's a NetCDF-Fortran issue.

If I build the main branch of NetCDF-C with NetCDF-Fortran 4.5.3, the run completes as expected, but if I use the revert-305-revert-304-ejh_quantize branch of NetCDF-Fortran (without even using the quantize feature), the bug appears.

But that doesn't really make sense either. The PR does nothing but add a new interface.

https://github.com/Unidata/netcdf-fortran/pull/306/files

kgerheiser commented 2 years ago

I was able to reproduce the issue on one of our supercomputers using NetCDF-C main and the NetCDF-Fortran quantize branch with Intel compilers. Unlike the GNU build, the Intel build crashes instead of NetCDF returning an error code.

The stack trace doesn't go into NetCDF (maybe I need to add -g to CFLAGS), but points to this line in FMS:

https://github.com/NOAA-GFDL/FMS/blob/9d25a1e4f5e4b4040b513b5d40b09d7b6f904fbf/fms2_io/netcdf_io.F90#L896

err = nf90_def_var(fileobj%ncid, trim(variable_name), vtype, dimids, varid)

kgerheiser commented 2 years ago

I can also confirm that it seems to be a NetCDF-Fortran issue with the quantize branch, or some commit since 4.5.3. The run completed fine using the main branch of NetCDF-C and NetCDF-Fortran 4.5.3.

https://github.com/Unidata/netcdf-fortran/pull/306/files

DennisHeimbigner commented 2 years ago

Is it possible to check the result of that trim() call to see what it is returning? It is barely possible that there is a leading blank.
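
One way to run that check is to dump the exact bytes of the name just before the variable is defined, so that a leading blank (0x20) or a stray control character becomes visible. The following is a minimal C sketch, assuming the name has already been converted to a NUL-terminated C string; `hex_dump_name` is a hypothetical helper, not part of netCDF or FMS.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of `name` into `out` as
 * space-separated two-digit hex values, e.g. "78 61 78".
 * Returns the number of characters written (excluding the NUL),
 * or -1 if `out` is too small. */
static int hex_dump_name(const char *name, char *out, size_t out_size)
{
    assert(name != NULL && out != NULL);

    if (out_size == 0)
        return -1;
    out[0] = '\0';

    size_t len = strlen(name);
    size_t pos = 0;
    for (size_t i = 0; i < len; i++) {
        /* Each byte needs at most 3 characters (" XX") plus the NUL. */
        if (pos + 4 > out_size)
            return -1;
        int n = snprintf(out + pos, out_size - pos,
                         i ? " %02x" : "%02x", (unsigned char)name[i]);
        pos += (size_t)n;
    }
    return (int)pos;
}
```

A clean `xaxis_1` would dump as `78 61 78 69 73 5f 31`; a leading `20`, or any value below `20`, would point at the name itself rather than at the library's rules.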

DennisHeimbigner commented 2 years ago

Is there any more information about this problem?

edwardhartnett commented 2 years ago

@kgerheiser can you make a small Fortran test program that demonstrates this problem?

In general, when you think you've found a bug in netCDF, a small test program is the next step and the fastest way to demonstrate a bug and start to resolve it...

kgerheiser commented 2 years ago

I compiled one of the test programs included in the NetCDF-Fortran PR and couldn't reproduce the issue. I could only reproduce it with the UFS weather model, and even then, across different runs with different commits, I could sometimes reproduce it and other times not. It has been very difficult to pin down exactly what goes wrong.

I think it's possible it's a memory bug in the application that manifests in NetCDF.

edwardhartnett commented 2 years ago

OK, I suggest you open an issue in fv3atm about this, so that if someone else encounters it, there will at least be a record of your efforts, even if you can't find the cause. I think this issue should be closed if this seems not to be a netCDF problem...

edwardhartnett commented 2 years ago

@kgerheiser can you close this issue please?