kgerheiser closed this issue 2 years ago
Is that a netCDF error message? Or an fv3 error message?
If the variable name is "xaxis_1" that should be quite legal...
It's an error message from NetCDF (nf90_def_var) within FMS. Interestingly, I ran into a similar issue last night on Orion with Intel where the output failed in the same file, but it actually crashed with a stack trace.
Slash character is disallowed in any identifier.
That's in the filename, not the variable name, right?
Also, this works in 4.7.4. @DennisHeimbigner have you changed any name rules since then?
Yes, that's the filename. I don't think it's related to the slash because the file exists when the failure happens. And it works in 4.8.1 too.
I was going to do a binary search, doing a few more builds to narrow down which commit caused the issue. It's somewhere between 4.8.1 and HEAD, so it happened pretty recently.
Narrowed it down to somewhere between Aug-24 and Sep-2.
This commit doesn't work: https://github.com/Unidata/netcdf-c/commit/09defc5c72d6df64514fb7c348f2d2dfe0502f44
This one does: https://github.com/Unidata/netcdf-c/commit/00cabb9486ac36173c44ba102c3a13a4d31ddf69
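For anyone repeating this kind of narrowing, the search can be sketched as a plain bisection over an ordered list of builds. This is only an illustration: `first_bad_commit`, `is_bad_fn`, and the stand-in predicate `fails_from` are hypothetical names, not part of netCDF or git, and the sketch assumes the failure reproduces deterministically and never disappears once introduced.

```c
#include <stddef.h>

/* Hypothetical predicate: returns nonzero if the build at position
 * `index` in the ordered commit list reproduces the failure. */
typedef int (*is_bad_fn)(size_t index, void *ctx);

/* Return the index of the first failing commit.
 * Assumes n >= 2, commits are ordered oldest to newest, commit 0 is
 * known good, commit n-1 is known bad, and the failure persists in
 * every commit after the one that introduced it. */
size_t first_bad_commit(size_t n, is_bad_fn is_bad, void *ctx)
{
    size_t good = 0;     /* newest index known to work */
    size_t bad = n - 1;  /* oldest index known to fail */

    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;
        if (is_bad(mid, ctx))
            bad = mid;   /* failure is at mid or earlier */
        else
            good = mid;  /* failure was introduced after mid */
    }
    return bad;
}

/* Stand-in predicate for illustration only: pretends every commit at
 * or after *ctx fails. A real predicate would build and run the model. */
int fails_from(size_t index, void *ctx)
{
    return index >= *(size_t *)ctx;
}
```

Each call to the predicate stands for one full rebuild and run, so a range of n commits needs roughly log2(n) builds rather than n.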
Can you send the stack trace?
Unfortunately, gfortran just gives me a hexdump instead of a backtrace. Going to try again with Intel, but have to re-compile a lot of things.
But I actually think it's a NetCDF-Fortran issue.
If I build the main branch of NetCDF-C with NetCDF-Fortran 4.5.3, the run completes as expected, but if I use the revert-305-revert-304-ejh_quantize branch of NetCDF-Fortran (not even using the quantize feature), the bug appears.
But that doesn't really make sense either. The PR does nothing but add a new interface.
I was able to reproduce the issue on one of our supercomputers using NetCDF-C main and the NetCDF-Fortran quantize branch with Intel compilers. Unlike GNU, Intel crashes instead of NetCDF returning an error code.
The stack trace doesn't go into NetCDF (maybe I need to add -g to CFLAGS), but points to this line in FMS:
err = nf90_def_var(fileobj%ncid, trim(variable_name), vtype, dimids, varid)
I can also confirm that it seems to be a NetCDF-Fortran issue with the quantize branch, or some commit since 4.5.3. The run completed fine using the main branch of NetCDF-C and NetCDF-Fortran 4.5.3.
Is it possible to check the result of that trim() call to see what it is returning? It is barely possible that there is a leading blank.
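Fortran's trim() removes only trailing blanks, so a leading blank would survive the call and reach the library. A minimal sketch of the kind of check suggested here, written in C for illustration: `check_name` and the `name_issue` enum are hypothetical and deliberately simplified, and they are not netCDF's own name validation.

```c
#include <stddef.h>

/* Hypothetical result codes for a simplified name check. */
enum name_issue {
    NAME_OK,
    NAME_EMPTY,
    NAME_LEADING_BLANK,
    NAME_TRAILING_BLANK,
    NAME_HAS_SLASH
};

/* Report the first obvious problem in a candidate variable name.
 * This only looks for the cases raised in the discussion (blanks at
 * either end and a slash); it is not a full validity check. */
enum name_issue check_name(const char *name)
{
    size_t len;

    if (name == NULL || name[0] == '\0')
        return NAME_EMPTY;
    if (name[0] == ' ')
        return NAME_LEADING_BLANK;

    /* Scan to the terminator, stopping early on a slash. */
    for (len = 0; name[len] != '\0'; len++) {
        if (name[len] == '/')
            return NAME_HAS_SLASH;
    }
    if (name[len - 1] == ' ')
        return NAME_TRAILING_BLANK;

    return NAME_OK;
}
```

Printing the name between delimiters just before the define call (for example, in brackets) would show the same thing without any extra code.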
Is there any more information about this problem?
@kgerheiser can you make a small Fortran test program that demonstrates this problem?
In general, when you think you've found a bug in netCDF, a small test program is the next step and the fastest way to demonstrate a bug and start to resolve it...
I compiled one of the test programs included in the NetCDF Fortran PR, and couldn't reproduce the issue. I could only reproduce the issue with the UFS weather model. And even then across different runs with different commits I could sometimes reproduce the issue and other times not. It has been very difficult trying to pin down exactly what goes wrong.
I think it's possible it's a memory bug in the application that manifests in NetCDF.
OK, I suggest you open an issue in fv3atm about this, so at least if it is encountered by someone else, there will be a record of your efforts, even if you can't find it. I think this issue should be closed, if this seems not to be a netCDF problem...
@kgerheiser can you close this issue please?
Changing from NetCDF 4.7.4 or 4.8.1 to the latest head of main, I get an error about invalid characters. This is the message printed out from the application I'm running (ufs-weather-model) within the library FMS (https://github.com/NOAA-GFDL/FMS)
Using GCC 9.4.0 and macOS 11.6
Guessing it's related to macOS since it doesn't show up on Linux, but don't know what has changed since NetCDF 4.7.4. I'm going to dig a little further, but maybe someone knows of a change within NetCDF that would cause this.
Could it be the slash in RESTART/? Code also works with NetCDF/4.8.1, so it's something recent.