JCSDA-internal / ioda-converters

Various converters for getting obs data in and out of IODA

Fix failing BUFR ctests. #1419

Closed rmclaren closed 9 months ago

rmclaren commented 9 months ago

Description

Fixes failing ctests for ioda-converter BUFR code.

rmclaren commented 9 months ago

@PraveenKumar-NOAA In your script bufr_ncep_prepbufr_adpupa.py you create a field for specificHumidity. The latest code fixed a problem with this field, and I was wondering if you would double-check the data in the output file testoutput/prepbufr_adpupa_api.nc?

rmclaren commented 9 months ago

Are we treating warnings as errors for the purpose of CI?

srherbener commented 9 months ago

Are we treating warnings as errors for the purpose of CI?

Warnings are not treated as errors at this point, but we are attempting to resolve all the compiler warnings in JEDI. Thanks for addressing the warnings in this PR!

BenjaminRuston commented 9 months ago

@rmclaren that fixed 3 of the 4 for all compilers, and 4 out of 4 for Intel. The CDASH output (though it has a green check) can show the issues gnu and clang are seeing. Orion can also build with a gnu environment.

but here's a few lines of what is failing in the ctest: test_iodaconv_prepbufr_ncep_aircftprofiles2ioda

DIFFER : VARIABLE : pressure : POSITION : [53,6] : VALUES : 5.395e-43 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,7] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,8] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,9] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,10] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,11] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,12] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,13] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,14] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,15] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,0] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,1] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,2] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,3] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,4] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,5] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,6] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
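
For anyone who wants to summarize output like the above rather than read it line by line, here is a minimal sketch (not part of this PR). It assumes every line follows the `DIFFER : VARIABLE : ... : POSITION : [...] : VALUES : a <> b : PERCENT : p` layout shown; the name `parse_differ_lines` and the handling of an optional numeric `NNN: ` prefix are illustrative assumptions, not anything from ioda-converters.

```python
import re
from collections import Counter

# One DIFFER line; an optional leading "NNN: " prefix is tolerated (assumption).
DIFFER_RE = re.compile(
    r"^(?:\d+:\s*)?DIFFER : VARIABLE : (?P<var>\S+) : POSITION : \[(?P<pos>[\d,]+)\]"
    r" : VALUES : (?P<test>\S+) <> (?P<ref>\S+) : PERCENT : (?P<pct>\S+)\s*$"
)

def parse_differ_lines(text):
    """Return a list of (variable, position, test_value, reference_value)."""
    rows = []
    for line in text.splitlines():
        m = DIFFER_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a DIFFER line
        pos = tuple(int(p) for p in m.group("pos").split(","))
        rows.append((m.group("var"), pos, float(m.group("test")), float(m.group("ref"))))
    return rows

# Three lines copied from the log above.
sample = """\
DIFFER : VARIABLE : pressure : POSITION : [53,6] : VALUES : 5.395e-43 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,7] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,2] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
"""

rows = parse_differ_lines(sample)
by_variable = Counter(var for var, _, _, _ in rows)      # differing entries per variable
reference_values = {ref for _, _, _, ref in rows}        # distinct reference values seen
print(by_variable)
print(reference_values)
```

Grouping by variable and by reference value makes it quick to see whether every mismatch sits where the reference holds the same single value, which is the case in the lines quoted here.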

BenjaminRuston commented 9 months ago

to see that CDASH output you gotta click "details",

then scroll to the bottom and view the test summary (and click on the failing test)

rmclaren commented 9 months ago

@BenjaminRuston Thanks Ben, I was aware of this. This particular test is a little annoying as I haven't been able to reproduce it either on my machine or on Hera...

BenjaminRuston commented 9 months ago

@rmclaren thought you likely already knew those steps... and sorry to hear this has been difficult to track down. Do you think we should move forward with the fix you've found, and potentially disable the remaining ctest: test_iodaconv_prepbufr_ncep_aircftprofiles2ioda while we continue to investigate the solution?
Could ask others next week, but think that may be a course of action for the time being.

rmclaren commented 9 months ago

So I recompiled everything on HERA last using the spack-stack 1.5.1 GNU gcc/g++ (9.2.0) compiler and I'm not seeing the failure there. The only difference I can see is that your CI machines are using a different version of the GNU compiler (9.4.0).

rmclaren commented 9 months ago

Is there any chance the source data on the CI machines is corrupted in some way? Bad checkout?

BenjaminRuston commented 9 months ago

@rmclaren I did ask the infrastructure team to re-examine the CI containers and @climbfuji verified these

that being said we should get some other eyes on this, @PatNichols please see if you notice anything that may suggest the cause of these last errors

PatNichols commented 9 months ago

@rmclaren @BenjaminRuston I will take a look on the AWS instance to see if it passes.

PatNichols commented 9 months ago

@rmclaren What version of the gnu compiler are you using?

rmclaren commented 9 months ago

@PatNichols Like I said, on Hera I'm using 9.2.0 because that's the version available via spack-stack 1.5.1.

rmclaren commented 9 months ago

Locally I'm using the clang compiler for now (version 14.0.6). My Ubuntu virtual machine is using GNU compiler version 11.4.0. I had previously been using the Intel compiler on HERA, but this one is not an issue.... The unit test passes everywhere I have tried it...

PatNichols commented 9 months ago

@rmclaren Are you using spack-stack? spack-stack-1.5.1 or 1.5.0?

PatNichols commented 9 months ago

@rmclaren I am seeing the same failure on my Intel Mac using spack-stack-1.5.1, clang version 14.0 and gfortran version 12. If this gets to be too much to debug we can just eliminate the test for CI and deal with it in another PR later.

PatNichols commented 9 months ago

@rmclaren What it seems to be doing is assigning 0 to missing values? That's just a guess though. The reference file has many missing values.

BenjaminRuston commented 9 months ago

@rmclaren just reproduced this on Orion. Using your branch all ctests are passing except for:

test_iodaconv_prepbufr_ncep_aircftprofiles2ioda

This reproduces the same error our CI sees when I'm using the gcc compiler. You should be able to reproduce this on Orion with gnu by cloning the ufo-bundle and building with the environment:

# Purge all modules
module purge

# Load required modules
module use /work/noaa/epic/role-epic/spack-stack/orion/modulefiles
module load python/3.9.2
module load ecflow/5.8.4
module load mysql/8.0.31

# Gnu
module use /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load stack-gcc/10.2.0
module load stack-openmpi/4.0.4
module load stack-python/3.10.8

module load jedi-fv3-env
module load ewok-env
module load soca-env

looking back, it looks like the gnu compiler is 10.2.0?

Orion-login-3: /iodaconv >gcc --version
gcc (GCC) 10.2.0

rmclaren commented 9 months ago

What result do you get if you do ncdump -h test/testrun/gdas.t12z.acft_profiles.prepbufr.nc

PatNichols commented 9 months ago

What result do you get if you do ncdump -h test/testrun/gdas.t12z.acft_profiles.prepbufr.nc

They look similar. Both have many missing values. I am looking closer to see where things are not lining up.

PatNichols commented 9 months ago

@rmclaren This is the output from your PR for the pressure in one part of the output file:

2.477496e-41, 0, 3.279038e-43, 0, 0, 0, 3.279038e-43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.040255e-38, 3.443831e-41, 4.040255e-38, 3.443831e-41, 4.040256e-38, 3.443831e-41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.04026e-38, 3.443831e-41, 4.04026e-38, 3.443831e-41, 4.040261e-38, 3.443831e-41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.040322e-38, 3.443831e-41, 4.040322e-38, 3.443831e-41, 4.040325e-38, 3.443831e-41, 0, 0, 0, 0, 0, 0, 0, 0, 4.04034e-38, 3.443831e-41, 4.04034e-38, 3.443831e-41, 4.040343e-38, 3.443831e-41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.040264e-38, 3.443831e-41, 4.040264e-38, 3.443831e-41, 4.040265e-38, 3.443831e-41, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4.040313e-38, 3.443831e-41, 4.040313e-38, 3.443831e-41, 4.040316e-38, 3.443831e-41, 0, 0, The reference is all missing values...

rmclaren commented 9 months ago

@PatNichols Thanks. I was actually looking to see the header information, not the data. Don't forget the -h option.

rmclaren commented 9 months ago

@PraveenKumar-NOAA Thanks!

BenjaminRuston commented 9 months ago

@rmclaren , @PatNichols and @climbfuji this is still failing on Orion following https://spack-stack.readthedocs.io/en/1.5.1/PreConfiguredSites.html#msu-orion

and loading these (always load all three though assume only the first would be needed for the ioda-bundle):

module load jedi-fv3-env
module load ewok-env
module load soca-env

it does look like the output has changed; this is the Orion output for the failing test test_iodaconv_prepbufr_ncep_aircftprofiles2ioda:

105: DIFFER : VARIABLE : pressure : POSITION : [53,6] : VALUES : 2.47988e-41 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,7] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,8] : VALUES : 2.4775e-41 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,9] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,10] : VALUES : 1.43455e-37 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,11] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,12] : VALUES : 1.43455e-37 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,13] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,14] : VALUES : 1.43455e-37 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [53,15] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [54,12] : VALUES : 1.82169e-44 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [55,12] : VALUES : 1.82169e-44 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [56,12] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [57,12] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [58,12] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [59,12] : VALUES : 1.82169e-44 <> 3.40282e+38 : PERCENT : 100
105: DIFFER : VARIABLE : pressure : POSITION : [60,12] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100

climbfuji commented 9 months ago

Is this "erratic" behavior in a sense that it passes on some platforms, but not on others, or that it even changes on one platform? That's often caused by uninitialized variables or out-of-bounds array access etc.
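
One hedged way to probe the uninitialized-memory idea: several of the odd values in the logs above are exactly what small integers look like when their 32 bits are read as a float32, and the reference side, 3.40282e+38, prints the same as the largest finite float32. The sketch below reinterprets integer bit patterns as float32 using Python's `struct` module. The helper names are made up for illustration, and the 6-significant-digit formatting is only assumed to match how the compare tool prints. This is consistent with stale or non-float memory being written out, but it does not prove it.

```python
import struct

def bits_to_float32(bits):
    """Reinterpret a 32-bit unsigned integer bit pattern as an IEEE-754 float32."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def show(bits):
    """Format with 6 significant digits (assumed to match the compare log)."""
    return "%g" % bits_to_float32(bits)

# Small integers read as float32 give tiny subnormal values like those in the log.
for bits in (1, 13, 385):
    print(bits, "->", show(bits))

# Largest finite float32; prints like the reference value at the failing positions.
print(hex(0x7F7FFFFF), "->", show(0x7F7FFFFF))
```

If the pattern holds, a memory checker (for example Valgrind, or a build with `-fsanitize=address`) on a platform that reproduces the failure might point at where the buffer is left unset.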

BenjaminRuston commented 9 months ago

@climbfuji just looked at a previous log: the order in which the processors report changes, but the output does look the same...

PRE-MAIN-INFO Exporting Data
PRE-MAIN-INFO Finished [0.04s]
DIFFER : VARIABLE : pressure : POSITION : [53,6] : VALUES : 2.47988e-41 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,7] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,8] : VALUES : 2.4775e-41 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,9] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,10] : VALUES : 1.85421e-37 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,11] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,12] : VALUES : 1.85421e-37 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,13] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,14] : VALUES : 1.85421e-37 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [53,15] : VALUES : 0 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [54,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [55,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [56,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [57,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [58,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [59,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [60,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [61,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [62,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [63,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100
DIFFER : VARIABLE : pressure : POSITION : [64,0] : VALUES : 1.4013e-45 <> 3.40282e+38 : PERCENT : 100

BenjaminRuston commented 9 months ago

@rmclaren

i discussed this with @ADCollard and he agreed we could remove this ctest while we track this down

so break that into its own separate issue and PR. Will create a branch off yours and do this; just review and merge into your branch if you feel alright with placing test_iodaconv_prepbufr_ncep_aircftprofiles2ioda in the ongoing investigation bin

do think this is a canary telling us about a small initialization type step yet to be found

rmclaren commented 9 months ago

@BenjaminRuston I think the std namespace is being polluted with inconsistent versions of the std library...

PatNichols commented 9 months ago

@rmclaren @BenjaminRuston I created a new PR to merge into this one that comments out the bad ctest.

BenjaminRuston commented 9 months ago

@rmclaren one last verification see you merged the deactivation of the ctest for now... move forward here and continue to debug test_iodaconv_prepbufr_ncep_aircftprofiles2ioda

are you able to duplicate the behavior on Orion ?

@climbfuji and @PatNichols did you see Ron's comment:

@BenjaminRuston I think the std namespace is being polluted with inconsistent versions of the std library...

climbfuji commented 9 months ago

On which platform? Which compiler?

BenjaminRuston commented 9 months ago

the CI tests fail for clang and for gnu

the test test_iodaconv_prepbufr_ncep_aircftprofiles2ioda fails on my Mac with spack stack v1.5.1 and on Orion with v1.5.1 gnu using the modules:

# Purge all modules
module purge

# Load required modules
module use /work/noaa/epic/role-epic/spack-stack/orion/modulefiles
module load python/3.9.2
module load ecflow/5.8.4
module load mysql/8.0.31

# Gnu
module use /work/noaa/epic/role-epic/spack-stack/orion/spack-stack-1.5.1/envs/unified-env/install/modulefiles/Core
module load stack-gcc/10.2.0
module load stack-openmpi/4.0.4
module load stack-python/3.10.8

module load jedi-fv3-env
module load ewok-env
module load soca-env

rmclaren commented 9 months ago

Apparently my Orion account got deleted, as I've not used it enough... Feel free to merge this branch, I'll continue debugging on a different branch.

BenjaminRuston commented 9 months ago

@climbfuji I guess what we can use help with is standing up Ron on Hera? where this can reproduce..

@rmclaren was having trouble reproducing the behavior

BenjaminRuston commented 9 months ago

https://jointcenterforsatellitedataassimilation-jedi-docs.readthedocs-hosted.com/en/latest/using/jedi_environment/modules.html#hera

rmclaren commented 9 months ago

@BenjaminRuston Which project are you building? I always compile ioda-bundle directly.

BenjaminRuston commented 9 months ago

@BenjaminRuston Which project are you building? I always compile ioda-bundle directly.

yes @rmclaren I'm doing the same and just building the ioda-bundle to test the converters

rmclaren commented 9 months ago

@BenjaminRuston Thanks.