NOAA-EMC / NCEPLIBS-bufr

The NCEPLIBS-bufr library contains routines and utilities for working with the WMO BUFR format.

DA libs do not work with GSI #9

Closed mark-a-potts closed 3 years ago

mark-a-potts commented 4 years ago

My understanding of how the dynamic allocation build of bufrlib is supposed to work is that if the sizing of the bufr is not specified, it defaults to the size used for the static allocation. This should make it possible to swap an SA library for a DA version. Unfortunately, this does not seem to work with the GSI using either the old makefile build system or the just completed CMake build system. Can this be fixed?

jbathegit commented 4 years ago

Hi Mark - can you provide some more details, like what error message(s) you're seeing? Also, what operating system are you working on?

mark-a-potts commented 4 years ago

Here is where the error gets thrown in the stdout of the first regression test for GSI on Hera--

================================================================================
in mbuoy_info,n_comps = 3 n_scripps = 40 n_triton = 70 n_3mdiscus = 153
in mbuoyb_info,n_comps = 3 n_scripps = 40 n_triton = 70 n_3mdiscus = 153
in read_ship_info, nship = 0
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line     Source
gsi.x              0000000001F468DD  Unknown            Unknown  Unknown
libpthread-2.17.s  00002B8AF3B555F0  Unknown            Unknown  Unknown
gsi.x              00000000019AE71E  status.V           118      status.f
gsi.x              00000000019CC161  closbf_            61       closbf.f
gsi.x              00000000010C23AD  readbufrtovs       504      read_bufrtovs.f90
gsi.x              0000000000816494  read_obsmod_mp_re  1611     read_obs.F90
gsi.x              000000000075A94B  observermod_mpse   337      observer.F90
gsi.x              0000000000E4DC54  glbsoi             214      glbsoi.f90
gsi.x              0000000000629C70  gsisub_            201      gsisub.F90

For this test, I built the GSI as usual, but replaced the libbufr_v11.3.0_d_64.a library with libbufr_v11.3.0_d_64_DA.a in link stage. You should be able to reproduce by cloning my GSI repo with "git clone --branch Bufr_DA_test --recurse-submodules https://github.com:mark-a-potts/GSI". After the clone completes (the fix files will take a while to download), you can build by cd'ing into GSI and running "./ush/build_all_cmake.sh 0 $PWD". Once the build completes, if you cd to GSI/build, you can run the first regression test with the command "ctest -I 1,1".

mark-a-potts commented 4 years ago

This should work on wcoss (dells) as well. Make sure to run a "module purge; module use path-to-GSI/modulefiles; module load modulefile.ProdGSI.wcoss_d" before running "ctest -I 1,1".

The output from the test will be in /gpfs/dell2/ptmp/$LOGNAME/$ptmpName/_gpfs_dell2_emc_modeling_noscrub_Mark.Potts_G2_build/tmpreg_global_T62/global_T62_loproc_updat/stdout

jbathegit commented 4 years ago

For some reason, I wasn't able to clone the above repository on mars:

Jeff.Ator@m71a3 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (46)] % !!
git clone --branch Bufr_DA_test --recurse-submodules https://github.com:mark-a-potts/GSI
Cloning into 'GSI'...
remote: Not Found
fatal: repository 'https://github.com:mark-a-potts/GSI/' not found
Jeff.Ator@m71a3 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (47)] %

jbathegit commented 4 years ago

Is the repository blocked in some way, or some other permissions issue?

kgerheiser commented 4 years ago

Just a bad link (see the :)

https://github.com/mark-a-potts/GSI.git

jbathegit commented 4 years ago

Now I'm getting the following feedback:

Jeff.Ator@m72a1 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (43)] % git clone --branch Bufr_DA_test --recurse-submodules https://github.com/mark-a-potts/GSI.git
Cloning into 'GSI'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 61109 (delta 0), reused 2 (delta 0), pack-reused 61102
Receiving objects: 100% (61109/61109), 50.71 MiB | 22.37 MiB/s, done.
Resolving deltas: 100% (40428/40428), done.
Submodule 'fix' (gerrit:GSI-fix) registered for path 'fix'
Submodule 'libsrc' (gerrit:GSI-libsrc) registered for path 'libsrc'
Cloning into '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.
fatal: clone of 'gerrit:GSI-fix' into submodule path '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix' failed
Failed to clone 'fix'. Retry scheduled
Cloning into '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.
fatal: clone of 'gerrit:GSI-libsrc' into submodule path '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc' failed
Failed to clone 'libsrc'. Retry scheduled
Cloning into '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.
fatal: clone of 'gerrit:GSI-fix' into submodule path '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix' failed
Failed to clone 'fix' a second time, aborting
Jeff.Ator@m72a1 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (44)] %

mark-a-potts commented 4 years ago

Ah, that means you don't have a gerrit alias set up. The fix files are still stored in Vlab. Probably the easiest thing to do is to just copy (or link to) my copy at /gpfs/dell2/emc/modeling/noscrub/Mark.Potts/ProdGSI/fix. After that, run "git submodule init libsrc; git submodule sync libsrc; git submodule update libsrc" to make sure that the libsrc submodule gets populated correctly.

-M


jbathegit commented 4 years ago

OK, I made the symlink to your fix directory, then ran the init and sync steps, but the update step failed:

git submodule update libsrc
Cloning into '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.
fatal: clone of 'gerrit:GSI-libsrc' into submodule path '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc' failed
Failed to clone 'libsrc'. Retry scheduled
Cloning into '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.

Please make sure you have the correct access rights and the repository exists.
fatal: clone of 'gerrit:GSI-libsrc' into submodule path '/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc' failed
Failed to clone 'libsrc' a second time, aborting
Jeff.Ator@m72a1 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI (60)] %

FYI, my own gerrit account is set up such that I normally need to enter a passcode whenever I clone from or push to repositories in VLab.

mark-a-potts commented 4 years ago

Argh. I think you should be able to just copy the libsrc directory from /gpfs/dell2/emc/modeling/noscrub/Mark.Potts/ProdGSI/libsrc. Sorry about that.

-M


jbathegit commented 4 years ago

OK, I copied over the libsrc directory and was able to run the "./ush/build_all_cmake.sh 0 $PWD" step from the GSI directory. I then cd'ed to the build subdirectory and ran the "module purge" and "module use path-to-GSI/modulefiles" commands, but when I then try to run "module load modulefile.ProdGSI.wcoss_d" I get the following:

Lmod has detected the following error: The following module(s) are unknown: "modulefile.ProdGSI.wcoss_d"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore-cache load "modulefile.ProdGSI.wcoss_d"

Also make sure that all modulefiles written in TCL start with the string #%Module

I checked but couldn't find the module using spider, and if I do a "module avail" I don't see anything under "path-to-GSI", which leads me to believe that the prior "module use path-to-GSI/modulefiles" didn't really do anything. I also don't see anything when I do a "find . -name path-to-GSI" from my main directory.

jbathegit commented 4 years ago

Never mind, I now see a modulefiles subdirectory in the main "GSI" directory, so I did a "module use" on that and now I can load the modulefile.ProdGSI.wcoss_d module file. Will now try the ctest command. Fingers crossed...

jbathegit commented 4 years ago

OK, I can reproduce the error now, but I'm not getting very much information out of the stack trace - in my stdout everything says "Unknown" for the routine name, whereas in Mark's stdout on hera it showed routine names and line numbers. How do I get that level of detail in my runs - is there some compile setting (or DEBUG option) that I need to set?

mark-a-potts commented 4 years ago

Yes. When you build with the build_all_cmake script, you can use this command from the GSI directory instead -- "./ush/build_all_cmake.sh DEBUG $PWD"

Thanks,

-M


jbathegit commented 4 years ago

OK thanks, that seemed to work. However, I was hoping that if I reset the environment variable $BUFR_LIBd_DA it might then do the build using a version of the BUFR library that was also compiled with the debug option. However, even if I reset this variable I still see:

BUFR library /gpfs/dell1/nco/ops/nwprod/lib/bufr/v11.3.0/ips/18.0.1/libbufr_v11.3.0_d_64_DA.a set via Environment variable

in the build output, which leads me to believe that it didn't pick up the new value of $BUFR_LIBd_DA within the cmake/Modules/FindBUFR.cmake script. So maybe I have to try hardcoding that value in the script(?)

mark-a-potts commented 4 years ago

Yeah, that is probably not working right. Here is how you can be sure you get the library you want linked in. From the GSI/build directory, open up src/gsi/CMakeFiles/gsi_DBG.x.dir/link.txt. That file has the full link line used to build the GSI in debug mode. Search for bufr and replace the two places where it shows up with the full path to the library you want to use. After that, delete the GSI/build/bin/gsi_DBG.x file and run "make" from the GSI/build directory again. It will re-link with the new library.

-M
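
The relink recipe in the comment above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function name swap_bufr_lib is hypothetical (it is not part of GSI or NCEPLIBS-bufr), the link line is assumed to name the library by a path ending in libbufr_<something>.a rather than by a -l flag, and the replacement path is assumed to contain no "|" or "&" characters.

```shell
#!/bin/sh
# Hypothetical helper (not part of GSI or NCEPLIBS-bufr): point every
# libbufr_*.a entry in a CMake-generated link.txt at a different library.
#   $1 = path to link.txt
#   $2 = full path of the BUFR library to link instead
swap_bufr_lib() {
    link_file=$1
    new_lib=$2
    # Keep a backup so the original link line can be restored later.
    cp "$link_file" "$link_file.orig" || return 1
    # Replace each space-delimited token that ends in libbufr_<name>.a.
    # Assumes $new_lib contains no '|' or '&' (both are special to sed here).
    sed "s|[^ ]*libbufr_[^ /]*\.a|$new_lib|g" "$link_file.orig" > "$link_file"
}
```

From the GSI/build directory, this would be called on src/gsi/CMakeFiles/gsi_DBG.x.dir/link.txt with the full path of the desired library, followed by deleting bin/gsi_DBG.x and running "make" again. Comparing link.txt against link.txt.orig afterwards shows whether both references were actually rewritten.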


jbathegit commented 4 years ago

OK, one more question - is there any way to run this thing using an interactive debugger such as gdb? The only executable I can find anywhere is "ctest", but that doesn't contain any internal debugging symbols (even though I used "DEBUG" as the first argument to ./ush/build_all_cmake.sh!?), so it's basically useless within gdb. What I was hoping was to be able to manually run and step through, say, the gsimain.f90 code, so I could figure out exactly where/why it's SIGSEGV faulting.

The only other alternative I see is to just start putting in print statements everywhere, but of course that's a huge time sink b/c I then have to go back and recompile the entire package every time I change something. And the line numbers in closbf and status where it says it's failing are very puzzling.

mark-a-potts commented 4 years ago

That is a tough one. This test runs on 56 cores I think, so putting it in gdb is going to be hard to work with. Let me see if there is a test that runs on fewer cores (and also fails) that might work.

-M


jbathegit commented 4 years ago

Thanks Mark - I'd really appreciate anything you or anyone else could do to narrow down the scope of what I need to look at here. Just some small code snippet which runs on a single processor but exhibits the same behavior would be ideal, though I'm guessing probably also wishful-thinking ;-) I'm not at all familiar with the GSI hierarchy nor even with running any sort of code across multiple cores, so this is a bit overwhelming for me trying to get a handle on this.

According to the stack trace in the stdout from my DEBUG runs, it looks like the progression of F90 calls is gsimain->gsimod->gsisub->glbsoi->observer->read_obs->read_iasi, which is quite a mound (literally thousands of lines of code) to try to dig through just to get to the point where a call to the BUFRLIB subroutine closbf then seems to trigger a SIGSEGV. And closbf itself is a very straightforward subroutine which basically just closes a bunch of open FORTRAN logical units, so it's pretty innocuous, and the print statements I've added so far don't show any immediate clues as to where the real problem may lie. As with many segfault errors, the real memory violation could be far removed from where the abort actually shows up in the stack trace.

So again, if there's any way for you or the GSI folks (or anyone?) to narrow down the scope of this for me, or just isolate some smaller snippet of code (maybe just all or part of the read_iasi code?) which leads to the same segfault, I'd really appreciate it! Otherwise, I'm really grasping at straws right now trying to figure out where to look next.

aerorahul commented 4 years ago

@jbathegit I can set up something you can work with on a single PE.

aerorahul commented 4 years ago

@mark-a-potts @jbathegit Just a heads up! While working on reading the IASI bufr file in a standalone program, I noticed that read_iasi.f90 has, on L417, a call to close the bufr file immediately followed by an open statement.

call closbf(lnbufr)
open(lnbufr,file=trim(infile2),form='unformatted',status='old',iostat=ierr)

The bufr file has not been opened yet, as far as I can tell. Could there be a runaway condition occurring here? I will continue to extract the bufr bits from this code.

@mark-a-potts It might be worth a shot to comment out line L417 and run the test.

mark-a-potts commented 4 years ago

I think that might have been the problem. I need to do a little more testing, but the code got further that time before crashing in crtm, which it does when it is in debug mode.

mark-a-potts commented 4 years ago

Great catch @aerorahul!

jbathegit commented 4 years ago

Yes, great catch @aerorahul! If the file pointed to by logical unit lnbufr hasn't been "opened" yet, then that could indeed explain it. You'll note I put "opened" in quotation marks here, because I'm talking about opening the file to the BUFRLIB via a call to subroutine openbf, as opposed to just using a FORTRAN open statement to link a filename to a logical unit number.

The reason this is significant is because, when using a DA build of the BUFRLIB, the first call to subroutine openbf is also where all of the internal memory arrays for the BUFRLIB actually get dynamically allocated, based on any sizes specified during earlier calls to function isetprm, or else based on the system defaults built into the BUFRLIB. The point is that, until that first call to subroutine OPENBF is made, there's no internal memory available within the BUFRLIB, which means the library itself is basically unusable, and so any call to any other routine (such as closbf) which tries to access such space could certainly trigger a SIGSEGV violation.

It's always been kind of presumed that nobody would ever try to call closbf without having first called openbf somewhere else in the code, but that wouldn't have triggered a segfault previously if you weren't using a DA build of the BUFRLIB, because the needed memory would have already been allocated at compile time.

mark-a-potts commented 4 years ago

Well, commenting out the closbf call allowed the global_T62 case to run to completion in the loproc configuration, but in the hiproc configuration, it still crashes in closbf (line 61) calling status.f (line 117). I'll see if I can get more information on that using the debug build.

-M


aerorahul commented 4 years ago

@jbathegit Is there a reason why one has to do the Fortran open before a call to openbf? Why can't openbf internally call Fortran open and similarly closbf call Fortran close?

jbathegit commented 4 years ago

@aerorahul this was a conscious design decision made a long time (i.e. decades) ago when the library was first developed, in order to allow maximum portability to different systems which might have different extensions to the Fortran OPEN statement. For example, for a long time SGI-based systems had a non-standard FORM="SYSTEM" extension which allowed a Fortran read of a file as a binary stream without control words. And some systems also had implicit ways to associate logical unit numbers with files on the system outside of an actual Fortran OPEN statement, e.g. using an assign directive, or by simply just naming the file as "fort.#" where # is the logical unit number. Bottom line - a conscious decision was made way back when to keep BUFRLIB as flexible as possible by not including the OPEN statement inside of subroutine openbf.

mark-a-potts commented 4 years ago

So, I added a check in closbf.f (and then renamed it closbf.F) that simply returns if arrays haven't yet been allocated. This seems to fix the "problem" with the GSI. I put problem in quotes because I am not sure it is really a problem with bufrlib as much as with GSI, but I think it does make bufrlib more robust to have this check in place. It looks like this--

      USE MODA_NULBFR

      INCLUDE 'bufrlib.prm'

C-----------------------------------------------------------------------
C-----------------------------------------------------------------------

#ifdef DYNAMIC_ALLOCATION
      if(.not. allocated(NULL) ) then
        write(6,*) 'WARNING calling closbf without having called openbf'
        return
      endif
#endif

      CALL STATUS(LUNIT,LUN,IL,IM)
      IF(IL.GT.0 .AND. IM.NE.0) CALL CLOSMG(LUNIT)
      IF(IL.NE.0 .AND. NULL(LUN).EQ.0) CALL CLOSFB(LUN)
      CALL WTSTAT(LUNIT,LUN,0,0)

C  CLOSE fortran UNIT IF NULL(LUN) = 0
C  -----------------------------------

      IF(NULL(LUN).EQ.0) CLOSE(LUNIT)

      RETURN
      END

edwardhartnett commented 3 years ago

Is this issue still active?

mark-a-potts commented 3 years ago

I think it is resolved now.

jbathegit commented 3 years ago

Yes, it's resolved. Even though this is really an issue that should be fixed in the GSI code, I agreed to pull the CLOSBF "fix" over into the code baseline.