MESH-Model / MESH-Dev

This repository contains the official MESH development code, which is the basis for the 'tags' listed under the MESH-Releases repository. The same tags are listed under this repository. Legacy branches and utilities have also been ported from the former SVN (Subversion) repository. Future developments must create 'forks' from this repository.

1860 Segmentation fault on Amazon Cloud #48

Closed mee067 closed 1 month ago

mee067 commented 4 months ago

I compiled r1860_ME and r1860_ME_ZT on Amazon Cloud (Amazon Linux 2023, using gfortran 11.4.1) and got a segmentation fault when trying to run them within the Yukon Forecasting System.

` RUNCLASS36 is active.
   BASEFLOW component is ACTIVE.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f1ecd423832 in ???
#1  0x7f1ecd422a05 in ???
#2  0x7f1ecd054dcf in ???
#3  0x4a6e8a in ???
#4  0x4aeaa8 in ???
#5  0x63ef20 in ???
#6  0x4024bc in ???
#7  0x7f1ecd03feaf in ???
#8  0x7f1ecd03ff5f in ???
#9  0x4024f4 in ???
#10  0xffffffffffffffff in ???`

I also got the same segmentation fault on an older instance (Amazon Linux 1, AMI 2018.03, which reached its end of life and forced me to create a new instance with the recent Linux version). That instance has gfortran 6.4.1.

I tried to compile older code (r1745) on the new instance but it gives errors:

`./LSS_Model/CLASS/3.6/src/CLASSW.f:638:49:

  450 |      4                 ISAND, IWF, IG, ILG, IL1, IL2, BULK_FC,
      |                       2
......
  638 |      3                 DELZW, THPOR, THLMIN, BI, DIDRN,
      |                                                 1
Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(4)/INTEGER(4)).
./LSS_Model/CLASS/3.6/src/CLASSW.f:711:49:

  450 |      4                 ISAND, IWF, IG, ILG, IL1, IL2, BULK_FC,
      |                       2
......
  711 |      3                 DELZW, THPOR, THLMIN, BI, DIDRN,
      |                                                 1
Error: Type mismatch between actual argument at (1) and actual argument at (2) (REAL(4)/INTEGER(4)).`

Any ideas?

kasra-keshavarz commented 4 months ago

It's worth trying with Ubuntu EC2 instances. I've tested recently with Ubuntu and it seems to work just fine.

mee067 commented 4 months ago

Thanks, that also came to my mind

mee067 commented 4 months ago

Same segmentation fault on an Amazon Cloud Ubuntu instance for r1860. Only this time I get some more info on the routine throwing the error:

` RUNCLASS36 is active.
   BASEFLOW component is ACTIVE.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f2cc1823960 in ???
#1  0x7f2cc1822ac5 in ???
#2  0x7f2cc144251f in ???
#3  0x560859f26ca7 in __output_variables_MOD_output_variables_group_update_ts
#4  0x560859f2e8a8 in __output_variables_MOD_output_variables_update
#5  0x56085a0c7ded in MAIN__
#6  0x560859e804ce in main
./scripts/perform_capa_hindcast.sh: line 39: 22888 Segmentation fault (core dumped) $mesh_exe`

I also get the same compilation error for r1745 on Ubuntu. I think it is related to the version of gfortran, which is almost the same on both Amazon Linux (11.4.1) and Ubuntu (11.4.0).

mee067 commented 4 months ago

I looked at the compilation error for r1745. I am not sure which variable but I looked at ISAND and it is integer all the way.

I also found that the new makefile (1860) has some additional compiler options compared to 1745, which may have suppressed the conversion issue for 1860. I am not very conversant with compiler options, but I think it could be -Wconversion. However, that option is also in the makefile of 1745, so I am not sure what the issue is or which variable gets converted implicitly.

Any feedback?

mee067 commented 4 months ago

I recompiled r1860 with symbols on and this is what I got:

`Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x7f03b6023960 in ???
#1  0x7f03b6022ac5 in ???
#2  0x7f03b5c4251f in ???
#3  0x55995a34b1c6 in __output_variables_MOD_output_variables_group_update_ts
    at ./Driver/MESH_Driver/output_variables.f90:1389
#4  0x55995a34819c in __output_variables_MOD_output_variables_update_ts
    at ./Driver/MESH_Driver/output_variables.f90:2078
#5  0x55995a3446d8 in __output_variables_MOD_output_variables_update
    at ./Driver/MESH_Driver/output_variables.f90:2530
#6  0x55995a669415 in runmesh
    at ./Driver/MESH_Driver/MESH_driver.f90:847
#7  0x55995a66dd89 in main
    at ./Driver/MESH_Driver/MESH_driver.f90:97
./perform_capa_hindcast.sh: line 39: 32760 Segmentation fault (core dumped) $mesh_exe`

line 97 in MESH_driver is "use mpi_module" which is not active in this compilation and there is a stub for it so I am not sure why it objects.

I traced the rest, and line 1389 in output_variables.f90 is the second line in this block:

        if (associated(group%ican)) then
            where (group%tacan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if

tacan, qacan, and uvcan are new variables which I added as outputs; I copied their blocks from tcan. Maybe this needs review @dprincz. I think tacan and qacan were already there internally but not as outputs, while uvcan wasn't. I used to compile with the Intel compiler and did not get that issue, and the variables did produce the required output when I tested them.

mee067 commented 4 months ago

So, I managed to compile 1860 on Amazon Linux 2023 after commenting out some code blocks related to the tacan, qacan and uvcan outputs.

But compiling older code still hits the type mismatch issue. Comparing CLASSW.f and RUNCLASS_module.f90 across versions does not indicate where the problem is. I need to run older code for some of the setups that I could not fully migrate to 1860.

dprincz commented 1 month ago

This might be a known issue I've fixed.

Find and update this block in output_variables.f90 from this:

        if (associated(group%ican)) then
            where (group%tacan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if
        if (associated(group%ican)) then
            where (group%qacan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if
        if (associated(group%ican)) then
            where (group%uvcan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if

To this:

        if (associated(group%ican) .and. associated(group%tacan)) then
            where (group%tacan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if
        if (associated(group%ican) .and. associated(group%qacan)) then
            where (group%qacan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if
        if (associated(group%ican) .and. associated(group%uvcan)) then
            where (group%uvcan > 0.0)
                group%ican = 1.0
            elsewhere
                group%ican = 0.0
            end where
        end if

Please close the thread if this resolves the issue.
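
For illustration only, here is a minimal, self-contained sketch of the guarded pattern above. The type `output_group`, the routine `update_counter`, and the storage arrays are hypothetical stand-ins, not the definitions in output_variables.f90. The point it shows: a `where` mask on a pointer array that is not associated is undefined behaviour, so it may appear to work with one compiler and crash with another, whereas `associated()` is safe to call on a disassociated pointer.

```fortran
program guard_demo
    implicit none

    ! Hypothetical stand-in for the output group type; only the two
    ! fields needed for the illustration are declared.
    type :: output_group
        real, dimension(:), pointer :: tacan => null()
        real, dimension(:), pointer :: ican => null()
    end type output_group

    type(output_group) :: group
    real, dimension(3), target :: ican_store, tacan_store

    ! Case 1: the counter exists but the source field was never activated.
    ican_store = -1.0
    group%ican => ican_store
    call update_counter(group)
    print *, 'tacan not associated, ican left untouched: ', all(group%ican == -1.0)
    if (.not. all(group%ican == -1.0)) error stop 'counter was modified'

    ! Case 2: both fields are associated, so the counter is updated.
    tacan_store = [273.15, 0.0, 280.0]
    group%tacan => tacan_store
    call update_counter(group)
    print *, 'ican after update: ', group%ican
    if (.not. all(group%ican == [1.0, 0.0, 1.0])) error stop 'unexpected counter values'

contains

    subroutine update_counter(g)
        type(output_group), intent(inout) :: g

        ! Both pointers are tested before either one is dereferenced.
        ! Fortran does not guarantee short-circuit evaluation of .and.,
        ! which is harmless here because associated() never dereferences.
        if (associated(g%ican) .and. associated(g%tacan)) then
            where (g%tacan > 0.0)
                g%ican = 1.0
            elsewhere
                g%ican = 0.0
            end where
        end if
    end subroutine update_counter

end program guard_demo
```

In the second case the counter should end up as 1.0, 0.0, 1.0, because only the first and third `tacan` values are greater than zero.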

mee067 commented 1 month ago

Well, that resolved the issue on AWS. I did a simple test after restoring the canopy level outputs.

Before we close this issue, please answer the following questions: ICAN is the number of PFTs, which is 4 (i.e. fixed), so why are those canopy level outputs conditioned on ICAN? I also forgot the difference between tcan and tacan! Vegetation temperature vs air temperature within the canopy, as CLASS defines them. They do not sound very different, do they?

Note that all canopy level variables already existed in CLASS, I just exposed them to MESH to get output. Even the MESH variables TACAN and QACAN were there. Only UVCAN wasn't.

dprincz commented 1 month ago

Different ican. If you look in output_variables, you'll see a few i-values which are used for calculating averages when only valid values should be considered (e.g., shortwave radiation, snow, etc.). These are averaging counters local to the routine.

This is by design: the equivalent of NO_DATA values, where it's not appropriate to include "0.0" in an average, are omitted when calculating a representative average value.
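
The averaging idea described above can be sketched roughly as follows. This is an assumption-laden illustration, not the MESH implementation: the array names, the sizes, and the way the counter is accumulated over time steps are all hypothetical, and zero is used as the "no data" marker only because the blocks in this thread test `> 0.0`.

```fortran
program valid_average_demo
    implicit none

    integer, parameter :: n_cells = 3, n_steps = 4
    real :: values(n_cells, n_steps)   ! hypothetical per-time-step field
    real :: total(n_cells), counter(n_cells), mean(n_cells)
    integer :: t

    ! Zero stands in for "no valid value at this time step".
    values(1, :) = [2.0, 4.0, 0.0, 0.0]
    values(2, :) = [0.0, 0.0, 0.0, 0.0]
    values(3, :) = [1.0, 2.0, 3.0, 6.0]

    total = 0.0
    counter = 0.0
    do t = 1, n_steps
        total = total + values(:, t)
        ! Count only the time steps that carry a valid (positive) value.
        where (values(:, t) > 0.0) counter = counter + 1.0
    end do

    ! Divide by the number of valid steps, not by n_steps, and avoid
    ! dividing by zero where no step was valid.
    where (counter > 0.0)
        mean = total/counter
    elsewhere
        mean = 0.0
    end where

    print *, 'valid counts: ', counter
    print *, 'means:        ', mean
    if (.not. all(counter == [2.0, 0.0, 4.0])) error stop 'unexpected counts'
    if (.not. all(mean == [3.0, 0.0, 3.0])) error stop 'unexpected means'
end program valid_average_demo
```

For the first cell, dividing by all four time steps would give 1.5, while dividing by the two valid steps gives 3.0, which is the kind of difference the counter is meant to capture.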

I believe tcan is the temperature of the canopy while tacan is the ambient temperature within the canopy. I think they should be similar. From my understanding, this is why tacan as a prognostic state is set to tcan when passing between time-steps and resuming previous run-states.

mee067 commented 1 month ago

so ican = number of canopies within a tile which has a maximum of 4. It can be zero if FCAN(1..4) = 0, so it is either rock (FCAN(5) = 0) or some impervious cover (FCAN(5)>0), right? This protects against the case of having an impervious type only like glaciers, water, or urban tiles.

btw, I know that if sum(FCAN(1..5)) < 1, it will assign the remainder to rock with hard-coded properties. What if the sum > 1, does it scale things down to sum to 1?

dprincz commented 1 month ago

Hi Mohamed,

ican in this context doesn't have anything to do with CLASS's definition of ican. For questions specific to CLASS's instance of ican and icp1, I suggest creating a separate issue tagged for documentation.

Dan

mee067 commented 1 month ago

OK, I will move the question regarding CLASS ican and fcan to another thread. For output purposes, how is ican assigned? I searched the module and could not figure it out.