ec-jrc / lisflood-lisvap

Lisflood OS (LISVAP)
https://ec-jrc.github.io/lisflood-lisvap/
European Union Public License 1.2

Bus error - possible memory leak? #39

Open farinfa opened 3 years ago

farinfa commented 3 years ago

Hi,

I already posted this issue on the JEODPP help page. I'm running LISVAP on a JEODPP terminal. The installation of the model on JEODPP was completed successfully a few months ago, and the model has been used on this infrastructure several times with no issues. I have used the same settings in the past to process very large datasets (more than 150 years).

Last week I was running a 40-year process (14974 time steps) and it stopped several times between days 6000 and 13000 with a "bus error" (see the attached screenshot). [image]

Apparently the issue was related to the shared memory available in the terminal (5 GB for all users): I got access to a terminal with 25 GB of memory available and the process ran smoothly.

Now I'm running a longer process (more than 50000 time steps) on this last terminal, and I'm getting the "bus error" again (at around 20k time steps). Apparently this is the only process running on the terminal (no other users at the moment). Could the issue be related to a memory leak? (It was working in the past, though.)

Thanks

gnrgomes commented 3 years ago

Hi, Are you using the latest version?

While we investigate the possible memory leak, please try, if you can, to generate the output split by year. You just need to add the splitOutput flag to your settings file, next to the other setoption flags.

Basically you will get multiple files like: es_1975.nc es_1976.nc ...
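
When a long run is split by year, it helps to see which yearly files were actually written before the process stopped. The sketch below is a minimal helper that assumes the `<variable>_<year>.nc` naming shown in the example above; the function names and the directory layout are hypothetical, and a file that exists may still be incomplete if the run was killed while writing it.

```python
import re
from pathlib import Path

# Assumed naming pattern, taken from the example above: <variable>_<year>.nc
YEARLY_FILE = re.compile(r"^(?P<var>[A-Za-z0-9]+)_(?P<year>\d{4})\.nc$")


def years_on_disk(output_dir, variable):
    """Return the sorted years for which <variable>_<year>.nc exists."""
    years = []
    for path in Path(output_dir).iterdir():
        match = YEARLY_FILE.match(path.name)
        if match and match.group("var") == variable:
            years.append(int(match.group("year")))
    return sorted(years)


def missing_years(output_dir, variable, first_year, last_year):
    """Return the years in [first_year, last_year] without an output file."""
    present = set(years_on_disk(output_dir, variable))
    return [y for y in range(first_year, last_year + 1) if y not in present]
```

Calling `missing_years("out", "es", 1975, 2020)` on a hypothetical `out` directory would list the years that still need to be produced.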

farinfa commented 3 years ago

Hi,

This is the version I'm using: [image]

I'm trying again with the splitOutput option enabled, as you suggested.
If it stops again, is there a way to restart from the year in which it stopped? (Would it be enough to change StepStart and reduce the number of time steps accordingly?) Thanks
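
The thread does not confirm whether LISVAP can be resumed this way. Assuming daily time steps and that StepStart is counted from 1 on the calendar start date (both are assumptions, as are the function name and the dates in the usage note), the step arithmetic behind the question could be sketched as:

```python
from datetime import date


def restart_steps(calendar_start, run_end, restart_from):
    """Return (step_start, remaining_steps) for a daily run resumed at restart_from.

    Steps are numbered from 1 on calendar_start. This numbering is an
    assumption: check how your settings file defines StepStart first.
    """
    if not calendar_start <= restart_from <= run_end:
        raise ValueError("restart_from must lie within the original run period")
    step_start = (restart_from - calendar_start).days + 1
    remaining_steps = (run_end - restart_from).days + 1
    return step_start, remaining_steps
```

For a hypothetical run from `date(1981, 1, 1)` to `date(2021, 12, 31)` that is resumed on `date(2000, 1, 1)`, the function returns the first step to run and the number of steps left, so the two values always add up to the original run length plus one.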

gnrgomes commented 3 years ago

Could you please update your version? The latest one is 1.0.0, which has the splitOutput flag.

farinfa commented 3 years ago

Yes, sorry. In fact now it is not generating the yearly outputs...

farinfa commented 3 years ago

I tried to install the newer version following the second option of the installation guide, that is, by running pip install lisflood-lisvap inside a conda virtual environment, as described in this post, but the version is still the same. [image]

[image]

gnrgomes commented 3 years ago

Have you tried to upgrade? pip install --upgrade lisflood-lisvap
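
To confirm which version pip actually installed in the active environment, the standard library can be queried directly. This is a minimal sketch using `importlib.metadata` (Python 3.8+); the helper name is hypothetical, and it must be run inside the same conda environment that runs LISVAP.

```python
from importlib import metadata


def installed_version(package="lisflood-lisvap"):
    """Return the installed version string of a pip package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version())
```

If this prints an older version than expected after an upgrade, the upgrade most likely went into a different environment than the one being used to run the model.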

farinfa commented 3 years ago

pip install --upgrade lisflood-lisvap image

gnrgomes commented 3 years ago

@farinfa could you please upgrade now?

pip install --upgrade lisflood-lisvap

farinfa commented 3 years ago

Hi @gnrgomes, pip install --upgrade lisflood-lisvap works now. The procedure ends with the "Successfully installed" message: [image]

To make it work, I had to install pyproj. The installed version is now 1.0.2: [image]

When I try to run the model with my data, however, I get a long list of errors. Did the settings file change? [image]

gnrgomes commented 3 years ago

It changed to have some options that have a default value, meaning it should run regardless of your settings file. Could you please share your settings file?

farinfa commented 3 years ago

Sorry for my late reply: I wanted to run some tests before getting back to you on this.

> It changed to have some options that have a default value, meaning it should run regardless of your settings file.

Sorry, my bad: I misspelled something, and that was causing the IndexError you saw in my previous post.

> Could you please share your settings file?

Enclosed: settings_FF_sample.txt

I managed to run the updated version of LISVAP with my data, but this did not solve the original 'Bus error' problem (the process still gets killed before 20k time steps). The new LISVAP does, however, manage to complete the full simulation with the "splitOutput" option enabled. Could this 'Bus error' be related to the temporary storage and writing of the output data? (Again, this was working with no issues in the past.)

gnrgomes commented 3 years ago

A bus error usually means that you are trying to access memory that does not exist, meaning you have exceeded the limit of usable memory. Using my output files, I estimate that just one of your files might be around 28 GB for the full 151 years. All of that is held in memory, which means that swap memory (on disk) will also be used. The splitOutput option keeps only 1 year in memory at a time, so it saves quite a lot of memory and may also make the program run faster, because it avoids swap memory and thus reduces disk accesses. If you can, use this option.

NOTE: There is also the splitInput option, which reads the input as split files. These two options work independently of each other.
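
The difference between holding the full run and holding one year can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a dense array of 32-bit floats and a hypothetical 350 x 360 grid; neither is taken from LISVAP or from the actual dataset, so the numbers only illustrate the scaling.

```python
BYTES_PER_VALUE = 4  # assumption: values are held as 32-bit floats


def array_gib(time_steps, rows, cols, bytes_per_value=BYTES_PER_VALUE):
    """Size in GiB of a dense (time, rows, cols) array."""
    return time_steps * rows * cols * bytes_per_value / 1024**3


if __name__ == "__main__":
    rows, cols = 350, 360  # hypothetical grid, not the real domain
    full_run = array_gib(151 * 365, rows, cols)  # 151 years, leap days ignored
    one_year = array_gib(366, rows, cols)        # a single (leap) year
    print(f"full run: {full_run:.1f} GiB, one year: {one_year:.2f} GiB")
```

The memory needed grows linearly with the number of time steps held at once, which is why keeping a single year in memory reduces the footprint by roughly the number of years in the run.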

farinfa commented 3 years ago

OK, I'll use it this way.

Thank you

farinfa commented 3 years ago

Sorry to bother you again @gnrgomes: the 'Bus error' issue now happens even with the splitOutput option enabled...

gnrgomes commented 3 years ago

No problem @farinfa, your input about the project is most welcome.

You are working with a very large dataset compared to our usual dataset of 30 years. Could you please tell me the sizes of your input files?

farinfa commented 3 years ago

On the order of 150 GB each. Again, I want to point out that I have already processed these same data in the past with no issues (I am now rerunning them because of a change in the base map).

gnrgomes commented 3 years ago

You were able to run using these inputs on Lisvap v0.4.4 ? Or what version did you use? It is important to know this so I can check the differences in the code. Are you using the same server to run with the same setup? Did something change on your side? Do you have enough disk space?

farinfa commented 3 years ago

> You were able to run using these inputs on Lisvap v0.4.4 ? Or what version did you use? It is important to know this so I can check the differences in the code.

Yes, I ran these on JEODPP with LISVAP version 0.4.4 (now updated).

> Are you using the same server to run with the same setup? Did something change on your side?

I first asked the JEODPP people about this, and they said nothing has changed on their side. Later they gave me access to a terminal with more memory, but the issue persisted.

> Do you have enough disk space?

More than 1.5 TB of free space.

gnrgomes commented 3 years ago

I will investigate a possible memory leak in the LISVAP code or in any of the libraries it uses. I'll get back to you as soon as I can.
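
One generic way to look for a leak in a time-stepping loop is to sample the traced memory at regular intervals and check whether it keeps growing. The sketch below uses Python's standard `tracemalloc` module; the `memory_growth` helper and the `leaky_step` function are hypothetical stand-ins and are not part of LISVAP. Note that `tracemalloc` only sees allocations made through Python's allocator, so memory held inside compiled libraries may not show up.

```python
import tracemalloc


def memory_growth(step, n_steps, sample_every=100):
    """Call step(i) n_steps times and return sampled (step, traced_bytes) pairs."""
    samples = []
    tracemalloc.start()
    try:
        for i in range(1, n_steps + 1):
            step(i)
            if i % sample_every == 0:
                current, _peak = tracemalloc.get_traced_memory()
                samples.append((i, current))
    finally:
        tracemalloc.stop()
    return samples


if __name__ == "__main__":
    kept = []  # hypothetical leak: every step appends and nothing is released

    def leaky_step(i):
        kept.append(bytearray(10_000))

    for step_number, traced in memory_growth(leaky_step, 500):
        print(step_number, traced)
```

A traced size that rises steadily with the step number points to objects being retained between steps, while a flat profile suggests the growth happens elsewhere, for example in a compiled I/O library.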