PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Too many open files issue again? #775

Open serbinsh opened 8 years ago

serbinsh commented 8 years ago

I just got this error running PEcAn at the command line with SIPNET. We used to get errors like this a lot in the past. Not sure if this is a gremlin in the machine or if the issue is resurfacing. Documenting it here for now until I can be sure.

[sserbin@modex US-Syv]$ vi workflow.log
[sserbin@modex US-Syv]$ tail -f workflow.log
   Reading output for Years: 2001 - 2005 in directory:
   /data/sserbin/Modeling/sipnet/US-Syv/pecan.1/out/2000009976 including
   files 2002.nc 2003.nc 2004.nc 2005.nc
Error in R_nc4_open: Too many open files
Error in nc_open(ncfile) :
  Error in nc_open trying to open file /data/sserbin/Modeling/sipnet/US-Syv/pecan.1/out/2000009976/2005.nc
Calls: get.results ... read.ensemble.output -> sapply -> lapply -> read.output -> nc_open
Error during wrapup: cannot open the connection
Execution halted
serbinsh commented 8 years ago

More errors like this. This time with a DALEC run:

2016-04-02 12:19:04 INFO   [read.output] : 
   Reading output for Years: 2000 - 2002 in directory: 
   /data/Model_Output/pecan.output/PEcAn_2000000138/out/2000010646 
   including files 2000.nc 2001.nc 2002.nc 
Error in R_nc4_open: Too many open files
Error in nc_open(ncfile) : 
  Error in nc_open trying to open file /data/Model_Output/pecan.output/PEcAn_2000000138/out/2000010646/2001.nc
Calls: run.ensemble.analysis ... ensemble.ts -> read.ensemble.ts -> read.output -> nc_open
Error during wrapup: cannot open the connection
Execution halted
Error: failed to load sRGB colorspace file
Error during wrapup: cannot open the connection
Execution halted
sh: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: Error 24
serbinsh commented 8 years ago

Here is the run setup:

pecan.xml

<?xml version="1.0" encoding="UTF-8"?>
<pecan>
 <info>
  <notes>test run DALEC ponderosa pine</notes>
  <userid>2000000002</userid>
  <username>Shawn P. Serbin</username>
  <date>2016/04/02 11:44:53 -0400</date>
 </info>
 <outdir>/data/Model_Output/pecan.output/PEcAn_2000000138</outdir>
 <database>
  <bety>
   <user>bety</user>
   <password>bety</password>
   <host>localhost</host>
   <dbname>bety</dbname>
   <driver>PostgreSQL</driver>
   <write>TRUE</write>
  </bety>
 </database>
 <pfts>
  <pft>
   <name>dalec.pondersoa_pine</name>
   <constants>
    <num>1</num>
   </constants>
   <outdir>/data/Model_Output/pecan.output/PEcAn_2000000138/pft/dalec.pondersoa_pine</outdir>
  </pft>
  <pft>
   <name>dalec.pondersoa_pine</name>
   <constants>
    <num>2</num>
   </constants>
   <outdir>/data/Model_Output/pecan.output/PEcAn_2000000138/pft/dalec.pondersoa_pine</outdir>
  </pft>
 </pfts>
 <meta.analysis>
  <iter>3000</iter>
  <random.effects>FALSE</random.effects>
  <threshold>1.2</threshold>
  <update>AUTO</update>
 </meta.analysis>
 <ensemble>
  <size>200</size>
  <variable>NPP</variable>
  <start.year>2000</start.year>
  <end.year>2002</end.year>
 </ensemble>
 <sensitivity.analysis>
  <quantiles>
   <sigma>-3</sigma>
   <sigma>-2</sigma>
   <sigma>-1</sigma>
   <sigma>0</sigma>
   <sigma>1</sigma>
   <sigma>2</sigma>
   <sigma>3</sigma>
  </quantiles>
  <variable>NPP</variable>
  <start.year>2000</start.year>
  <end.year>2002</end.year>
 </sensitivity.analysis>
 <model>
  <id>1000000002</id>
  <type>DALEC</type>
  <revision>Quaife Evergreen RSE08</revision>
  <delete.raw>FALSE</delete.raw>
  <binary>/data/software/dalec_EnKF_pub//dalec_seqMH</binary>
 </model>
 <workflow>
  <id>2000000138</id>
 </workflow>
 <run>
  <site>
   <id>766</id>
   <met.start>1999-01-01 00:00:00</met.start>
   <met.end>2003-12-31 00:00:00</met.end>
   <name>Metolius-first young aged pine (US-Me5)</name>
   <lat>44.4372</lat>
   <lon>-121.567</lon>
  </site>
  <inputs>
   <met>
    <id>2000000004</id>
    <path>/data/Model_Data/dalec_EnKF_pub/input_data/oregon/dalec_drivers.OREGON.no_obs.dat</path>
   </met>
  </inputs>
  <start.date>2000/01/01</start.date>
  <end.date>2002/12/31</end.date>
  <dbfiles>/data/Model_Output/pecan.output/dbfiles</dbfiles>
  <host>
   <name>localhost</name>
   <qsub>qsub -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash</qsub>
   <qsub.jobid>Your job ([0-9]+) .*</qsub.jobid>
   <qstat>qstat @JOBID@ || echo DONE</qstat>
   <rundir>/data/Model_Output/pecan.output/PEcAn_2000000138/run</rundir>
   <outdir>/data/Model_Output/pecan.output/PEcAn_2000000138/out</outdir>
  </host>
 </run>
 <email>
  <to>sserbin@bnl.gov</to>
  <url>https://modex.bnl.gov:443/pecan/08-finished.php?workflowid=2000000138</url>
 </email>
 <rundir>/data/Model_Output/pecan.output/PEcAn_2000000138/run</rundir>
 <modeloutdir>/data/Model_Output/pecan.output/PEcAn_2000000138/out</modeloutdir>
</pecan>
serbinsh commented 8 years ago

Need to check the get.model.output code to make sure files are being closed properly.

ashiklom commented 8 years ago

I think I might be having a related problem. Reading ensemble outputs is snappy at first but then becomes progressively slower as more files are loaded. I think there might be a memory leak in the read.output function. Will try to investigate.

ashiklom commented 8 years ago

I think I may have found the issue. The offending lines are 126-128 in read.output.

I think the problem is that each nc file is opened twice by nc_open -- once inside the tryl validity check and again on the next line. Only the latter handle is closed, so as we loop through a batch of files we accumulate open file descriptors. I'll open a pull request to try to fix this.
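
A minimal sketch of the suspected pattern and one possible fix, assuming the ncdf4 package; tryl is emulated here with plain tryCatch, and the function and variable names are illustrative rather than the actual PEcAn source:

library(ncdf4)

read_year_leaky <- function(ncfile, varname = "NPP") {
  # BUG: this first handle is discarded without nc_close, leaking one
  # file descriptor per call.
  valid <- !inherits(tryCatch(nc_open(ncfile), error = identity), "error")
  if (!valid) return(NULL)
  nc <- nc_open(ncfile)  # second open; only this handle is ever closed
  on.exit(nc_close(nc))
  ncvar_get(nc, varname)
}

read_year_fixed <- function(ncfile, varname = "NPP") {
  # Open once; a failed open doubles as the validity check, and on.exit
  # closes the handle even if ncvar_get fails.
  nc <- tryCatch(nc_open(ncfile), error = function(e) NULL)
  if (is.null(nc)) return(NULL)
  on.exit(nc_close(nc))
  ncvar_get(nc, varname)
}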

ashiklom commented 8 years ago

Never mind -- I tried that fix and it's still just as slow.

I wanted to run Rprofmem() on it to see if anything was using up a lot of memory, but apparently R can't do it on pecan2 -- "Memory profiling not available on this system". I might try it on my local machine.

bcow commented 8 years ago

Is this something that needs root permissions? You can send me what you were going to run and I'll try it on test-pecan.

ashiklom commented 8 years ago

@bcow You can give it a whirl, but I don't think so. It's something you run from within R, so it shouldn't need root permissions. From the documentation, it seems like you'd need to recompile R with specific settings enabled, which we definitely don't want to try on a production machine.

Here are the relevant man pages:

https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rprof.html
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rprofmem.html
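
For reference, a hedged sketch of what such a profiling run might look like. Rprofmem only works in an R built with --enable-memory-profiling (presumably why it fails on pecan2); the example values come from the DALEC log above, and the read.output argument order is illustrative:

runid  <- "2000010646"
outdir <- "/data/Model_Output/pecan.output/PEcAn_2000000138/out/2000010646"
Rprofmem("read_output_mem.out")   # start writing one record per allocation
out <- read.output(runid, outdir, 2000, 2002, "NPP")
Rprofmem(NULL)                    # stop profiling
head(readLines("read_output_mem.out"))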

serbinsh commented 8 years ago

@ashiklom @bcow I think we can start by confirming that the recent changes to PEcAn haven't accidentally caused this by removing or modifying the calls that close netCDF files. This has happened before: we changed the code / logic / flow and didn't realize the call to close each file after processing / extracting output was no longer being reached, leaving files open. Not a memory leak, just a coding mistake.

serbinsh commented 8 years ago

Actually, after looking back over it, I bet the issue is in the read-model-output step.

ashiklom commented 8 years ago

Is there a good way to view all open netCDF files? I tried adding a showConnections() call inside read.output, but it reported no open files, at least not after the nc_close statement.


serbinsh commented 8 years ago

It may be in read.ensemble.output().

If you don't locate it, I'll take another look in a little bit. It may not be in the read-output / convert-to-standardized-netCDF step; it could be a little further downstream, when the files are opened and used...

ashiklom commented 8 years ago

All read.ensemble.output does is loop over ensemble IDs and call read.output -- it doesn't actually open any files. There are issues in both functions with inefficient dynamic memory allocation, but they shouldn't cause a "too many open files" error.
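
For context, the rough shape of read.ensemble.output as described above (a paraphrase with illustrative names, not the actual PEcAn source), which is why any leaked handles must originate one level down in read.output:

read_ensemble_sketch <- function(runids, outdir, start.year, end.year, variables) {
  # Loop over ensemble member run IDs; all file I/O happens in read.output.
  sapply(runids, function(id) {
    read.output(id, file.path(outdir, id), start.year, end.year, variables)
  })
}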

robkooper commented 8 years ago

Can you check the open files with lsof while the process is running? You might be able to use lsof -r 2 -c R to see all files opened by R (hit Ctrl-C when done) and watch which new files appear.

robkooper commented 8 years ago

One option is to add some code that does:

system2("ls", c("-l", "/proc/self/fd"))

Then look at the output; this will list all open files. Not sure if there is an easier way. For example, I did:

> valid.nc <- tryl(nc_open("2004.nc"))
> nc <- nc_open("2004.nc")
> system2("ls", c("-l", "/proc/self/fd"))
total 0
lrwx------ 1 kooper users 64 Apr  4 09:22 0 -> /dev/pts/0
lrwx------ 1 kooper users 64 Apr  4 09:22 1 -> /dev/pts/0
lrwx------ 1 kooper users 64 Apr  4 09:22 2 -> /dev/pts/0
lr-x------ 1 kooper users 64 Apr  4 09:22 3 -> /home/kooper/output/PEcAn_99000000011/out/99000000008/2004.nc
lr-x------ 1 kooper users 64 Apr  4 09:22 4 -> /home/kooper/output/PEcAn_99000000011/out/99000000008/2004.nc
lr-x------ 1 kooper users 64 Apr  4 09:22 5 -> /proc/27957/fd

As you can see, the file 2004.nc is open twice, so the bug Alexey found is definitely real.
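
A small helper (Linux-only; the name is hypothetical) that wraps the /proc trick above so it can be sprinkled through the workflow to watch which descriptors accumulate:

open_fd_paths <- function() {
  fds <- list.files("/proc/self/fd", full.names = TRUE)
  # Each entry is a symlink from a descriptor number to the open file's path.
  setNames(Sys.readlink(fds), basename(fds))
}
open_fd_paths()   # names are fd numbers, values are the paths they point at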

ashiklom commented 8 years ago

Actually, I don't think @bcow and I are having any issues -- I thought the process was slowing down because it made RStudio unresponsive, but in fact the web-based RStudio was just bogged down by all the logger output. The actual process ran at an even pace throughout. It's probably still worth trying Rob's suggestions, though.

tonygardella commented 8 years ago

@serbinsh @ashiklom @robkooper @bcow It seems that Rob confirmed files are being opened and not closed. Where and how do we fix this?

ashiklom commented 8 years ago

@tonygardella We need to figure out which function is opening the files. I think interspersing system calls to whatever commands @robkooper is using to check open files throughout the functions would let us home in on where files are being opened but not closed. Or, rather than coding them in, you can use R's debug function to step through function calls and manually check which files are open at each step.

dlebauer commented 8 years ago

p.s. RStudio provides a nice interface to the debugging environment.

p.p.s. debug turns debugging on for the specified function; debugonce is handy because it turns debugging off automatically when you end the debugging session.
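
A hedged sketch of that workflow (assuming read.output is on the search path; the exact call that triggers it will vary):

debugonce(read.output)   # debugging switches itself off after one call
# ...now trigger the workflow step that calls read.output; at each
# Browse[2]> prompt, inspect the open descriptors:
system2("ls", c("-l", "/proc/self/fd"))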

tonygardella commented 8 years ago

I ran ensembles at several sizes and got different errors at different sizes. There does seem to be a threshold in ensemble size between 190 and 200 above which a run fails at the ensemble step.

For 10,000:

Error in postgresqlExecStatement(conn, statement, ...) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  deadlock detected
DETAIL:  Process 23213 waits for RowShareLock on relation 94001 of database 25609; blocked by process 26719.
Process 26719 waits for AccessExclusiveLock on relation 94165 of database 25609; blocked by process 23213.
HINT:  See server log for query details.
)
Warning in postgresqlQuickSQL(conn, statement, ...) :
  Could not create executeINSERT INTO runs (model_id, site_id, start_time, finish_time, outdir, created_at, ensemble_id, parameter_list) values ('1000000002', '500', '2004/01/01', '2004/12/31', '', '2016-04-12 15:12:37', 1000003067, 'ensemble=5034')
2016-04-12 15:12:38 SEVERE [db.query] : 
   Error executing db query 'INSERT INTO runs (model_id, site_id, 
   start_time, finish_time, outdir, created_at, ensemble_id, 
   parameter_list) values ('1000000002', '500', '2004/01/01', '2004/12/31', 
   '', '2016-04-12 15:12:37', 1000003067, 'ensemble=5034')' errorcode=0 
   message='ERROR: deadlock detected DETAIL: Process 23213 waits for 
   RowShareLock on relation 94001 of database 25609; blocked by process 
   26719. Process 26719 waits for AccessExclusiveLock on relation 94165 of 
   database 25609; blocked by process 23213. HINT: See server log for query 
   details. ' 
> proc.time()
    user   system  elapsed 
 134.747   14.590 1128.842 

For 1,000:

Error in R_nc4_open: Too many open files
Error in nc_open(ncfile) : 
  Error in nc_open trying to open file /fs/data2/output//PEcAn_1000001458/out/1000439244/2004.nc
Calls: run.ensemble.analysis ... ensemble.ts -> read.ensemble.ts -> read.output -> nc_open
Error during wrapup: cannot open the connection
Execution halted
Error: failed to load sRGB colorspace file
Error during wrapup: cannot open the connection
Execution halted
sh: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: Error 24

and 200 and 500 produced a very similar error:

2016-04-13 12:30:57 INFO   [read.output] : 
   NPP Mean: 5780 Median: 5780 
Error in ensemble.ts[[j]][as.numeric(row), ] <- newrun[[j]] : 
  number of items to replace is not a multiple of replacement length
Calls: run.ensemble.analysis -> ensemble.ts -> read.ensemble.ts
> proc.time()
    user   system  elapsed 
1400.175   66.883 1619.414 

The runs I tried with ensemble sizes 100, 150, 175, and 190 ran successfully. The run with ensemble size 1,000 is hung up, but the one with 10,000 exited with the error above. I think we can conclude that #772 is related to this error, and that finding which files are open and adding checks to close them is the next step. Does anyone have time to take the lead on this?

robkooper commented 8 years ago

I think the error with 10,000 is related to the database sync.

tonygardella commented 8 years ago

I think we can close this. Previously I had been testing on BU servers, but today I tried it on a PEcAn VM and did not run into any problems running ensemble sizes of 2-1000.

tonygardella commented 8 years ago

Yep, I'm going to close this. Just completed an ensemble of size 2,000 on a VM. It did take 1.5 hours, though.

istfer commented 8 years ago

I also got this, and I also think the offending line is in read.output.R, as @robkooper and @ashiklom found.

valid.nc <- tryl(nc_open(....)) opens a connection that is never closed by the later nc_close. I see many connections to the model output files (.../out/year.nc) when I list the open files.

@tonygardella how should we proceed?

istfer commented 8 years ago

Plus, this should be lapply( ... median ... ).

tonygardella commented 8 years ago

Is this getting in the way of your work for the meeting? If it is, I'll reopen this and we can fix it. Also, where did this happen? On test-pecan?

istfer commented 8 years ago

Yes it does, and it will for anyone who does really long runs. I've been working with the emulator for a while, so I wasn't hitting this; before that I was working with Ryan's code, which doesn't have this tryl line in read.output.R, so it wasn't an issue for me with the brute-force MCMC either. I made changes in my build, so it's working fine now. Probably no one at the meeting will try such long runs, but it's an easy fix, so I'd say we do it.

And yes, I'm getting this on test-pecan. I think the limit on open files differs from system to system, but eventually you would hit it with a long enough run.
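
For reference, one way to check that per-process limit from R (assuming a Linux/macOS shell; ulimit is a shell builtin, so it has to go through sh -c rather than system2("ulimit", ...)):

system2("sh", c("-c", "ulimit -n"))    # soft limit, commonly 1024
system2("sh", c("-c", "ulimit -Hn"))   # hard limit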

tonygardella commented 8 years ago

If the fix is easy and you have already implemented it, awesome! Make a pull request and I can pull that to a VM and test it.

Large ensembles are slow but work on the VM, so if this makes things faster, that would be great.

DongchenZ commented 1 year ago

I got this issue today when trying to generate the IC files. I believe it can be solved by adding ncdf4::nc_close(IC.nc) before the return within the pool_ic_netcdf2list function. Hope it helps.

mdietze commented 1 year ago

@DongchenZ not only should all calls to nc_open be paired with an nc_close, but it is often good to wrap the nc_close in an on.exit() call right after the nc_open. This ensures the file connection gets closed when the function exits, even if something goes wrong. The only exception is when a single function opens many files, in which case you should close each one before opening the next. If you've encountered an example of a missing nc_close, please submit a PR with a fix as soon as you are able.
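
A minimal sketch of both patterns (function and variable names are illustrative):

library(ncdf4)

# Single file: pair nc_open with on.exit(nc_close(...)) immediately, so the
# handle is released on normal return and on error alike.
read_pool <- function(ncfile, varname) {
  nc <- nc_open(ncfile)
  on.exit(nc_close(nc), add = TRUE)
  ncvar_get(nc, varname)
}

# Many files: close each one before opening the next rather than stacking
# on.exit handlers until the function returns.
read_pools <- function(ncfiles, varname) {
  lapply(ncfiles, function(f) {
    nc <- nc_open(f)
    tryCatch(ncvar_get(nc, varname), finally = nc_close(nc))
  })
}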