serbinsh opened this issue 8 years ago
More errors like this. This time with a DALEC run:
2016-04-02 12:19:04 INFO [read.output] :
Reading output for Years: 2000 - 2002 in directory:
/data/Model_Output/pecan.output/PEcAn_2000000138/out/2000010646
including files 2000.nc 2001.nc 2002.nc
Error in R_nc4_open: Too many open files
Error in nc_open(ncfile) :
Error in nc_open trying to open file /data/Model_Output/pecan.output/PEcAn_2000000138/out/2000010646/2001.nc
Calls: run.ensemble.analysis ... ensemble.ts -> read.ensemble.ts -> read.output -> nc_open
Error during wrapup: cannot open the connection
Execution halted
Error: failed to load sRGB colorspace file
Error during wrapup: cannot open the connection
Execution halted
sh: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: Error 24
Here is the run setup:
pecan.xml
<?xml version="1.0" encoding="UTF-8"?>
<pecan>
<info>
<notes>test run DALEC ponderosa pine</notes>
<userid>2000000002</userid>
<username>Shawn P. Serbin</username>
<date>2016/04/02 11:44:53 -0400</date>
</info>
<outdir>/data/Model_Output/pecan.output/PEcAn_2000000138</outdir>
<database>
<bety>
<user>bety</user>
<password>bety</password>
<host>localhost</host>
<dbname>bety</dbname>
<driver>PostgreSQL</driver>
<write>TRUE</write>
</bety>
</database>
<pfts>
<pft>
<name>dalec.pondersoa_pine</name>
<constants>
<num>1</num>
</constants>
<outdir>/data/Model_Output/pecan.output/PEcAn_2000000138/pft/dalec.pondersoa_pine</outdir>
</pft>
<pft>
<name>dalec.pondersoa_pine</name>
<constants>
<num>2</num>
</constants>
<outdir>/data/Model_Output/pecan.output/PEcAn_2000000138/pft/dalec.pondersoa_pine</outdir>
</pft>
</pfts>
<meta.analysis>
<iter>3000</iter>
<random.effects>FALSE</random.effects>
<threshold>1.2</threshold>
<update>AUTO</update>
</meta.analysis>
<ensemble>
<size>200</size>
<variable>NPP</variable>
<start.year>2000</start.year>
<end.year>2002</end.year>
</ensemble>
<sensitivity.analysis>
<quantiles>
<sigma>-3</sigma>
<sigma>-2</sigma>
<sigma>-1</sigma>
<sigma>0</sigma>
<sigma>1</sigma>
<sigma>2</sigma>
<sigma>3</sigma>
</quantiles>
<variable>NPP</variable>
<start.year>2000</start.year>
<end.year>2002</end.year>
</sensitivity.analysis>
<model>
<id>1000000002</id>
<type>DALEC</type>
<revision>Quaife Evergreen RSE08</revision>
<delete.raw>FALSE</delete.raw>
<binary>/data/software/dalec_EnKF_pub//dalec_seqMH</binary>
</model>
<workflow>
<id>2000000138</id>
</workflow>
<run>
<site>
<id>766</id>
<met.start>1999-01-01 00:00:00</met.start>
<met.end>2003-12-31 00:00:00</met.end>
<name>Metolius-first young aged pine (US-Me5)</name>
<lat>44.4372</lat>
<lon>-121.567</lon>
</site>
<inputs>
<met>
<id>2000000004</id>
<path>/data/Model_Data/dalec_EnKF_pub/input_data/oregon/dalec_drivers.OREGON.no_obs.dat</path>
</met>
</inputs>
<start.date>2000/01/01</start.date>
<end.date>2002/12/31</end.date>
<dbfiles>/data/Model_Output/pecan.output/dbfiles</dbfiles>
<host>
<name>localhost</name>
<qsub>qsub -V -N @NAME@ -o @STDOUT@ -e @STDERR@ -S /bin/bash</qsub>
<qsub.jobid>Your job ([0-9]+) .*</qsub.jobid>
<qstat>qstat @JOBID@ || echo DONE</qstat>
<rundir>/data/Model_Output/pecan.output/PEcAn_2000000138/run</rundir>
<outdir>/data/Model_Output/pecan.output/PEcAn_2000000138/out</outdir>
</host>
</run>
<email>
<to>sserbin@bnl.gov</to>
<url>https://modex.bnl.gov:443/pecan/08-finished.php?workflowid=2000000138</url>
</email>
<rundir>/data/Model_Output/pecan.output/PEcAn_2000000138/run</rundir>
<modeloutdir>/data/Model_Output/pecan.output/PEcAn_2000000138/out</modeloutdir>
</pecan>
Need to check the get.model.output code to make sure files are being closed properly.
I think I might be having a related problem. Reading ensemble outputs is snappy at first but then becomes progressively slower as more files are loaded. I think there might be a memory leak in the read.output function. Will try to investigate.
I think I may have found the issue. The offending lines are 126-128 in read.output. I think the issue is that each nc file is opened twice by nc_open -- once in the tryl function and again on the next line. Only the latter is closed, so as we loop through a bunch of files we leak open file handles. I'll open a pull request to try to fix this.
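For reference, a minimal sketch of the pattern I mean (paraphrased, not the exact read.output source; tryl is PEcAn's wrapper that returns TRUE/FALSE for whether an expression succeeded, and outdir is a stand-in for the real output directory):
library(ncdf4)
for (ncfile in list.files(outdir, pattern = "\\.nc$", full.names = TRUE)) {
  valid.nc <- tryl(nc_open(ncfile))  # first open: the handle created inside
                                     # tryl is discarded but never closed
  nc <- nc_open(ncfile)              # second open of the same file
  # ... extract variables ...
  nc_close(nc)                       # closes only the second handle
}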
Never mind -- I tried that fix and it's still just as slow.
I wanted to run Rprofmem() on it to see if anything was using up a lot of memory, but apparently R can't do it on pecan2 -- "Memory profiling not available on this system". I might try it on my local machine.
Is this something that needs root permissions? You can send me what you were going to run and I'll try it on test-pecan.
@bcow You can give it a whirl, but I don't think so. It's something you run from within R, so it shouldn't need root permissions. From the documentation, it seems like you'd need to recompile R with specific settings enabled, which we definitely don't want to try on a production machine.
Here are the relevant man pages:
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rprof.html https://stat.ethz.ch/R-manual/R-devel/library/utils/html/Rprofmem.html
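Basic usage is just the following (a sketch; the output filename is arbitrary, and it only works if R was compiled with --enable-memory-profiling):
Rprofmem("memprof.out")          # start recording memory allocations to a file
# ... run the code under test, e.g. a loop of read.output() calls ...
Rprofmem(NULL)                   # stop recording
head(readLines("memprof.out"))   # inspect the first few allocation records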
@ashiklom @bcow I think we can start by confirming that recent changes to PEcAn have not accidentally caused this issue by removing or modifying the calls that close netCDF files. This has happened before: we accidentally changed the code / logic / flow and didn't realize the call to close each file after processing / extracting output was no longer being made, leaving files open. Not a memory leak, just a coding mistake.
Actually, after looking back over it, I bet the issue is in the read-model-output step.
Is there a good way to view all open netCDF files? I tried adding a showConnections() call inside read.output but it said there were never any files open, at least not after the nc_close statement.
It may be in read.ensemble.output()
If you don't locate it, I will take another look in a little bit. It may not be in the read-output / convert-to-standardized-netCDF step; it could be a little further downstream, when the files are opened and used...
All read.ensemble.output does is loop over ensemble IDs and call read.output -- it doesn't actually open any files. There are issues in both functions with inefficient dynamic memory allocation, but they shouldn't cause a "too many open files" error.
Can you try running lsof against the process while it is running? You might be able to use lsof -r 2 -c R to see all files opened by R (hit Ctrl-C when done). See what new files are opened.
One option is to add some code that does:
system2("ls", c("-l", "/proc/self/fd"))
Then look at the output; this will list all open files. Not sure if there is an easier way. For example, I did:
> valid.nc <- tryl(nc_open("2004.nc"))
> nc <- nc_open("2004.nc")
> system2("ls", c("-l", "/proc/self/fd"))
total 0
lrwx------ 1 kooper users 64 Apr 4 09:22 0 -> /dev/pts/0
lrwx------ 1 kooper users 64 Apr 4 09:22 1 -> /dev/pts/0
lrwx------ 1 kooper users 64 Apr 4 09:22 2 -> /dev/pts/0
lr-x------ 1 kooper users 64 Apr 4 09:22 3 -> /home/kooper/output/PEcAn_99000000011/out/99000000008/2004.nc
lr-x------ 1 kooper users 64 Apr 4 09:22 4 -> /home/kooper/output/PEcAn_99000000011/out/99000000008/2004.nc
lr-x------ 1 kooper users 64 Apr 4 09:22 5 -> /proc/27957/fd
As you can see, the file 2004.nc is open twice, so what Alexey found is definitely a bug.
Actually, I don't think @bcow and I are having any issues -- I thought the process was slowing down because it made RStudio unresponsive, but actually the web-based RStudio was just bogged down by all of the logger output. The underlying process ran at a consistent speed throughout. But it's probably still worth trying Rob's suggestions.
@serbinsh @ashiklom @robkooper @bcow It seems Rob has confirmed that too many files are being left open. Where/how do we fix this?
@tonygardella We need to figure out which function is opening the files. I think we should intersperse calls to whatever commands @robkooper is using to check open files throughout the functions, to home in on where files are being opened and not closed -- for example, something like the helper sketched below. Or, rather than coding the checks in, you can use R's debug function to step through function calls and manually check which files are open at each step.
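Something along these lines (a sketch; open_nc_files is a made-up name, and the /proc trick is Linux-only):
# Hypothetical helper: count the netCDF files this R process currently
# has open, by resolving the symlinks in /proc/self/fd
open_nc_files <- function(label = "") {
  targets <- Sys.readlink(list.files("/proc/self/fd", full.names = TRUE))
  message(label, ": ", sum(grepl("\\.nc$", targets)), " netCDF files open")
}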
p.s. RStudio provides a nice interface to the debugging environment.
p.p.s. debug turns on debugging for the specified function; debugonce is handy because it turns debugging off automatically when you end the debugging session.
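A sketch of that workflow (assumes the PEcAn packages are loaded so read.output is visible):
# One-shot debugging of read.output; debugging turns itself off afterwards
debugonce(read.output)
# ... now call read.output() with the usual arguments; at each Browse[]>
# prompt you can check what the process has open, e.g.:
# Browse[2]> system2("ls", c("-l", "/proc/self/fd"))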
I ran ensembles of various sizes and got different errors depending on the size. There does seem to be a threshold of ensemble size between 190 and 200 above which a run will fail at the ensemble step.
For 10,000:
Error in postgresqlExecStatement(conn, statement, ...) :
RS-DBI driver: (could not Retrieve the result : ERROR: deadlock detected
DETAIL: Process 23213 waits for RowShareLock on relation 94001 of database 25609; blocked by process 26719.
Process 26719 waits for AccessExclusiveLock on relation 94165 of database 25609; blocked by process 23213.
HINT: See server log for query details.
)
Warning in postgresqlQuickSQL(conn, statement, ...) :
Could not create executeINSERT INTO runs (model_id, site_id, start_time, finish_time, outdir, created_at, ensemble_id, parameter_list) values ('1000000002', '500', '2004/01/01', '2004/12/31', '', '2016-04-12 15:12:37', 1000003067, 'ensemble=5034')
2016-04-12 15:12:38 SEVERE [db.query] :
Error executing db query 'INSERT INTO runs (model_id, site_id,
start_time, finish_time, outdir, created_at, ensemble_id,
parameter_list) values ('1000000002', '500', '2004/01/01', '2004/12/31',
'', '2016-04-12 15:12:37', 1000003067, 'ensemble=5034')' errorcode=0
message='ERROR: deadlock detected DETAIL: Process 23213 waits for
RowShareLock on relation 94001 of database 25609; blocked by process
26719. Process 26719 waits for AccessExclusiveLock on relation 94165 of
database 25609; blocked by process 23213. HINT: See server log for query
details. '
> proc.time()
user system elapsed
134.747 14.590 1128.842
At 1000:
Error in R_nc4_open: Too many open files
Error in nc_open(ncfile) :
Error in nc_open trying to open file /fs/data2/output//PEcAn_1000001458/out/1000439244/2004.nc
Calls: run.ensemble.analysis ... ensemble.ts -> read.ensemble.ts -> read.output -> nc_open
Error during wrapup: cannot open the connection
Execution halted
Error: failed to load sRGB colorspace file
Error during wrapup: cannot open the connection
Execution halted
sh: error while loading shared libraries: libtinfo.so.5: cannot open shared object file: Error 24
For 200 and 500, I got a very similar error:
2016-04-13 12:30:57 INFO [read.output] :
NPP Mean: 5780 Median: 5780
Error in ensemble.ts[[j]][as.numeric(row), ] <- newrun[[j]] :
number of items to replace is not a multiple of replacement length
Calls: run.ensemble.analysis -> ensemble.ts -> read.ensemble.ts
> proc.time()
user system elapsed
1400.175 66.883 1619.414
The runs I tried with ensemble sizes 100, 150, 175, and 190 completed successfully. The run with ensemble size 1000 is hung up, but the one with 10,000 exited with the aforementioned error. I think we can conclude that #772 is related to this error, and that the next step is to find which files are open and add checks to close them. Does anyone have time to take the lead on this?
I think the error with 10,000 is related to the database sync.
I think we can close this. Previously I had been testing on BU servers, but today I tried it on a PEcAn VM and did not run into any problems with ensemble sizes of 2-1000.
Yep, I'm going to close this. Just completed an ensemble of size 2,000 on a VM. It did take 1.5 hours, though.
I also got this, and I also think the offending line is in read.output.R, as @robkooper and @ashiklom found:
valid.nc <- tryl(nc_open(....))
This opens a connection that cannot be closed by the later nc_close. I see many connections to model output ".../out/year.nc" when I list the open files.
@tonygardella how should we proceed?
Is this getting in the way of your work for the meeting? If it is, I'll reopen this and we can fix it. Also, where did this happen? On test-pecan?
Yes it does, and it will for anyone who does really long runs. I've been working with the emulator for a while, so I wasn't getting this, and before that I was working with Ryan's code, which doesn't have this tryl line in read.output.R, so it wasn't an issue for me with the brute-force MCMC either. I made changes in my build so it's working fine now. Probably no one at the meeting will try such long runs, but it's an easy fix (sketched below), so I'd say we do it.
And yes, I'm getting this on test-pecan. I think the limit on open files differs from system to system, but with long enough runs you would eventually hit it anywhere.
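For reference, a sketch of the kind of fix I mean (assuming the tryl call exists only to skip unreadable files): open each file once and reuse the handle, instead of opening it a second time.
library(ncdf4)
# inside the existing per-file loop, instead of tryl + a second nc_open:
nc <- tryCatch(nc_open(ncfile), error = function(e) NULL)  # single open
if (is.null(nc)) {
  message("Could not open ", ncfile, " -- skipping")  # or a logger call
} else {
  # ... extract variables as before ...
  nc_close(nc)  # the one and only handle gets closed
}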
If the fix is easy and you have already implemented it, awesome! Make a pull request and I can pull that to a VM and test it.
Large ensembles are slow but work on the VM so if this makes things faster that would be great.
I got this issue today when trying to generate the IC files. And I believe this issue can be solved by adding the ncdf4::nc_close(IC.nc) before the return within the pool_ic_netcdf2list function. Hope it helps.
@DongchenZ Not only should every call to nc_open be paired with an nc_close, but it is often good to wrap the nc_close in an on.exit() call right after the nc_open (see the sketch below). This ensures the file connection gets closed when the function exits, even if something goes wrong. The only exception would be a single function that opens many files, in which case you should close each file before opening the next. If you've found an example of a missing nc_close, please submit a PR with a fix as soon as you are able.
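A minimal sketch of that pattern (read_single_nc and varname are made up for illustration):
library(ncdf4)
read_single_nc <- function(ncfile, varname) {
  nc <- nc_open(ncfile)
  on.exit(nc_close(nc), add = TRUE)  # runs on normal exit *and* on error
  ncvar_get(nc, varname)
}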
I just got this error running PEcAn at the command line with SIPNET. We used to get errors like this a lot in the past. Not sure if this is a gremlin in the machine or if this issue is resurfacing. Documenting it here for now until I can be sure.