NLeSC / esibayes

Optimization and state estimation of dynamic models

error when (de)serializing using /dev/shm #24

Open jspaaks opened 9 years ago

jspaaks commented 9 years ago

Perhaps related to the size of the ramdisk versus that of the file in RAM:

jspaaks@login2:~/mmsoda-build$ ssh r41n4
jspaaks@r41n4:~$ ls -l 
total 5212292
drwxr-xr-x 8 jspaaks jspaaks       4096 Jun 23 13:47 esibayes
drwxr-xr-x 5 jspaaks jspaaks       4096 Jul  2 16:05 mmsoda-build
jspaaks@r41n4:~$ ls /dev/shm
mmpi.BMdbVT
jspaaks@r41n4:~$ ls -l /dev/shm/mmpi.BMdbVT 
-rw-rw-r-- 1 jspaaks jspaaks 412076224 Jul  2 16:06 /dev/shm/mmpi.BMdbVT
jspaaks@r41n4:~$ Connection to r41n4 closed by remote host.
Connection to r41n4 closed.

That is 412 MB (about 393 MiB) for what is supposed to be a small file.
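To compare the size of a file like this against the space left on the ramdisk, something along these lines could be run on the node. It is only a sketch: `ramdisk_headroom` and `report` are hypothetical helper names that are not part of esibayes, and it assumes Python is available on the compute node.

```python
import os
import shutil


def ramdisk_headroom(ramdisk_dir, needed_bytes):
    """Return (free_bytes, fits) for a planned write of needed_bytes."""
    usage = shutil.disk_usage(ramdisk_dir)  # named tuple: total, used, free
    return usage.free, needed_bytes <= usage.free


def report(ramdisk_dir="/dev/shm"):
    """Print the ramdisk capacity and the size of each regular file on it."""
    usage = shutil.disk_usage(ramdisk_dir)
    print(f"{ramdisk_dir}: total={usage.total / 1e6:.0f} MB, "
          f"free={usage.free / 1e6:.0f} MB")
    for name in sorted(os.listdir(ramdisk_dir)):
        path = os.path.join(ramdisk_dir, name)
        try:
            if os.path.isfile(path):
                print(f"  {name}: {os.path.getsize(path) / 1e6:.1f} MB")
        except OSError:
            continue  # the file disappeared while we were listing
```

Calling `report()` on the node while the run is active would show whether the `mmpi.*` file is approaching the free space reported for `/dev/shm`.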

Here is the error that follows:


Error using bcastvar
Cannot write to temporary file! (msg id = 4)

Error in runmpirankOther (line 31)

Error in matlabmain (line 80)

Note that this was for an mmsoda reset run with these settings:

>> conf = load('results/conf.mat')      

conf = 

                     modeStr: 'reset'
                   modelName: 'matlabswms'
                  objCallStr: 'calcLikelihood'
                    parNames: {'etai'  'kd'  'etae'  'smax'  'dmax'}
                 parNamesTex: {'{\eta_{I}}_{c}'  '{k_{D}}_{c}'  '{\eta_{E}}_{c}'  'S_{max}'  'd_{max}'}
             parSpaceLoBound: [0 2.0833 0 1.0000e-05 0]
             parSpaceHiBound: [1 20.8333 1 0.0100 6]
                  priorTimes: [1x721 double]
                      nCompl: 3
                    nSamples: 60
                      doPlot: 1
              sampleDrawMode: 'stratified'
            startFromUniform: 1
           visualizationCall: 'mmsodaVisualization3'
                    walltime: 0.0021
              archiveResults: 0
    parameterSamplesAreGiven: 1
                stateNamesKF: {1x756 cell}
                   namesNOKF: {1x76 cell}
                    obsState: [756x721 double]
                initValuesKF: [756x1 double]
                initMethodKF: 'reference'
              initValuesNOKF: [76x1 double]
              initMethodNOKF: 'reference'

Another possibility is that the MATLAB VMs are becoming too big (because of the many states), so there is no space left to create the (de)serialize file. Remember that you have 16+1 MATLAB instances and 32 GB in total per node. I tested the same run as above, but with parameterSamplesAreGiven = false (I don't know why it was set to 1) and using 24 hours of data. That ran fine; memory use was about 4.6 GB of the 32 GB.
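For a sense of scale, the arithmetic behind this can be sketched as follows. The even split of node memory over the instances is an assumption made for illustration (the instances need not use equal amounts), and the array size counts only the raw 8-byte doubles of obsState, ignoring any per-variable overhead. Files in /dev/shm are held in memory as well, so the temporary file competes with the MATLAB instances for the same 32 GB.

```python
BYTES_PER_DOUBLE = 8  # size of an IEEE 754 double


def double_array_bytes(rows, cols):
    """Raw storage for a rows x cols array of doubles (no header overhead)."""
    return rows * cols * BYTES_PER_DOUBLE


def even_share_gb(node_gb, n_instances):
    """Memory per instance if the node's RAM were split evenly (an assumption)."""
    return node_gb / n_instances


obs_state_bytes = double_array_bytes(756, 721)  # obsState: [756x721 double]
share = even_share_gb(32, 16 + 1)               # 16+1 instances, 32 GB per node

print(f"obsState: {obs_state_bytes} bytes ({obs_state_bytes / 1e6:.2f} MB)")
print(f"even share per instance: {share:.2f} GB")
```

This prints 4360608 bytes (4.36 MB) for obsState and an even share of 1.88 GB per instance.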

Another quick test with 10 days of data showed a rapid rise in memory use. The highest I saw in htop was 24 GB, but afterward a couple of processes no longer seemed to be doing anything, so they probably crashed due to out-of-memory errors.
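Instead of watching htop by hand, the resident memory of each instance could be logged with a small script such as the one below. This is a Linux-only sketch that reads `/proc/<pid>/status`; the helper names are hypothetical, and the process name to search for (for example `"MATLAB"`) is an assumption that should be checked against what htop shows on the node.

```python
import os


def parse_field(status_text, key):
    """Return the value of a 'Key: value' line from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith(key + ":"):
            return line.split(":", 1)[1].strip()
    return None


def rss_kib(status_text):
    """Resident set size in KiB, or None if the process has no VmRSS line."""
    value = parse_field(status_text, "VmRSS")
    return int(value.split()[0]) if value else None


def rss_by_name(name_fragment, proc_root="/proc"):
    """Map pid -> resident memory (KiB) for processes whose name matches."""
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # the process exited while we were scanning
        name = parse_field(text, "Name") or ""
        rss = rss_kib(text)
        if name_fragment in name and rss is not None:
            result[int(entry)] = rss
    return result
```

Calling `rss_by_name("MATLAB")` at regular intervals and comparing the snapshots would show which process IDs disappear or stop growing as the total approaches the node's memory.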