HISKP-LQCD / sLapH-projection-NG


Segmentation faults during numeric projection #28

Closed martin-ueding closed 4 years ago

martin-ueding commented 4 years ago

I am currently running the numeric projections on the workstation cluster. Some of the runs just crash. Take this one, configuration 1500 with the 3pi I3 system. It starts off nicely on one of the nodes:

+ hostname
dryades
+ date -Iseconds
2019-11-14T16:25:53+01:00
+ /usr/bin/time /home/ueding/sLapH-projection-NG/numeric_projection/driver.R 0 0 0 A1g 1500
[1] "/home/ueding/projection_workdir/cA2.09.48/3pi_I3"

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[1] "0"    "0"    "0"    "A1g"  "1500"
0.82user 0.05system 0:00.90elapsed 97%CPU (0avgtext+0avgdata 73400maxresident)k
0inputs+24outputs (0major+28911minor)pagefaults 0swaps

After that, the underlying shell script calls the driver script many more times for the different irreps and moving frames, all successfully. These are the last successful calls:

+ /usr/bin/time /home/ueding/sLapH-projection-NG/numeric_projection/driver.R -1 -1 -1 E 1500
+ /usr/bin/time /home/ueding/sLapH-projection-NG/numeric_projection/driver.R -1 -1 1 E 1500
+ /usr/bin/time /home/ueding/sLapH-projection-NG/numeric_projection/driver.R -1 1 -1 E 1500
+ /usr/bin/time /home/ueding/sLapH-projection-NG/numeric_projection/driver.R -1 1 1 E 1500

But then it crashes with this:

+ /usr/bin/time /home/ueding/sLapH-projection-NG/numeric_projection/driver.R 1 -1 -1 E 1500
[1] "/home/ueding/projection_workdir/cA2.09.48/3pi_I3"

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

[1] "1"    "-1"   "-1"   "E"    "1500"

 *** caught segfault ***
address 0x22c9, cause 'memory not mapped'

Traceback:
 1: H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,     compoundAsDataFrame = compoundAsDataFrame, drop = drop, ...)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch({    obj <- H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile,         h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame,         drop = drop, ...)}, error = function(e) {    err <- h5checkFilters(h5dataset)    if (nchar(err) > 0)         stop(err, call. = FALSE)    else stop(e)})
 6: h5readDataset(h5dataset, index = index, start = start, stride = stride,     block = block, count = count, compoundAsDataFrame = compoundAsDataFrame,     drop = drop, ...)
 7: rhdf5::h5read(filename, datasetname)
 8: FUN(X[[i]], ...)
 9: lapply(needed_names, load_dataset)
An irrecoverable exception occurred. R is aborting now ...
Command terminated by signal 11
56.93user 36.80system 1:34.19elapsed 99%CPU (0avgtext+0avgdata 231112maxresident)k
0inputs+24outputs (0major+2868449minor)pagefaults 0swaps

I would think that this is an error outside of my control. The process only uses 230 MB of memory, so that alone cannot be the issue on that machine: it has 16 GB of memory, and I explicitly request 500 MB in the SLURM job script, so the job would not have been scheduled on the node if it were already that full.

Configuration 340 ran on vanguard and crashed with the same error. I resubmitted the job and it ran through on vengeance. So it seems that this is just an annoying fluke and that I need to clean up these failures and resubmit.

martin-ueding commented 4 years ago

I have just written a script that restarts all the broken runs. There are a lot of them, which is strange.

Up to now, these are the numbers of crashed jobs per host:

Host        Broken  Total
apollo           4     14
blackwidow       4     15
dagobert         4      7
deino           10     24
delphyne        10     31
dryades          8     26
echidna          8     26
echo            11     28
gustav           5     14
herkules        14     29
mystique         5     14
sose             5     24
thymbris        21     28
vanguard        10     46
vengeance       14     46
vigilant         8     39

Some have failed due to node failures, so that is a different problem. But I really wonder what the problem with HDF5 is.

kostrzewa commented 4 years ago

As far as I can see, the amount of truly free memory (counting cache and buffers as occupied) is quite low on the machines you are using. Perhaps you really are hitting memory limits? AFAIK R is quite conservative in that it will not allocate if you are close to the limit.

martin-ueding commented 4 years ago

Yeah, but caches get freed automatically, so that should not count. On my laptop the cache is usually completely full after a while, and that has never been a problem.

kostrzewa commented 4 years ago

Is the farm configured with cgroups?

> Yeah, but caches get freed automatically, so that should not count.

I wouldn't be surprised if R actually checks free memory, completely ignoring caches and buffers.

martin-ueding commented 4 years ago

Currently I do not have cgroups enabled; Debian 10 came with a new version of SLURM and I have not set it up properly again. But it is on my list.

martin-ueding commented 4 years ago

It seems to be random, likely the file system. I don't care much: I have a script that restarts the failed runs, and one just has to repeat this until everything is done. Or run it on a system with a proper file system.
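
For the record, a minimal sketch of what such a resubmission loop could look like in R. The job list file, its columns, and the job script name here are made up for illustration and are not my actual setup:

# Hypothetical resubmission helper: resubmit every run whose output is missing.
workdir <- "/home/ueding/projection_workdir/cA2.09.48/3pi_I3"
jobs <- read.table(file.path(workdir, "jobs.tsv"), header = TRUE,
                   stringsAsFactors = FALSE)  # hypothetical job list

for (i in seq_len(nrow(jobs))) {
    output_file <- file.path(workdir, jobs$output[i])
    if (!file.exists(output_file)) {
        # Resubmit via SLURM; "job_script.sh" is a made-up name.
        system2("sbatch", c("job_script.sh", jobs$args[i]))
    }
}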

martin-ueding commented 4 years ago

Memory usage isn't the problem. I have run this on the QBIG front-end, where we have plenty of memory, and it still crashes. The traceback is slightly different in R 3.6 (compared to R 3.5 on the workstations).

 *** caught segfault ***
address 0x2a9, cause 'memory not mapped'

Traceback:
 1: H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,     compoundAsDataFrame = compoundAsDataFrame, drop = drop, ...)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
 6: try({    obj <- H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile,         h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame,         drop = drop, ...)})
 7: h5readDataset(h5dataset, index = index, start = start, stride = stride,     block = block, count = count, compoundAsDataFrame = compoundAsDataFrame,     drop = drop, ...)
 8: rhdf5::h5read(filename, datasetname)
 9: FUN(X[[i]], ...)
10: lapply(needed_names, load_dataset)
An irrecoverable exception occurred. R is aborting now ...

Interestingly, it seems to be reproducible this time. When we plot the configuration numbers and job numbers for the segmentation faults, we get this:

[Figure: segmentation faults by configuration number and job number]

The broken configurations are disjoint, which points to the correlator files; the prescriptions are the same in both cases. However, the 2pi subsystem has worked fine with no crashes there, so it would seem that the segmentation fault happens in one of the C6 diagram files.

I've looked through the correlator HDF5 files, and per diagram type all files have the same number of bytes, so at least the sizes are identical across configurations. For instance, configuration 1040 always fails and has these files:

$ ls -l *_cnfg1040.h5
-rw-r--r-- 1 ueding theorie     65024 Nov 22 14:43 C2c_cnfg1040.h5
-rw-r--r-- 1 ueding theorie   7661824 Nov 22 14:43 C4cC_cnfg1040.h5
-rw-r--r-- 1 ueding theorie   7607736 Nov 22 14:43 C4cD_cnfg1040.h5
-rw-r--r-- 1 ueding theorie 139084720 Nov 22 14:44 C6cC_cnfg1040.h5
-rw-r--r-- 1 ueding theorie 200961968 Nov 22 14:47 C6cCD_cnfg1040.h5
-rw-r--r-- 1 ueding theorie  69653152 Nov 22 14:47 C6cD_cnfg1040.h5

The next configuration worked fine, and there we have exactly the same sizes:

$ ls -l *_cnfg1044.h5
-rw-r--r-- 1 ueding theorie     65024 Nov 22 14:38 C2c_cnfg1044.h5
-rw-r--r-- 1 ueding theorie   7661824 Nov 22 14:38 C4cC_cnfg1044.h5
-rw-r--r-- 1 ueding theorie   7607736 Nov 22 14:38 C4cD_cnfg1044.h5
-rw-r--r-- 1 ueding theorie 139084720 Nov 22 14:39 C6cC_cnfg1044.h5
-rw-r--r-- 1 ueding theorie 200961968 Nov 22 14:43 C6cCD_cnfg1044.h5
-rw-r--r-- 1 ueding theorie  69653152 Nov 22 14:43 C6cD_cnfg1044.h5

I do not see what differentiates the two configurations.
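
Beyond the byte counts, one could also compare the dataset layout of a failing and a working file with rhdf5::h5ls. A sketch (I have only compared sizes so far):

library(rhdf5)

# Compare the dataset layout of a failing (1040) and a working (1044) file.
ls_bad <- h5ls("C6cC_cnfg1040.h5")
ls_good <- h5ls("C6cC_cnfg1044.h5")

# TRUE if both files contain the same datasets with the same types and shapes.
identical(ls_bad[, c("group", "name", "dclass", "dim")],
          ls_good[, c("group", "name", "dclass", "dim")])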

martin-ueding commented 4 years ago

I managed to run momentum (-1, -1, 1) in the A2 irrep of configuration 1468 in RStudio on qbig. This did not work in the terminal before. Now I am processing the next correlator matrix on the same configuration. Oddly enough, other frames with the same total momentum and irrep have worked.

It really does feel like some fluke.

martin-ueding commented 4 years ago

I just let everything run through once more. The job scripts skip the computation when the output has already been created, so that is rather quick. Strangely enough, the failure pattern persists, except for the one case where I managed to make it succeed in RStudio.

[Figure: segmentation faults by configuration number and job number]

I absolutely don't get what makes these configurations and frames special. If it were a faulty HDF5 file, it should never succeed, no matter how many times I retry. If a prescription referred to an invalid correlator, it would fail for every configuration.

I have just discussed this with Bartek; apparently a couple of segfault issues have been reported against rhdf5 lately. Also, I do not create a file handle, but just pass the file name and the dataset name to what appears to be a convenience function:

dataset <- rhdf5::h5read(filename, datasetname)

I will try an explicit open-read-close pattern in order to prevent the file from being opened multiple times within the same R session.

martin-ueding commented 4 years ago

I now use H5Fopen, h5read and then H5close. The jobs are running, and I have not had any failures so far. It seems that file handles were not released properly before.
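
For reference, a minimal sketch of the pattern. load_dataset is the function name from the traceback above, but its exact signature here is my guess:

library(rhdf5)

load_dataset <- function(filename, datasetname) {
    # Open the file explicitly and read-only instead of letting h5read
    # open it implicitly.
    file_handle <- H5Fopen(filename, flags = "H5F_ACC_RDONLY")
    dataset <- h5read(file_handle, datasetname)
    # H5close() flushes and closes all open HDF5 identifiers in this
    # session, including file_handle.
    H5close()
    dataset
}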

kostrzewa commented 4 years ago

Happy to hear that it seems to be working!