HISKP-LQCD / sLapH-projection-NG


Problem with cA2.60.32 on P=-10-1 irrep=B2 #30

Open martin-ueding opened 4 years ago

martin-ueding commented 4 years ago

I am re-running the projections on cA2.60.32, and they work just fine for almost all irreps in every configuration. There is just one exception: P = (-1, 0, -1) in the B₂ irrep, which fails for every configuration. The output is always the following:

Opening HDF5 files …
[1] "correlators/C2c_cnfg0000.h5"
[1] "correlators/C4cC_cnfg0000.h5"
[1] "correlators/C4cD_cnfg0000.h5"
[1] "correlators/C6cC_cnfg0000.h5"
[1] "correlators/C6cCD_cnfg0000.h5"
[1] "correlators/C6cD_cnfg0000.h5"
  Done
Loading correlators from HDF5 files …

 *** caught segfault ***
address 0x22039, cause 'memory not mapped'

Traceback:
 1: H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile, h5spaceMem = h5spaceMem,     compoundAsDataFrame = compoundAsDataFrame, drop = drop, ...)
 2: doTryCatch(return(expr), name, parentenv, handler)
 3: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 4: tryCatchList(expr, classes, parentenv, handlers)
 5: tryCatch(expr, error = function(e) {    call <- conditionCall(e)    if (!is.null(call)) {        if (identical(call[[1L]], quote(doTryCatch)))             call <- sys.call(-4L)        dcall <- deparse(call)[1L]        prefix <- paste("Error in", dcall, ": ")        LONG <- 75L        sm <- strsplit(conditionMessage(e), "\n")[[1L]]        w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")        if (is.na(w))             w <- 14L + nchar(dcall, type = "b") + nchar(sm[1L],                 type = "b")        if (w > LONG)             prefix <- paste0(prefix, "\n  ")    }    else prefix <- "Error : "    msg <- paste0(prefix, conditionMessage(e), "\n")    .Internal(seterrmessage(msg[1L]))    if (!silent && isTRUE(getOption("show.error.messages"))) {        cat(msg, file = outFile)        .Internal(printDeferredWarnings())    }    invisible(structure(msg, class = "try-error", condition = e))})
 6: try({    obj <- H5Dread(h5dataset = h5dataset, h5spaceFile = h5spaceFile,         h5spaceMem = h5spaceMem, compoundAsDataFrame = compoundAsDataFrame,         drop = drop, ...)})
 7: h5readDataset(h5dataset, index = index, start = start, stride = stride,     block = block, count = count, compoundAsDataFrame = compoundAsDataFrame,     drop = drop, ...)
 8: h5read(file_handles[[diagram]], datasetname)
 9: FUN(X[[i]], ...)
10: lapply(needed_names, load_dataset)
11: numericprojection::numeric_projection(c(-1, 0, -1), "B2", 0)
An irrecoverable exception occurred. R is aborting now ...

I have tried restarting these jobs, but that did not help. We have had random segfaults before, but this one is consistent. It seems to have something to do with the actual files, and it happens on all of the nodes that I have tried.

The only difference in input is the prescription file, and that does not differ from those of the other ensembles. The ones related to it by a global rotation are just fine.

In the meantime I will just skip that B₂ irrep at P² = 2, but it feels very peculiar and I still have no idea what is happening there.
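
One way to narrow down a consistent crash like this is to read each dataset in its own child process, so that a segfault only kills the child and the driver can record which dataset triggered it. The following is a minimal sketch in Python and is not part of the project. `run_isolated`, `find_failing` and `rscript_read_command` are hypothetical helper names, and the sketch assumes that `Rscript` is on the `PATH` with `rhdf5` installed. The list of dataset names has to be supplied by the caller.

```python
import signal
import subprocess
import sys


def run_isolated(command):
    """Run a command in a child process.

    Returns (returncode, signal_name). signal_name is None unless the
    child was terminated by a signal (negative return code on POSIX).
    """
    result = subprocess.run(command, capture_output=True)
    if result.returncode < 0:
        return result.returncode, signal.Signals(-result.returncode).name
    return result.returncode, None


def find_failing(items, make_command):
    """Run make_command(item) for every item and collect the failures."""
    failures = {}
    for item in items:
        code, sig = run_isolated(make_command(item))
        if code != 0:
            failures[item] = sig if sig is not None else "exit status %d" % code
    return failures


def rscript_read_command(h5_file, dataset):
    """Build a command that reads one dataset in a fresh R process.

    Assumption: Rscript and rhdf5 are available, and neither argument
    contains quotes or backslashes.
    """
    expr = 'invisible(rhdf5::h5read("%s", "%s"))' % (h5_file, dataset)
    return ["Rscript", "-e", expr]


if __name__ == "__main__":
    # Hypothetical usage: file name and dataset names come from the command line.
    h5_file, datasets = sys.argv[1], sys.argv[2:]
    failing = find_failing(datasets, lambda name: rscript_read_command(h5_file, name))
    for name, reason in failing.items():
        print(name, reason)
```

A negative return code from `subprocess.run` means that the child was terminated by a signal, so a segfault should show up as `SIGSEGV`. Any other non-zero status is reported as a plain failure.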

martin-ueding commented 4 years ago

For some reason this went through for two configurations this time:

$ ls resolved_-10-1_B2_*
resolved_-10-1_B2_2496.js  resolved_-10-1_B2_5328.js
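
To see at a glance which configurations went through, the file names from such a listing can be tallied per momentum and irrep. This is a small sketch in Python that assumes the naming scheme `resolved_<momentum>_<irrep>_<config>.js` seen in the listing above. The helper names and the list of expected configurations are hypothetical.

```python
import re

# Assumed naming scheme, inferred from the two file names in the listing.
RESOLVED_PATTERN = re.compile(
    r"^resolved_(?P<momentum>.+)_(?P<irrep>[^_]+)_(?P<config>\d+)\.js$"
)


def completed_configs(filenames, momentum, irrep):
    """Return the sorted configuration numbers that have a resolved file."""
    done = set()
    for name in filenames:
        match = RESOLVED_PATTERN.match(name)
        if match and match["momentum"] == momentum and match["irrep"] == irrep:
            done.add(int(match["config"]))
    return sorted(done)


def missing_configs(filenames, momentum, irrep, expected):
    """Return the expected configuration numbers without a resolved file."""
    done = set(completed_configs(filenames, momentum, irrep))
    return sorted(set(expected) - done)


names = ["resolved_-10-1_B2_2496.js", "resolved_-10-1_B2_5328.js"]
print(completed_configs(names, "-10-1", "B2"))
```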

kostrzewa commented 4 years ago

This is very strange. My first instinct would be to guess that it's related to having too many HDF5 files open at the same time (I could imagine that these are internally opened using mmap), but this would suggest that things would also fail elsewhere.

kostrzewa commented 4 years ago

I guess in the original description you mean (-1, 0, -1) rather than (-1, 0, 1), correct?

martin-ueding commented 4 years ago

I really don't get it either. There are not too many HDF5 files open: I start a new R process for every configuration and every irrep. It just crashes. And since it worked on two configurations, there cannot be anything completely wrong with the program or the files.

kostrzewa commented 4 years ago

I meant globally. When there are O(30) projection jobs running, the number of memory-mapped files will be rather large, and this might be problematic for Lustre. What if you run a projection for a single config on QBIG?
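
Whether the HDF5 files are actually memory-mapped can be checked on Linux by looking at `/proc/<pid>/maps` and `/proc/<pid>/fd` of a running projection process. The following Python sketch does that. It assumes a Linux `/proc` file system and permission to inspect the process, and the function names are hypothetical.

```python
import os


def parse_mapped_files(maps_text, suffix=".h5"):
    """Extract mapped file paths ending in suffix from /proc/<pid>/maps text."""
    paths = set()
    for line in maps_text.splitlines():
        # Fields: address, permissions, offset, device, inode, optional path.
        fields = line.split(None, 5)
        if len(fields) == 6 and fields[5].startswith("/") and fields[5].endswith(suffix):
            paths.add(fields[5])
    return paths


def mapped_files(pid, suffix=".h5"):
    """Return the files with the given suffix that process pid has memory-mapped."""
    with open("/proc/%d/maps" % pid) as handle:
        return parse_mapped_files(handle.read(), suffix)


def open_files(pid, suffix=".h5"):
    """Return the files with the given suffix that process pid holds open."""
    fd_dir = "/proc/%d/fd" % pid
    paths = set()
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # The descriptor was closed between listdir and readlink.
            continue
        if target.endswith(suffix):
            paths.add(target)
    return paths
```

If `mapped_files` stays empty while `open_files` lists the correlator files, the files are open but not memory-mapped, which would argue against the mmap hypothesis.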

martin-ueding commented 4 years ago

> What if you run a projection for a single config on QBIG?

After all the projections were done, I tried that to see what the issue was. The problem appears even when only a single irrep is being projected on the whole cluster.

I will find out how the other ensembles fare with that; perhaps it is always this irrep, or perhaps just this irrep on cA2.60.32.

kostrzewa commented 3 years ago

Hah, we figured it out in the end. @matfischer observed the same problem, and it was solved by reinstalling rhdf5 :)

kostrzewa commented 3 years ago

not so fast, apparently...