eddelbuettel / rcppcnpy

Rcpp bindings for NumPy files
GNU General Public License v2.0

Segmentation fault when loading large npy files (30GB input) #22

Open ryananeff opened 5 years ago

ryananeff commented 5 years ago

Hi there!

I tried loading the npy file from the Skymap dataset (Hannah Carter Lab, see link, in the efs/rnaseq_merged/ folder), but it fails to load. The file is 30GB and not compressed; its dimensions are 34677 rows × 225203 columns. It was saved using this command: np.save(filename+".npy",myDF.values)
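For what it's worth, the header of a .npy file can be inspected without loading the 30GB payload, which would confirm the dtype and shape the file actually carries. A small sketch using NumPy's own header reader (the path is the one from this report; any .npy file works):

```python
import numpy as np
from numpy.lib.format import read_magic, read_array_header_1_0

def npy_header(path):
    """Read shape, memory order, and dtype from a .npy header (format version 1.0)."""
    with open(path, "rb") as f:
        read_magic(f)  # consume the magic string and version bytes
        shape, fortran_order, dtype = read_array_header_1_0(f)
    return shape, fortran_order, dtype

# e.g. npy_header("Mus_musculus.gene_symbol.tpm.npy")
```

This reads only the first few hundred bytes, so it is safe even on files that crash a full load.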

I'm not sure why it is crashing; the output is below. Please give it a look.

Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(RcppCNPy)
> mmat = npyLoad("Mus_musculus.gene_symbol.tpm.npy")

 *** caught segfault ***
address 0x7f1d955e7000, cause 'memory not mapped'

Traceback:
 1: .External(list(name = "InternalFunction_invoke", address = <pointer: 0x22f3160>,     dll = list(name = "Rcpp", path = "/hpc/packages/minerva-common/rpackages/3.4.3/site-library/Rcpp/libs/Rcpp.so",         dynamicLookup = TRUE, handle = <pointer: 0x2218080>,         info = <pointer: 0xc19e00>), numParameters = -1L), <pointer: 0x221acf0>,     filename, type, dotranspose)
 2: npyLoad("Mus_musculus.gene_symbol.tpm.npy")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 3
eddelbuettel commented 5 years ago

How much RAM does the machine have?

Can you subset the data?

ryananeff commented 5 years ago

The machine has 1.55 TB of RAM. I would like to convert the npy array for R analysis in a shared HPC environment where many people will access the data. I do not yet know which columns/rows they will need; currently we are doing exploratory analysis on around 100 experiments from the dataset, but we may expand that in the future.

eddelbuettel commented 5 years ago

Ok, so it may not resource-starve itself, but it may run out of indexing size (the R_len_t vs R_xlen_t issue).

Can you maybe dig a little and see where the error occurs?

I.e., maybe mock a standalone cnpy call?

Also see the vignette I added using reticulate; it may give you a second codepath, but I am not sure whether an "extra large" size such as yours is fully tested there either.

RCIIIcm commented 5 years ago

I'm having the same issue. The file is 12GB uncompressed and the machine has 48GB of RAM. How can I check where the error is coming from? Should I install cnpy and see if I can load the file from C/C++, or can I do this directly from RcppCNPy?

dk657@tabulator:~/sdss_lite$ R

R version 3.6.0 (2019-04-26) -- "Planting of a Tree"
Copyright (C) 2019 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> library(RcppCNPy)
> restframe <- npyLoad("restframe.npy")

 *** caught segfault ***
address 0x7f2d33bcc440, cause 'memory not mapped'

Traceback:
 1: .External(list(name = "InternalFunction_invoke", address = <pointer: 0x3821720>,     dll = list(name = "Rcpp", path = "/home/dk657/.local/lib/R/library/Rcpp/libs/Rcpp.so",         dynamicLookup = TRUE, handle = <pointer: 0x3a1eba0>,         info = <pointer: 0x1893b30>), numParameters = -1L), <pointer: 0x2b360a0>,     filename, type, dotranspose)
 2: npyLoad("restframe.npy")

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:
eddelbuettel commented 5 years ago

Please try to debug it with cnpy only (ie from a C routine) so that we can see if an R-influenced data structure (32-bit int?) overflows.

RCIIIcm commented 5 years ago

The following loads the file fine and dumps a bunch of the first row onto the screen.

#include <iostream>

#include "cnpy.h"

int main() {
    cnpy::NpyArray arr = cnpy::npy_load("restframe.npy");
    float* loaded_data = arr.data<float>();
    for (int i = 0; i < 1000; i++) {
        if (i % 10 == 0)
            std::cout << std::endl;
        std::cout << loaded_data[i] << "\t";
    }
    return 0;
}

If it's helpful, I created the file in Python like so:

import numpy as np

restframe = np.memmap("restframe.mm",
                      dtype="float32", mode="w+",
                      shape=(n, p), order="C")
# ... fill it up ...
np.save("restframe.npy", restframe)

The data as accessed from cnpy::npy_load() came in row-major; I'm not sure if that's standard, or if it's because I specified row-major (order="C") when I created the memmap and that carried over into the npy file.
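For reference, np.save records the layout in the header's fortran_order flag, and NumPy's default C order is stored row-major, so cnpy hands data back row-major unless the array was Fortran-ordered when saved. A small in-memory sketch (no files involved):

```python
import io
import numpy as np
from numpy.lib.format import read_magic, read_array_header_1_0

def saved_fortran_order(arr):
    """np.save arr to a buffer and return the fortran_order flag from its header."""
    buf = io.BytesIO()
    np.save(buf, arr)
    buf.seek(0)
    read_magic(buf)  # skip the magic string and version bytes
    _, fortran_order, _ = read_array_header_1_0(buf)
    return fortran_order

a = np.arange(6, dtype=np.float32).reshape(2, 3)  # C order, NumPy's default
print(saved_fortran_order(a))                     # False: stored row-major
print(saved_fortran_order(np.asfortranarray(a)))  # True: stored column-major
```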

eddelbuettel commented 5 years ago

Good point. I think we currently try a naive transpose.

I am not sure there is something we can do here. :-/ You could try the reticulate approach which I detail in the second (newer) vignette but it also requires, methinks, that the object loads.

We have a ginormous machine at work where I could play with this for a few minutes at the end of the day. "Given sufficient RAM" it may work, or it may still fail -- I use size_t in a number of places and maybe that just overflows.

Can you ... split your data?

RCIIIcm commented 5 years ago

I just did what I needed to do in C++, so I no longer have an outstanding issue.

I tried importing with "dotranspose=F" and had the same error.

I also just tried it with two subsets of the problem data set, 10,000x3840 and 1000x3840 respectively (147M and 15M) and got the same error. I'll poke around more when I get a chance.

vivekverma080698 commented 5 years ago

Session termination while loading a 78.3 MB numpy file.

library('RcppCNPy')
mat <- RcppCNPy::npyLoad('feature_0.npy', type = 'numeric')

RCIIIcm commented 4 years ago

Ah! I have been working with single-precision floats this whole time (i.e. numpy.float32). I wonder if that is the source of the trouble, rather than the size of the array? I just converted one of the arrays I was having trouble with to numpy.float64 and it loaded just fine.

The array described in the post that opened this issue appears to be single-precision as well: the described dimensions with 4-byte (from, e.g., numpy.float32(0).itemsize) items gives a ~30GB object.

There are no single-precision floats in R though, right?
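If anyone else hits this, a Python-side workaround is to rewrite the float32 file as float64 before loading it in R. This doubles the on-disk size; the filenames below are hypothetical:

```python
import numpy as np

def npy_float32_to_float64(src, dst):
    """Rewrite a float32 .npy file as float64, the type R's doubles map onto."""
    arr = np.load(src, mmap_mode="r")   # memory-map so the float32 copy stays on disk
    np.save(dst, arr.astype(np.float64))

# npy_float32_to_float64("restframe_f32.npy", "restframe_f64.npy")
```

Note the astype() call still materializes the full float64 array in RAM, so very large files need correspondingly large memory.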

eddelbuettel commented 4 years ago

There are no single-precision floats in R though, right?

Right. (And another recent long-running issue thread was by someone discovering late that Python's integer is int64 by default whereas we have int32.)

R has a really limited set of types: signed 32-bit integers and 64-bit double-precision floats. That is it. And RcppCNPy does not cast, as it tries to be efficient and memory-maps directly. I have some words in the documentation, but I guess it bears repeating.

RCIIIcm commented 4 years ago

What about failing with a message if an incorrect type is passed? Maybe right after checking for endianness on line 94 in cnpy.cpp while parsing the header there could be a check for float64/int32?

bool acceptableType = (header[loc1+1] == 'f' && header[loc1+2] == '8') || (header[loc1+1] == 'i' && header[loc1+2] == '4');
Rassert(acceptableType, "Only double-precision float or 32-bit integer are allowed");
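The same guard can also be prototyped outside the C++ layer, e.g. as a pre-flight check on the header's dtype before calling npyLoad. A sketch in Python (not part of RcppCNPy; just illustrating the check):

```python
import numpy as np
from numpy.lib.format import read_magic, read_array_header_1_0

# The only NumPy dtypes that map directly onto R's native types.
ACCEPTABLE = {np.dtype("float64"), np.dtype("int32")}

def check_npy_for_r(path):
    """Raise TypeError if the .npy file's dtype cannot be loaded by R without casting."""
    with open(path, "rb") as f:
        read_magic(f)
        _, _, dtype = read_array_header_1_0(f)
    if dtype not in ACCEPTABLE:
        raise TypeError(f"{path}: dtype {dtype} needs casting "
                        "(only float64 / int32 map onto R types)")
    return dtype
```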
eddelbuettel commented 4 years ago

In principle, yes. In practice, harder as the header var is seen only in the (original) cnpy library and I prefer to alter that code as little as possible (as updating to newer versions becomes too tedious). You may also need multiple entry points for compressed and plain files.

RCIIIcm commented 4 years ago

Ah, I see. I hadn't noticed that was a replica of the original cnpy.cpp. We can still see the word size in the cnpy::NpyArray struct that's hanging around in cnpyMod.cpp, right? Can we check that it matches the appropriate word size in R?

eddelbuettel commented 4 years ago

I don't see how. But if you want to try in a pull request, maybe you'll get there.

Otherwise the status quo remains: by default, it works on data in the default sizes....