Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0

Memory issue while parsing beagle file #50

Closed ddenney1 closed 3 years ago

ddenney1 commented 3 years ago

Hello,

I am trying to run selection scans (both -selection and -pcadapt) using a Beagle file for the entire genome rather than on a per-chromosome basis. When I run the analyses on each chromosome individually, some chromosomes show much more inflated p-values than the others. I'd like to run the entire genome at once to see whether that fixes the issue. However, when running PCAngsd on the Beagle file for the entire genome, I get this error:

Parsing Beagle file.
Traceback (most recent call last):
  File "/scratch/dd66718/pcangsd/pcangsd.py", line 155, in <module>
    L = reader_cy.readBeagle(args.beagle)
  File "reader_cy.pyx", line 16, in reader_cy.readBeagle
    cpdef np.ndarray[DTYPE_t, ndim=2] readBeagle(str beagle):
  File "reader_cy.pyx", line 48, in reader_cy.readBeagle
    cdef np.ndarray[DTYPE_t, ndim=2] L_np = np.empty((m, n), dtype=DTYPE)
numpy.core._exceptions.MemoryError: Unable to allocate array with shape (156909440, 1065) and data type float32

I am using a script that requests 750 GB of memory. When the script fails on the cluster, the output reports that I have used 700 GB of memory. I am not as familiar with Python as I would like to be. Could I simply change the data type to float64 in the PCAngsd Python scripts? Or do you have another suggestion for loading large Beagle files?

Thanks, Derek

Rosemeis commented 3 years ago

Hi Derek,

PCAngsd needs to load the entire genotype likelihood file into memory, which is the issue for you here. np.empty() does not initialize the values, but NumPy still has to allocate the full array up front, so the MemoryError is raised regardless. And using np.float64 would only double the memory usage.
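As a rough back-of-the-envelope check (this is just NumPy arithmetic on the shape from the error message, not part of PCAngsd; the reading of the 1065 columns as 3 genotype likelihoods per individual, i.e. 355 samples, is an assumption based on the Beagle format):

```python
import numpy as np

# Shape reported in the MemoryError above:
# 156,909,440 sites x 1065 columns (3 genotype likelihoods per
# individual, so presumably 355 samples).
m, n = 156_909_440, 1065

for dtype in (np.float32, np.float64):
    gib = m * n * np.dtype(dtype).itemsize / 1024**3
    print(f"{np.dtype(dtype).name}: ~{gib:.0f} GiB")

# float32: ~623 GiB -- already close to the 750 GB job limit before
#          any parsing overhead.
# float64: ~1245 GiB -- doubling the dtype size only makes it worse.
```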

But I would advise you to perform MAF filtering in your ANGSD run before anything else, as most of these sites are effectively fixed anyway and would be filtered out. That would probably reduce your memory consumption to around 5% of what you are using now.
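For reference, a minimal sketch of what such an ANGSD run could look like (file names are placeholders, and the -SNP_pval/-minMaf cutoffs are just common example values to adjust for your own data):

```bash
angsd -bam bam.filelist -ref ref.fa -out wholegenome \
    -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 \
    -SNP_pval 1e-6 -minMaf 0.05 -nThreads 10
```

The resulting wholegenome.beagle.gz would then only contain the polymorphic sites and should be far smaller than the unfiltered file before it goes into PCAngsd.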

Best, Jonas

ddenney1 commented 3 years ago

Hi Jonas,

Thanks for this, that was very helpful. I had been misinformed about the protocol for my initial dataset, so I was using a much larger file than necessary.

Best, Derek