NUStatBioinfo / DegNorm

Normalizing RNA degradation in RNA-seq data
https://nustatbioinfo.github.io/DegNorm/

Job is killed during the initial NMF-OA iteration #39

Closed: cheffelfinger closed this issue 4 years ago

cheffelfinger commented 4 years ago

Hello, I'm testing DegNorm out on some RNA-seq data that I suspect may have some degraded samples. There've been no issues loading the bam files and generating the coverage matrices, but when starting the NMF-OA iterations I get a "Killed" message with no other errors and the program exits.

Here're the relevant lines (I think) from the log:

DegNorm (01/22/2020 10:00:55) ---- DegNorm will run on 101040 genes, downsampling rate = 1 / 1, with baseline selection.
DegNorm (01/22/2020 10:00:55) ---- Executing NMF-OA over-approximation algorithm...
DegNorm (01/22/2020 10:00:56) ---- At least one coverage matrix is taller than it is wide. Ensure that coverage matrices are shaped (p x L_i).
DegNorm (01/22/2020 12:33:14) ---- Initial sequencing depth scale factors --

Here's the output I see. Note that it does appear to generate the initial sequencing depth scale factors.

NMF-OA iteration progress:   0%|          | 0/5 [00:00<?, ?it/s]
Killed

Any ideas on how I can figure out what's causing this? I'm able to warm start from the existing output directory, so I've tried adjusting the number of processors between 5 and 20. I'm running this on a virtual machine with Ubuntu 16.04, 28 cores, and 256 GB of RAM.

ffineis commented 4 years ago

Hello, thanks for using DegNorm.

100k genes sounds like a big job. Just curious, how many RNA-seq samples are there? I would profile the memory usage on a handful of samples, e.g. 5-10, with a tool like psrecord (https://pypi.org/project/psrecord/), or set up your own memory logging with the Linux watch utility (https://linuxize.com/post/linux-watch-command/). Track the maximum resident memory and see where it spikes with 5-10 samples; that should give you an idea of how the full job will scale.
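A minimal sketch of what that profiling could look like, assuming psrecord is installed from PyPI (the degnorm arguments below are placeholders for whatever command you normally run):

    # Launch the trial run under psrecord and log CPU / resident memory every 5 seconds.
    # --include-children sums memory across the worker processes degnorm spawns.
    pip install psrecord
    psrecord "degnorm <your usual degnorm arguments, limited to 5-10 BAM files>" \
        --log degnorm_memory.log --interval 5 --include-children

    # Or just poll overall memory use on the machine every 5 seconds while the job runs:
    watch -n 5 free -h

The log (or the watch output) should show whether resident memory spikes right when the NMF-OA iterations start; a bare "Killed" with no traceback is usually the kernel OOM killer.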

Second, you may want to look into using degnorm_mpi to distribute the NMF decomposition work across several servers.
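Roughly, that would look like the sketch below. Treat the flags as illustrative and check the DegNorm docs for the exact degnorm_mpi arguments; the idea is just that mpiexec launches degnorm_mpi workers on the hosts you list, and the per-gene NMF-OA work gets divided between them.

    # Illustrative only: run degnorm_mpi over 4 MPI ranks spread across the hosts in hosts.txt
    # (OpenMPI-style hostfile). <your usual degnorm arguments> is a placeholder for the same
    # arguments you would pass to a regular degnorm run.
    mpiexec -n 4 --hostfile hosts.txt degnorm_mpi <your usual degnorm arguments>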

Given the recent demand for very large sample sizes (e.g. users with hundreds of samples), and now genomes with 100k+ genes, I'm going to look into using Redis to cache the coverage matrices for memory efficiency; I plan to start testing that in February. There are likely other memory-saving measures I can take as well, but that is development I can't get to for a few weeks.

cheffelfinger commented 4 years ago

Thanks for the ideas; it helps a lot to know it might be a memory issue. Setting up an MPI run might be tricky, but reducing the number of samples sounds like it should work. While it would be nice to run the entire dataset, we should be able to work with subsets. We also have another node with about 3x the memory that I can try it out on. Anyway, I'll look forward to the updated version!