Open jhasselmann808 opened 3 years ago
Hi there,
I have a somewhat unrelated follow-up question: how were your antibody tags generated? I have ADT, GEX, and HTO libraries and the 10x matrix good to go. I'm just wondering what I should use for the antibody input argument for CITE-seq-Count.
Thank you
Hello @jhasselmann808, yes, each additional core used will increase memory usage. The next release does help with the mapping stage; the memory bottleneck is now mainly in the UMI correction step.
I would suggest using fewer cores. I think 4 to 6 might be a sweet spot for your data.
Multiprocessing in Python is not the best from what I've seen, and I don't have enough experience with it to spot an obvious fix I might be missing. So, for now, that's the best I can propose.
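Not a fix inside CITE-seq-Count itself, but for anyone hitting the same wall, one generic pattern for bounding multiprocessing memory is to cap how many chunks are in flight instead of submitting every chunk up front with `apply_async`. A minimal sketch with mock data (all names here are illustrative stand-ins, not CITE-seq-Count internals):

```python
from multiprocessing import Pool

def map_chunk(lines):
    # Stand-in for the per-chunk mapping work: count mock
    # "reads" in the chunk that start with 'A'.
    return sum(1 for line in lines if line.startswith("A"))

def chunked(iterable, size):
    # Yield fixed-size chunks lazily so the full input never
    # has to sit in the parent process all at once.
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def count_in_parallel(reads, processes=4, chunk_size=1000):
    # Submit at most `processes` chunks before waiting on their
    # results; queueing every chunk up front keeps all of their
    # data alive in the parent and blows up RAM.
    total = 0
    with Pool(processes=processes) as pool:
        batch = []
        for chunk in chunked(reads, chunk_size):
            batch.append(pool.apply_async(map_chunk, (chunk,)))
            if len(batch) == processes:
                total += sum(r.get() for r in batch)
                batch = []
        total += sum(r.get() for r in batch)
    return total

if __name__ == "__main__":
    mock_reads = ["ACGT", "TTTT", "AGGA", "CCCC"] * 500
    print(count_in_parallel(mock_reads, processes=2, chunk_size=100))  # prints 1000
```

The trade-off is some idle time between batches; a callback-plus-semaphore scheme avoids that but is more code.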
Let me know if you get blocked by this.
@Hoohm Thanks for getting back to me. I tried running on 2 cores and it still crashed. Since everything works on a single core, I will just shelve my impatience for now and run my samples that way. It sounds like the adjustments in the next release may help, so I will be looking forward to that.
@YingzhengXu I am assuming this comment was directed to me. I used Biolegend TotalSeq-A antibodies so my tags.csv file consisted of the 15bp tag and the gene name separated by a comma. The file looks like this:
GAGTCACCAATCTGC,CD9
ACTGATGGACTCAGA,ITGA1
TCAGAACGTCTAACT,MRC1
TGACCCGACCTTTAG,ABCB1
TAAGACTTGGCCGTC,ABCG2
TTTCAACGCCCTTTC,ANPEP
If you are using a different format for your antibody tags, you will have to consult the documentation at https://hoohm.github.io/CITE-seq-Count/Running-the-script/ since I am not familiar with the details of other tag preparations.
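Before launching a long run, it can be worth sanity-checking the tags file against the format above (one 15 bp sequence and one name per line, comma-separated). A small sketch of such a check; `check_row` and `validate_tags` are hypothetical helpers, not part of CITE-seq-Count:

```python
import csv
import re

# A TotalSeq-A style barcode: exactly 15 bases of A/C/G/T.
BARCODE_RE = re.compile(r"^[ACGT]{15}$")

def check_row(row):
    # Return None if a parsed row looks like "15bp_sequence,NAME",
    # otherwise a short description of the problem.
    if len(row) != 2:
        return "expected 2 comma-separated fields"
    if not BARCODE_RE.match(row[0]):
        return f"bad barcode: {row[0]!r}"
    return None

def validate_tags(path):
    # Return (line_number, error) pairs for every malformed line.
    with open(path, newline="") as handle:
        rows = enumerate(csv.reader(handle), start=1)
        return [(n, err) for n, row in rows if (err := check_row(row)) is not None]
```

An empty list from `validate_tags("tags.csv")` means every line parsed as a 15 bp barcode plus a name; note that TotalSeq-B/C kits use different barcode sets, so the 15 bp assumption only holds for TotalSeq-A.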
@jhasselmann808 sounds good :)
Btw, are you running 1.4.4?
Yes, I am running 1.4.4
Hi @Hoohm, I have been trying to align some ADT data and the process pool keeps crashing after maxing out the RAM available on the system. I am running on a system with 32 cores and 64 GB of RAM, and the call I am using is:
CITE-seq-Count -T 20 -R1 '/path/to/read1.fastq.gz' -R2 '/path/to/read2.fastq.gz' -t '/path/to/tags.csv' -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 5313 -wl "/path/to/whitelist.tsv" -o "/path/to/output/directory"
After running for a while, the memory maxes out, the terminal says "Killed", and then it starts throwing repeated broken-pipe errors with the following traceback (identical for the other worker processes):
This is my first time doing CITE-seq, so maybe my data set is just larger than normal (261 antibody tags, a 3.4 GB Read 1, and a 2.2 GB Read 2)? I have found that if I subsample the fastq files or reduce the number of tags in the tags.csv file, everything runs fine. The pipeline also completes if I run the files on a single core, so I know the issue isn't with the files themselves. It seems to be a problem with the apply_async call that wraps the map_reads function, but I am not well versed enough in Python to implement the parallelization in a way that avoids the memory issue.
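For the subsampling step mentioned above, one simple approach is to take the head of both gzipped FASTQ files, since a FASTQ record is exactly 4 lines and trimming R1 and R2 to the same read count keeps the mates in sync. A sketch; `take_reads` and `subsample_fastq` are illustrative helpers, not CITE-seq-Count functions:

```python
import gzip
from itertools import islice

def take_reads(lines, n_reads):
    # A FASTQ record is exactly 4 lines, so the first n_reads
    # records are the first n_reads * 4 lines.
    return islice(lines, n_reads * 4)

def subsample_fastq(in_path, out_path, n_reads):
    # Head-based subsampling of a gzipped FASTQ: run this on R1
    # and R2 with the same n_reads so the pairs stay matched.
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(take_reads(src, n_reads))
```

The head of a FASTQ is not a random sample (early reads come from the same flow-cell region), so for a representative subset a seeded random sampler such as `seqtk sample -s100` applied to both mates is the usual choice; the head version is just the quickest way to get a small test input.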
In case it helps, you can download the dataset that I am processing here (https://www.dropbox.com/s/7dqcrn2wl3jz668/CITE_ADT_File.zip?dl=0).
Let me know if I can provide any other details, and thanks for your time and for developing CITE-seq-Count!