Open jhasselmann808 opened 3 years ago
Hi there,
I have a somewhat unrelated follow-up question: how were your antibody tags generated? I have ADT, GEX, and HTO libraries and the 10x matrix good to go. I'm just wondering what I should use for the antibody input argument for CITE-seq-Count.
Thank you
Hello @jhasselmann808, yes, each additional core used will increase memory usage. The next release does help with the mapping stage; the memory bottleneck is now mainly in the UMI correction step.
I would suggest using fewer cores. I think 4 to 6 might be a sweet spot for your data.
Multiprocessing in Python is not the best from what I've seen, and I don't have enough experience with it to spot an obvious fix I might be missing. So, for now, that's the best I can propose.
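Not a fix inside CITE-seq-Count itself, but for anyone hitting the same wall, one generic pattern for bounding multiprocessing memory is to cap how many chunks are in flight instead of submitting every chunk up front with `apply_async`. A minimal sketch with mock data (all names here are illustrative stand-ins, not CITE-seq-Count internals):

```python
from multiprocessing import Pool

def map_chunk(lines):
    # Stand-in for the per-chunk mapping work: count mock
    # "reads" in the chunk that start with 'A'.
    return sum(1 for line in lines if line.startswith("A"))

def chunked(iterable, size):
    # Yield fixed-size chunks lazily so the full input never
    # has to sit in the parent process all at once.
    chunk = []
    for item in iterable:
        chunk.append(item)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def count_in_parallel(reads, processes=4, chunk_size=1000):
    # Submit at most `processes` chunks before waiting on their
    # results; queueing every chunk up front keeps all of their
    # data alive in the parent and blows up RAM.
    total = 0
    with Pool(processes=processes) as pool:
        batch = []
        for chunk in chunked(reads, chunk_size):
            batch.append(pool.apply_async(map_chunk, (chunk,)))
            if len(batch) == processes:
                total += sum(r.get() for r in batch)
                batch = []
        total += sum(r.get() for r in batch)
    return total

if __name__ == "__main__":
    mock_reads = ["ACGT", "TTTT", "AGGA", "CCCC"] * 500
    print(count_in_parallel(mock_reads, processes=2, chunk_size=100))  # prints 1000
```

The trade-off is some idle time between batches; a callback-plus-semaphore scheme avoids that but is more code.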
Let me know if you get blocked by this.
@Hoohm Thanks for getting back to me. I tried running on 2 cores and it still crashed. Since everything works on a single core, I will just shelve my impatience for now and run my samples that way. It sounds like the adjustments in the next release may help, so I will be looking forward to that.
@YingzhengXu I am assuming this comment was directed to me. I used Biolegend TotalSeq-A antibodies so my tags.csv file consisted of the 15bp tag and the gene name separated by a comma. The file looks like this:
GAGTCACCAATCTGC,CD9
ACTGATGGACTCAGA,ITGA1
TCAGAACGTCTAACT,MRC1
TGACCCGACCTTTAG,ABCB1
TAAGACTTGGCCGTC,ABCG2
TTTCAACGCCCTTTC,ANPEP
If you are using a different format for your antibody tags, you will have to consult the documentation at https://hoohm.github.io/CITE-seq-Count/Running-the-script/ since I am not familiar with the details of other tag preparations.
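Before launching a long run, it can be worth sanity-checking the tags file against the format above (one 15 bp sequence and one name per line, comma-separated). A small sketch of such a check; `check_row` and `validate_tags` are hypothetical helpers, not part of CITE-seq-Count:

```python
import csv
import re

# A TotalSeq-A style barcode: exactly 15 bases of A/C/G/T.
BARCODE_RE = re.compile(r"^[ACGT]{15}$")

def check_row(row):
    # Return None if a parsed row looks like "15bp_sequence,NAME",
    # otherwise a short description of the problem.
    if len(row) != 2:
        return "expected 2 comma-separated fields"
    if not BARCODE_RE.match(row[0]):
        return f"bad barcode: {row[0]!r}"
    return None

def validate_tags(path):
    # Return (line_number, error) pairs for every malformed line.
    with open(path, newline="") as handle:
        rows = enumerate(csv.reader(handle), start=1)
        return [(n, err) for n, row in rows if (err := check_row(row)) is not None]
```

An empty list from `validate_tags("tags.csv")` means every line parsed as a 15 bp barcode plus a name; note that TotalSeq-B/C kits use different barcode sets, so the 15 bp assumption only holds for TotalSeq-A.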
@jhasselmann808 sounds good :)
Btw, are you running 1.4.4?
Yes, I am running 1.4.4
Hi @Hoohm, I have been trying to align some ADT data and the process pool keeps crashing after maxing out the RAM available on the system. I am running on a system with 32 cores and 64 GB of RAM, and the call I am using is:
CITE-seq-Count -T 20 -R1 '/path/to/read1.fastq.gz' -R2 '/path/to/read2.fastq.gz' -t '/path/to/tags.csv' -cbf 1 -cbl 16 -umif 17 -umil 28 -cells 5313 -wl "/path/to/whitelist.tsv" -o "/path/to/output/directory"
After running for a while, the memory maxes out, the terminal says "Killed", and then it starts throwing repeated broken-pipe errors with the following traceback (identical for the other worker processes):
This is my first time doing CITE-seq, so maybe my data set is just larger than normal (261 antibody tags, a 3.4 GB Read 1, and a 2.2 GB Read 2)? I have found that if I subsample the fastq files or reduce the number of tags in the tags.csv file, everything runs fine. The pipeline also completes if I run the files on a single core, so I know the issue isn't with the files themselves. It seems to be a problem with the apply_async call that wraps the map_reads function, but I am not well versed enough in Python to implement the parallelization in a way that avoids the memory issue.
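For the subsampling step mentioned above, one simple approach is to take the head of both gzipped FASTQ files, since a FASTQ record is exactly 4 lines and trimming R1 and R2 to the same read count keeps the mates in sync. A sketch; `take_reads` and `subsample_fastq` are illustrative helpers, not CITE-seq-Count functions:

```python
import gzip
from itertools import islice

def take_reads(lines, n_reads):
    # A FASTQ record is exactly 4 lines, so the first n_reads
    # records are the first n_reads * 4 lines.
    return islice(lines, n_reads * 4)

def subsample_fastq(in_path, out_path, n_reads):
    # Head-based subsampling of a gzipped FASTQ: run this on R1
    # and R2 with the same n_reads so the pairs stay matched.
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(take_reads(src, n_reads))
```

The head of a FASTQ is not a random sample (early reads come from the same flow-cell region), so for a representative subset a seeded random sampler such as `seqtk sample -s100` applied to both mates is the usual choice; the head version is just the quickest way to get a small test input.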
In case it helps, you can download the dataset that I am processing here (https://www.dropbox.com/s/7dqcrn2wl3jz668/CITE_ADT_File.zip?dl=0).
Let me know if I can provide any other details, and thanks for your time and for developing CITE-seq-Count!