epi2me-labs / wf-single-cell


processes got stuck on large input #119

Open wanghlv opened 1 week ago

wanghlv commented 1 week ago

Ask away!

Hi! I have a quick question. I've read several posts where people's processes get stuck at the stringtie step on large inputs, and I've encountered the same thing. Specifically, an input of >165 GB got stuck at chr2 of stringtie for 2+ days. I have no problem with files of ~40-80 GB. So in order to proceed, I split my fastq files into two batches and processed them independently (a rough sketch of the split is below). My question: I think this will affect my expression matrices; specifically, cells and genes might be dropped because the input is reduced by half. I also kept the stringtie step at the default -c 2. I'm hoping you might have some suggestions. Thanks!
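For reference, the split itself was simple, something along these lines (a minimal Python sketch with placeholder file names and a naive alternating scheme):

```python
import gzip
import itertools

# Minimal sketch: split a gzipped FASTQ into two batches by alternating
# 4-line records between the two outputs. File names are placeholders.
with gzip.open("reads.fastq.gz", "rt") as fin, \
     gzip.open("batch1.fastq.gz", "wt") as out1, \
     gzip.open("batch2.fastq.gz", "wt") as out2:
    outputs = itertools.cycle([out1, out2])
    while True:
        record = [fin.readline() for _ in range(4)]
        if not record[0]:
            break
        next(outputs).writelines(record)
```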

nrhorner commented 6 days ago

Hi @wanghlv

Sorry that you are having this issue; it is affecting several users. We aim to replace stringtie in the near future, so this issue should be fixed soon.

Splitting the data and processing the batches separately will not give the same results as processing everything together. For example, the number of barcodes may be slightly different, because the whole dataset is used to create a whitelist of known barcodes and cell count thresholding is then done with this list. The identified transcripts may be slightly different, and the filtering of cells and genes may be slightly affected too. All these differences should only affect low-abundance barcodes, transcripts, and genes.
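To illustrate the kind of effect this has, here is a rough sketch of count-based cell thresholding against a ranked barcode list. This is an illustrative heuristic only, not the workflow's actual implementation; the function name and parameters are made up:

```python
import numpy as np

def estimate_cell_threshold(barcode_counts, expected_cells=5000, quantile=0.95, ratio=10):
    """Illustrative heuristic: rank barcodes by read count, take a high
    quantile of the counts of the top `expected_cells` barcodes, and keep
    barcodes with at least that value divided by `ratio`.
    Not wf-single-cell's actual code."""
    counts = np.sort(np.asarray(barcode_counts))[::-1]
    top = counts[: min(expected_cells, len(counts))]
    return np.quantile(top, quantile) / ratio

# With half the reads per batch, sampling noise means barcodes sitting near
# the threshold can end up on either side of it in the two half-runs, so the
# retained cell sets can differ slightly between batches.
```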

I'll let you know when we have a release that fixes this.

Thanks

wanghlv commented 5 days ago

Hi Neil, thank you for explaining! I hope you will implement these changes, because I'm sure more and more people will be sequencing their libraries deeper. I was able to sequence two full chips per sample, and my largest library has more than 300 million reads after basecalling in super-high-accuracy mode and filtering for >Q10.

I understand the count matrices cannot be used after splitting the data. I am only using the tag.bam files from your pipeline for downstream processing through TALON, quantification, and Seurat; do you think these still carry the effects of the cell barcode issue? I may have missed something, but does the tag.bam contain all reads with the identified and error-corrected cell barcodes and UMIs? (A sketch of how I'm reading those tags is below.) Thanks a lot again, and I look forward to the next version fixing some of these issues.

Also, aside from stringtie, some of my large datasets get stuck at various steps after stringtie. I have sufficient CPU and RAM, so it's unclear why. Thanks.
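For context, this is roughly how I'm pulling the barcode/UMI tags out for TALON/Seurat, assuming the corrected values are stored in the standard CB/UB tags (please correct me if the workflow uses different tag names):

```python
import pysam

# Minimal sketch, assuming the corrected cell barcode and UMI are stored in
# the standard CB / UB BAM tags; adjust the tag names if the workflow differs.
with pysam.AlignmentFile("tag.bam", "rb") as bam:
    for read in bam.fetch(until_eof=True):
        if read.has_tag("CB") and read.has_tag("UB"):
            barcode = read.get_tag("CB")
            umi = read.get_tag("UB")
            # ...pass barcode/UMI through to downstream quantification
```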