gaolabtools / scNanoGPS

Single cell Nanopore sequencing data for Genotype and Phenotype
Other
39 stars 2 forks source link

Split the fastq file into 24 chromosomes #8

Closed kir1to455 closed 1 year ago

kir1to455 commented 1 year ago

Hi, Thank you for developing scNanoGPS! I have a question about the pipeline. Due to the large amount of memory it consumes, can I split the fastq file into 24 chromosomes to run this pipeline?

Best wishes, Kirito

kir1to455 commented 1 year ago

Or can I split fastq according to the number of rows? My fastq.gz file storage takes up 293G.

shiauck commented 1 year ago

Hi Kirito,

Thanks for your questions. Our pipeline is composite of several steps. Theoretically some of the steps can be done separately.

(1) Scanner: scNanoGPS spends about 18 Hrs (fastq.gz file size is about 150GB) for cell barcode recognition. The computing time is determined by your computing node and hard drive I/O. This step doesn't require much memory. Theoretically, you can split your raw reads into batches and it won't affect the barcode scanning result. But the batched result files (processed.fastq.gz and barcode_list.tsv.gz) need to be merged back to master in the same order for the next step.

(2) Assigner: In this step, scNanoGPS consumes a lot of memory for determination of optimal cell barcode number and correct them. We don't recommend using splitted batches because the optimal cell barcodes list could be different between batches, and it's not easy to merge cell barcode lists back to master.

(3) Curator: This step requires computing power to correct reads and generate consensus sequence. It consumes computing power (generating consensus sequence) but not much memory. Yes you can split by different chromosomes (sequentially and exhaustively run through all the chromosomes from reference genome fa files) for batch computing to do mapping, curation, and consensus sequence generating.

(4) Reporters: For now, we implement respectively featureCount, LIQA, and Longshot for expression, isoform, and SNV profiles generating. One of the reasons choosing these tools is feasible computing time/memory usage. You can freely choose other efficient tools in the future.

Hope this helps. Please let me know if you have any questions.

Regards, Cheng-Kai

kir1to455 commented 1 year ago

Hi Cheng-Kai, Thanks for your reply so fast! Part of the test data has successfully completed the curator step. I noticed that the tmp folder barcode.curated.minimp2.bam did not provide the matched genes, transcripts, and sequences. And the barcode.fastq.gz file seems to retained the polyA tail? I want to know these reads are located on which transcripts and I can't find some relevant parameters. Do I need to re align using barcode.fastq.gz?

Best wishes, Kirito

shiauck commented 1 year ago

Hi Kirito,

The barcode.curated.minimap2.bam contains alignments. You can use "samtools view" to check the alignments with consensus reads. Our pipeline doesn't add further annotation into BAM. You can check the instruction of SAM/BAM alignment. (https://samtools.github.io/hts-specs/SAMv1.pdf)

I'm not sure about the "matched genes, transcripts, and sequences" you mentioned. If you are saying the expression profile / isoform profiles, they are generated by reporter_expression.py / reporter_isoform.py, respectively.

It's normal that barcode.fastq.gz retains polyA tails. The scanner.py only trims off adaptor sequences containing barcodes.

Please let me know if you have any further questions. I'm glad to help.

Regards, Cheng-Kai

kir1to455 commented 1 year ago

Hi Cheng-Kai, Thanks for your reply! I found that barcode.curated.minimap2.bam only mapping to the genome.fa with CMD "minimap2 -ax splice -t 30 genome.mmi barcode.consensus.fa" I want to use the barcode.consensus.fa to realign with transcript.fa. However, I noticed that barcode.consensus.fa is a tmp file ? It seems that barcode.consensus.fa has not been output and I only found the barcode.fastq.gz file in tmp file folder.

Thank you again for your timely reply! Best wishes, Kirito

shiauck commented 1 year ago

Hi Kirito,

Curator.py delete temporal consensus FastA file after aligning it back to reference genome, and then merge the alignment back to curated.minimap2.bam. If you want to keep the alignment result, please use the parameter "--keep_meta 1", which will keep all the temporal files. But this parameter is for developing and debugging purpose, you take the risk that the workflow could over-write or accidentally re-use same temporal file name and come out with wrong results. If there's no necessary, please try to avoid using this parameter. Hope this helps.

Regards, Cheng-Kai

kir1to455 commented 1 year ago

Hi Cheng-Kai, Thank you for updating scNanoGPS! I used scNanoGPS-1.1 and found that it is much faster in assigner.py compared to scNanoGPS-1.0. It may have taken a week before, but now it only takes nearly 2 hours. Is this normal? image Here is my code: python $scNanoGPS_dir/assigner.py -i $Output_dir/barcode_list.tsv.gz -d $Output_dir -t 35 I don't think the new parameters will have any issues with my original code.

Thank you for your reply! Best wishes, Kirito

shiauck commented 1 year ago

Hi Kirito,

I created a new issue and quoted your question there. I'm closing this issue. Please let me know if you have further questions and I'm happy to help. Thanks.

Regards, Cheng-Kai