Get the information of each cell at the pre-processing and mapping steps

cxzhu / Paired-Tag

Analysis of Paired-Tag datasets

MIT License


Get the information of each cell at the pre-processing and mapping steps #3

Closed ytang0831 closed 3 years ago

ytang0831 commented 3 years ago

Hi Chenxu! I ran into some confusion while reproducing your data: at the pre-processing and mapping steps, the QC information (e.g. total reads, mapped reads) seems to come from the entire sample rather than from single cells. How can I split the sample into individual cells, or is there another way to get per-cell information? Thanks a lot!

cxzhu commented 3 years ago

Hi @ytang0831,

Yes, in the preprocessing steps the total reads and mapped reads are from all cells within the library. After preprocessing, the cell IDs are attached to the read names in the FASTQ files; you will need to use this information to do the single-cell QC.

Attached is one of the custom scripts I used to summarize metadata for single cells; feel free to modify it according to your needs.

Best, Chenxu

summarize_mapped_reads_cells.pl.zip
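
For readers without the attached script, the idea of recovering per-cell QC from read names can be sketched as follows. This is a minimal Python sketch, not the attached Perl script. The read-name layout `<original name>:<cell_id>:<umi>` and the function names are assumptions made for illustration; check the actual read names in your preprocessed FASTQ files and adapt `split_tag` accordingly.

```python
from collections import Counter


def split_tag(read_name):
    """Extract the cell ID from a read name.

    HYPOTHETICAL layout: "<original name>:<cell_id>:<umi>".
    Adjust this function to match the real read names.
    """
    return read_name.rsplit(":", 2)[1]


def fastq_read_names(lines):
    """Yield the read name from each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        # The header is the first line of every 4-line record.
        if i % 4 == 0 and line.strip():
            # Drop the leading "@" and any description after whitespace.
            yield line.strip()[1:].split()[0]


def count_reads_per_cell(read_names, get_cell_id=split_tag):
    """Count how many reads carry each cell ID."""
    return Counter(get_cell_id(name) for name in read_names)
```

To use it on a real file, something like `count_reads_per_cell(fastq_read_names(gzip.open(path, "rt")))` should work, where `path` points to a preprocessed FASTQ file. The same counting can be applied to read names in a mapped file to get mapped reads per cell.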

ytang0831 commented 3 years ago

@cxzhu Thanks for your quick reply! But I still have a couple of questions.

Should I merge all sub-libraries together? I used one sub-library and found that the number of split reads per cell was lower than the number in your paper.
According to your code, my $n_fragments_all = keys %{$unique_all{$cell_id}}; , the number of unique fragments should be the same as the number of unique reads, but in your paper's supplementary files the number of unique fragments is about half the number of unique reads. So did you calculate the number of unique fragments after removing duplicates, and the other values, such as the number of unique reads, before removing duplicates?

cxzhu commented 3 years ago

Hi @ytang0831

Yes, you need to merge all sub-libraries, as different sub-libraries were sequenced to different depths, so a single sub-library may not represent the overall or most deeply sequenced library.
In Supplementary Table 2, the "Uniquely_mapped_DNA/RNA_reads" are all mapped reads that are uniquely assigned to a single genomic locus, before removing duplicates; the "nFragements_DNA" or "UMI_RNA" are the reads that remain after removing PCR duplicates (reads mapped to the same location with the same UMI and Cell ID are considered PCR duplicates). Since PCR duplicates make up ~50-60% of reads, this number is expected to be around half of the uniquely mapped reads.
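
The distinction between the two columns can be illustrated with a small Python sketch. It assumes each uniquely mapped read has already been reduced to a `(cell_id, chrom, pos, umi)` tuple; that record layout and the function name `summarize_cells` are assumptions for illustration, not the format used by the attached script. Reads sharing cell ID, location, and UMI collapse into one fragment, following the duplicate definition given above.

```python
from collections import defaultdict


def summarize_cells(records):
    """Per-cell read and fragment counts.

    records: iterable of (cell_id, chrom, pos, umi) tuples, one per
    uniquely mapped read (hypothetical layout).
    """
    n_reads = defaultdict(int)      # uniquely mapped reads, duplicates included
    fragments = defaultdict(set)    # distinct (location, UMI) keys per cell

    for cell_id, chrom, pos, umi in records:
        n_reads[cell_id] += 1
        fragments[cell_id].add((chrom, pos, umi))

    summary = {}
    for cell_id, total in n_reads.items():
        unique = len(fragments[cell_id])
        summary[cell_id] = {
            "uniquely_mapped_reads": total,
            "n_fragments": unique,
            "duplicate_fraction": 1 - unique / total,
        }
    return summary
```

With a duplicate fraction of about 0.5 to 0.6, `n_fragments` comes out at roughly half of `uniquely_mapped_reads`, which matches the ratio described in the reply.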

ytang0831 commented 3 years ago

@cxzhu Thanks!

ytang0831 commented 3 years ago

By the way, does merging sub-libraries mean merging all the fastq.gz files together and then processing them, or just binding the output matrices by column?

cxzhu commented 3 years ago

Hi @ytang0831, yes, both options work. As a reminder, you need to keep the cell_IDs from each sub-library distinguishable from those of other sub-libraries. Please refer to the README or the paper for more details.
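
One way to keep cell IDs distinguishable when binding matrices by column is to prefix each cell ID with its sub-library label before merging. The Python sketch below assumes each sub-library's matrix is held as a dict mapping cell ID to a `{feature: count}` column; that representation, the `sublibrary:cell_id` naming scheme, and the function name are illustrative assumptions, and the README may prescribe a different convention.

```python
def bind_by_column(matrices):
    """Merge per-sub-library count matrices into one, column-wise.

    matrices: dict mapping a sub-library label to a dict of
    {cell_id: {feature: count}} (hypothetical representation).
    Cell IDs are prefixed with the sub-library label so that the same
    barcode combination in two sub-libraries stays two separate cells.
    """
    merged = {}
    for sublibrary, matrix in matrices.items():
        for cell_id, column in matrix.items():
            key = f"{sublibrary}:{cell_id}"
            if key in merged:
                raise ValueError(f"duplicate cell ID after prefixing: {key}")
            merged[key] = dict(column)  # copy so inputs are not mutated
    return merged
```

For example, the barcode `01:02:03` appearing in sub-libraries `A` and `B` becomes the two columns `A:01:02:03` and `B:01:02:03` rather than being collapsed into one cell.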

ytang0831 commented 3 years ago

Yes, I did overlook this information. Thanks!