Closed parichitran closed 8 months ago
That's odd. The log suggests that UMI-tools is only reading in 5951 reads, which is very different from your 2760592. The only way I can see this happening is if you have a very large number of read2 alignments, but a far smaller number of read1 alignments.
What sort of sequencing is this?
Hello IanSudbery, the data above is CEL-seq data incorporating UMIs (8 bp cell barcode + 5 bp random UMI). I am attaching the headers of my read 1 and read 2 files: HEADER OF ALL FILES.txt
Thank you so much for your timely reply, and for any future replies too.
Okay, so, as this is single cell, you need to do things a bit differently.
Firstly, as far as I can tell, read1 contains little or no usable information. Once you have processed the cell barcodes and UMIs from read1, you will want to discard read1 and not use it further (i.e. map only read2, not read1). This means that your data is not paired-end, and you should drop the --paired flag.
Secondly, you need to tell UMI-tools that it is dealing with data that has separate cells that need to be handled separately. You can do this by adding the --per-cell flag, to treat reads from different cells with the same UMI as not duplicates of each other.
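To make that concrete, here is a minimal sketch of the adjusted commands. The file names are hypothetical, and the barcode pattern assumes the 8 bp cell barcode (C) followed by the 5 bp UMI (N) sits at the start of read 1:

```shell
# Extract cell barcode (C) and UMI (N) from read1; only the processed
# read2 file is kept for mapping, processed.1.fastq.gz is then discarded.
umi_tools extract -I read1.fastq.gz --bc-pattern=CCCCCCCCNNNNN \
    --read2-in=read2.fastq.gz --read2-out=processed.2.fastq.gz \
    --stdout=processed.1.fastq.gz

# ...map processed.2.fastq.gz on its own (single-end)...

# Deduplicate without --paired, keeping cells separate
umi_tools dedup --per-cell -I mapped_sorted.bam -S deduped.bam
```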
Finally, I believe that in CEL-seq, PCR happens before fragmentation, and thus mapping position is not informative of duplication status. You will therefore need to assign reads to genes, and then deduplicate on the basis of the assigned gene, rather than the mapping position.
The whole process will be very similar to that for 10X outlined in this tutorial: https://umi-tools.readthedocs.io/en/latest/Single_cell_tutorial.html
Greetings IanSudbery, I followed what you recommended for my data above: I mapped only my processed read 2 with STAR, assigned reads to genes with featureCounts, and only then deduplicated. But I am still getting a small BAM file with low UMI counts. I am attaching my code and deduplication details:
umi_tools extract -I test1.fastq.gz --bc-pattern=CCCCCCCCNNNNN --read2-in=test2.fastq.gz --stdout=processed.1.fastq.gz --read2-out=processed.2.fastq.gz
STAR --runThreadN 10 --genomeDir -path-to/indext --readFilesIn processed.2.fastq.gz --outFileNamePrefix /path-to/1 --sjdbGTFfile path-to/Hg19genome/Homo_sapiens.GRCh38.gtf --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat
featureCounts -a /Hg19genome/Homo_sapiens.GRCh38.gtf -o gene_assigned -R BAM Aligned.sortedByCoord.out.bam -T 10
samtools sort Aligned.sortedByCoord.out.bam.featureCounts.bam -o assigned_sorted.bam
samtools index assigned_sorted.bam
umi_tools dedup --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S d6.bam --output-stats=deduplicated --log=LOGFILE
umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S counts.tsv.gz --log=LOGFILE
thanks in advance
The logs suggest there has been an improvement from ~5,000 reads surviving deduplication to ~50,000 reads, which is a ten-fold improvement. Note, however, that 3M reads are being skipped because they are not assigned to any gene.
Yeah, I agree IanSudbery, compared to the previous deduplication it is now a ten-fold increase. My concern is only those huge numbers of unassigned reads, because the people who processed this same dataset had 112,819 UMI reads after deduplication. They actually processed the reads in the following sequence: 1) aligned the reads to the RefSeq transcriptome; 2) unaligned reads were then aligned to the genome using Bowtie; 3) remaining unaligned reads were mapped with GSNAP; finally, they collapsed UMI reads mapping to exons of the same gene. (But they didn't mention which tools they used or how they deduplicated.) As per the UMI-tools single cell protocol, their method can also be used, but it strongly suggests mapping to the genome and assigning genes to reads, which is what I followed. My question is: will the read count vary because of the processing method, or is there any way I can get the same level of UMI reads as them by tweaking the 3M skipped reads?
What isn't mentioned in your description is what they do with reads that multimap. In the sample you provided, all the unassigned reads were unassigned because they were multimapped.
There is no reason you couldn't follow their procedure, but using UMI-tools to do the collapsing. If you have mapped to the transcriptome, you can use the --per-contig
switch, rather than the --per-gene
switch, to treat all reads aligned to the same contig as being from the same transcript. The problem with this is that almost all the reads will be multimapped (as most expressed regions of the genome are present in multiple isoforms). The problem with a read mapping to multiple locations is that it is not clear what to do if it is selected to be kept at one location, but not another. This is why we generally recommend aligning to the genome, rather than the transcriptome.
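If you did take the transcriptome route, the collapsing step might look roughly like the sketch below. File names follow the STAR output naming used elsewhere in this thread, and the sort/index steps are assumptions about how the BAM would be prepared:

```shell
# Sort and index the transcriptome-space alignments
samtools sort Aligned.toTranscriptome.out.bam -o transcriptome_sorted.bam
samtools index transcriptome_sorted.bam

# Collapse UMIs per transcript (contig) rather than per mapping position
umi_tools dedup --per-contig --per-cell -I transcriptome_sorted.bam -S deduped.bam
```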
One alternative would be to use the switches -M --primary-only
on featureCounts - this would allow multimapped reads to be used, but would only consider one alignment from each read (generally chosen at random). This still has the problem that UMI networks might be broken by read B in an A -> B -> C UMI chain being randomly allocated to a different location from A and C, but it might represent a reasonable compromise in the circumstances.
I'll just finally note that if the original study didn't use UMI-tools, then it is likely that you will find fewer reads post-deduplication, as we implement error correction that most other tools don't, and this leads to more reads being collapsed. Although I wouldn't expect that to be 1M -> 50k.
Regarding multimapping, they didn't mention any details about it. I then actually tried mapping the processed reads with STAR using the --quantMode TranscriptomeSAM option, and tried to deduplicate/count the reads with --per-contig and --per-gene, but there I am getting only zero reads.
STAR --runThreadN 10 --genomeDir path-to/Hg19genome/indext --readFilesIn processed.2.fastq.gz --outFileNamePrefix path-to/o --sjdbGTFfile path-to/Hg19genome/Homo_sapiens.GRCh38.gtf --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM --readFilesCommand zcat
I got this file aligned to the transcriptome, "Aligned.toTranscriptome.out.bam", and then sorted it for deduplication.
zless -S /pathto/Homo_sapiens.GRCh38.gtf | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$6"\t"$2}' | sort | uniq | sed 's/\"//g' | tee q1.txt
umi_tools count --per-contig --per-gene --gene-transcript-map=q1.txt -I sort.bam -S o.tsv
I got the following output log.txt
I am attaching the header of my BAM file from STAR and the transcript-gene map file: transriptomeBAM.txt tsx2gene_ensemble.txt
In the tsx2gene map, you want genes in column 1 and transcripts in column 2, not vice-versa.
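For example, swapping the two fields printed by the awk step puts gene_id first. Below is a toy run on a single fake Ensembl-style GTF transcript line; the field positions $2 (gene) and $6 (transcript) assume gene_id, gene_version, and transcript_id appear in that order in the attribute column, so check a real line of your own GTF first:

```shell
# One fake Ensembl-style transcript line standing in for the real GTF
printf '1\thavana\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2";\n' > toy.gtf

# Same pipeline as before, but printing $2 (gene) before $6 (transcript)
grep -v "^#" toy.gtf \
  | awk '$3=="transcript"' \
  | cut -f9 \
  | tr -s ";" " " \
  | awk '{print $2"\t"$6}' \
  | sed 's/"//g' \
  | sort -u > tx2gene.txt

cat tx2gene.txt
```

On a GTF whose attribute field is ordered differently, the field numbers will shift, so adjust $2/$6 accordingly.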
Thanks a lot IanSudbery for your response. Yeah, now it is running fine.
Again Ian, in the above transcriptome-mapping method with STAR: when I pass the multimapping filter --outFilterMultimapNmax 1 I get only 61,905 reads, and without that parameter (default STAR settings) 95,239 reads. I also checked the duplication percentage for both after deduplication using Picard MarkDuplicates: 0.087039 and 0.082774 respectively. I agree my read count has increased very much compared to previous attempts. My question is whether those 95,239 reads will be fine, because that is very close to the 112,819 reads in the original paper.
IMPORTANT INFORMATION: if a read mapped within 1 kb of the 3′ end of a gene and was in the same orientation as the gene, they assessed it as a transcript-mapping read for that gene.
Again, if I change to --outFilterMultimapNmax 50 I get 100,684 reads, but only 58.7% of the genes from the original data are covered. Please help me find out how to get the same reads and genes as them. Thanks in advance.
I'm closing due to inactivity
Greetings, my code for deduplicating my paired-end data:
Step 1: (barcode/UMI extraction)
Step 2: (Mapping the processed read with STAR)
Step 3: (Deduping)
My terminal output for the above deduping:
BAM content before deduping: 2760592 lines
BAM content after deduping: 8454 lines