CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
487 stars 190 forks source link

before deduplication BAM file size:102M and after deduplication it is severly reduced to 378K #518

Closed parichitran closed 8 months ago

parichitran commented 2 years ago

Greetings, My code for deduplicting my paired end data:

Step-1:(barcode/umi extraction)**

umi_tools extract -I abc.fastq.gz --bc-pattern=CCCCCCCCNNNNN --read2-in=cba.fastq.gz --stdout=processed.5.fastq.gz --read2-out=processed.6.fastq.gz

Step-2:(Mapping the processed read with STAR)**

STAR --runThreadN 10 --genomeDir path-to/Hg19genome/indext --readFilesIn processed.5.fastq.gz processed.6.fastq.gz --outFileNamePrefix path-to/o --sjdbGTFfile path-to/Hg19genome/Homo_sapiens.GRCh38.gtf --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat

Step-3(Deduping)

umi_tools dedup -I umoAligned.sortedByCoord.out.bam --paired  -S dedup2.bam

My terminal output for above deduping:


# UMI-tools version: 1.1.2
# output generated by dedup -I umoAligned.sortedByCoord.out.bam --paired --output-stats=deduplicated -S dedup4.bam
# job started at Thu Mar  3 13:52:54 2022 on gayenlab -- ec4d478c-b297-42ca-adb9-05cebd42abd6
# pid: 6177, system: Linux 5.13.0-30-generic #33~20.04.1-Ubuntu SMP Mon Feb 7 14:25:10 UTC 2022 x86_64
# assigned_tag                            : None
# cell_tag                                : None
# cell_tag_delim                          : None
# cell_tag_split                          : -
# chimeric_pairs                          : use
# chrom                                   : None
# compresslevel                           : 6
# detection_method                        : None
# filter_umi                              : None
# gene_tag                                : None
# gene_transcript_map                     : None
# get_umi_method                          : read_id
# ignore_tlen                             : False
# ignore_umi                              : False
# in_sam                                  : False
# log2stderr                              : False
# loglevel                                : 1
# mapping_quality                         : 0
# method                                  : directional
# no_sort_output                          : False
# out_sam                                 : False
# output_unmapped                         : False
# paired                                  : True
# per_cell                                : False
# per_contig                              : False
# per_gene                                : False
# random_seed                             : None
# read_length                             : False
# short_help                              : None
# skip_regex                              : ^(__|Unassigned)
# soft_clip_threshold                     : 4
# spliced                                 : False
# stats                                   : deduplicated
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='umoAligned.sortedByCoord.out.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='dedup4.bam' mode='w' encoding='UTF-8'>
# subset                                  : None
# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
# umi_sep                                 : _
# umi_tag                                 : RX
# umi_tag_delim                           : None
# umi_tag_split                           : None
# umi_whitelist                           : None
# umi_whitelist_paired                    : None
# unmapped_reads                          : discard
# unpaired_reads                          : use
# whole_contig                            : False
2022-03-03 13:52:54,462 WARNING Chimeric read pairs are being used. Some read pair UMIs may be grouped/deduplicated using just the mapping coordinates from read1.This may also increase the run time and memory usage. Consider --chimeric-pairs==discard to discard these reads or --chimeric-pairs==output (group command only) to output them without grouping
2022-03-03 13:52:54,462 WARNING Unpaired read pairs are being used. Some read pair UMIs may be grouped/deduplicated using just the mapping coordinates from read1.This may also increase the run time and memory usage. Consider --unpared-reads==discard to discard these reads or --unpared-reads==output (group command only) to output them without grouping
2022-03-03 13:52:54,462 INFO command: dedup -I umoAligned.sortedByCoord.out.bam --paired --output-stats=deduplicated -S dedup4.bam
2022-03-03 13:52:56,474 INFO total_umis 5951
2022-03-03 13:52:56,474 INFO #umis 858
2022-03-03 13:53:03,151 INFO Searching for mates for 0 unmatched alignments
2022-03-03 13:53:05,351 INFO 0 mates never found
2022-03-03 13:53:05,462 INFO Reads: Input Reads: 5951, Read pairs: 5951
2022-03-03 13:53:05,462 INFO Number of reads out: 4479
2022-03-03 13:53:05,462 INFO Total number of positions deduplicated: 4466
2022-03-03 13:53:05,462 INFO Mean number of unique UMIs per position: 1.01
2022-03-03 13:53:05,462 INFO Max. number of unique UMIs per position: 3
# job finished in 11 seconds at Thu Mar  3 13:53:05 2022 -- 12.11  0.77  0.00  0.00 -- ec4d478c-b297-42ca-adb9-05cebd42abd6

BAM content before:2760592 lines BAM content after deduping:8454 lines

IanSudbery commented 2 years ago

Thats odd. The log suggests that UMI-tools is only reading in 5951 reads, which is very different from your 2760592. The only way I can see this happening is if you have a very large number of read2 alignments, but a far smaller number of read1 alignments.

What sort of sequencing is this?

parichitran commented 2 years ago

Hello IanSudbery, The above processed data is celseq incorporated with umi(8bpbarcode+5bp random umis) I am here attaching my header of read 1 and read 2 HEADER OF ALL FILES.txt

Thank you so much for you timely reply and for the future replies too

IanSudbery commented 2 years ago

Okay, so, as this is single cell, you need to do things a bit differently.

Firstly, as far as I can tell, read1 contains little or no usable information. Once you have processed the cell barcodes and UMIs from read1, you will want to discard read1 and not use it further (i.e. map only read2, not read1). This means that your data is not paired end, and you should drop the --paired

Secondly, you need to tell UMI tools that it is dealing with data that has separate cells that need to be dealt with separately. You can do this by adding the --per-cell flag, to treat reads from different cells with the same UMI as not duplicates of each other.

Finally, I believe that in CEL-seq, that PCR happens before fragmentation, and thus, mapping position is not informative of duplication status. Thus, you will need to assign reads to genes, and then deduplicate on the basis of assigned gene, rather than mapping position.

The whole process will be very simlar to that for 10X outlined in this tutorial: https://umi-tools.readthedocs.io/en/latest/Single_cell_tutorial.html

parichitran commented 2 years ago

Greetings IanSudbery, I followed what you had recommended for my above data.I only Mapped my Processed Read-2 by STAR And I Assigned read to genes by feature count and then only i deduplicated it.But still I am getting a small bam files with less umi counts.I am here attaching my code and deduplicated details

Barcode extraction**

umi_tools extract -I test1.fastq.gz --bc-pattern=CCCCCCCCNNNNN --read2-in=test2.fastq.gz --stdout=processed.1.fastq.gz --read2-out=processed.2.fastq.gz

Mapping with STAR**

STAR --runThreadN 10 --genomeDir -path-to/indext --readFilesIn processed.2.fastq.gz --outFileNamePrefix /path-to/1 --sjdbGTFfile path-to/Hg19genome/Homo_sapiens.GRCh38.gtf --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat

Assigning reads to genes**

featureCounts -a /Hg19genome/Homo_sapiens.GRCh38.gtf -o gene_assigned -R BAM Aligned.sortedByCoord.out.bam -T 10

Sorting &Indexing the BAM file for downstream**

samtools sort Aligned.sortedByCoord.out.bam.featureCounts.bam -o assigned_sorted.bam samtools index assigned_sorted.bam

Deduplicating and Counting Umi**

umi_tools dedup --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S d6.bam --output-stats=deduplicated --log=LOGFILE umi_tools count --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S counts.tsv.gz --log=LOGFILE

logfiles of Deduplication & Count**

deduplog.txt countlog.txt

Header of sorted BAM after assigning genes (assigned_soted.bam)which is used above**

HEADER OF BAM files.txt

thanks in advance

IanSudbery commented 2 years ago

The logs suggest there has been an improvement of ~5,000 reads surviving deduplication to ~50,000 reads, which is a ten fold improvement. Note however, that 3M reads are being skipped because they are not assigned to any gene.

parichitran commented 2 years ago

yeah, I agree IanSudbery comparing to previous deduplication,its increasing to ten fold now.My concern is those unassigned huge reads only because people whoever processed this same dataset had 1,12,819 umi reads after deduplication. Actually they processed read in following sequence 1) alligned the reads to RefSeq transcriptome>2)Unaligned reads were then aligned to the genome using Bowtie>3)again mapped unalligned with GSNAP and finally they collpased umi reads mapping to exon of same gene.(But they didnt mentioned about the tools and how they deduplicated it) As per umi_tools single cell protocol the above mentioned their method can also be used but they strongly suggested to use mapping with genome and assign genes to read which i was followed. My question is will the read count varies because of the processing methods or if any way that i can too get the same level of umi reads like them by tweaking the 3M skipped reads

IanSudbery commented 2 years ago

What isn't mentioned in your description is what they do with reads that multimap, In the sample you provided, all the unassigned genes were unassigned because they were multimapped.

There is no reason you couldn't follow their proceedure, but using UMI-tools to do the collapsing. If you have mapped to the transcriptome, you can use the --per-contig switch, rather than the --per-gene switch, to treat all reads aligned to the same contig as being from the same transcript. The problem with this is that almost all the reads will be multimapping (as most expressed regions of the genome are present in multiple isoforms). The problem with a read mapping to multiple locations is that it is not clear what to do if is selected to be kept at one location, but not another. This is why we generally recommend aligning to the genome, rather than the transcriptome.

One alternative would be to use the switches -M --primary-only on featureCounts - this would allow multiplemapped reads to be used, but would only consider one alignment from each read (generally chosen at random). This still has the problem that UMI networks might be broken by read B in a A -> B -> C umi chain being randomly allocated to a different location from A and C, but it might represent a reasonable compromise in the circumstances.

I'll just finally note that if the original study didn't use UMI-tools, then it is likely that you will find fewer reads post deduplication, as we implement error correction that most other tools don't, and this leads to more reads being collapsed. Although I wouldn't expect that to be 1M -> 50k.

parichitran commented 2 years ago

Regarding multimap they didnt mentioned any details about it. Then I actually tried to map the processed read with STAR by --quantMode TranscriptomeSAM options and tried to deduplicate/count the reads with per contig per gene.But there I am getting only zero reads.

STAR Mapping **

STAR --runThreadN 10 --genomeDir path-to/Hg19genome/indext --readFilesIn processed.2.fastq.gz --outFileNamePrefix path-to/o --sjdbGTFfile path-to/Hg19genome/Homo_sapiens.GRCh38.gtf --outSAMtype BAM SortedByCoordinate --quantMode TranscriptomeSAM--readFilesCommand zcat

I got this Alligned file to transcriptome "Aligned.toTranscriptome.out.bam" then i sorted the above file for deuplication.

Transcript-to-gene map generation

zless -S /pathto/Homo_sapiens.GRCh38.gtf | grep -v "#" | awk '$3=="transcript"' | cut -f9 | tr -s ";" " " | awk '{print$6"\t"$2}' | sort | uniq | sed 's/\"//g' | tee q1.txt

UMI_Deduplication/Count

umi_tools count --per-contig --per-gene --gene-transcript-map=q1.txt -I sort.bam -S o.tsv

I got the following output log.txt

I am here attach my header of BAM file from STAR and the Transcript-gene-map.txt file transriptomeBAM.txt tsx2gene_ensemble.txt

IanSudbery commented 2 years ago

In the tsx2gene map, you want genes in column 1 and transcripts in column 2, not vice-versa.

parichitran commented 2 years ago

Thanks a lot IanSudbery for your response, Yeah now it is running fine Iansudbery.

Again Ian in the above Trancriptome mapping method with STAR when I pass multimapping filter --outFilterMultimapNmax 1 i got only 61905 reads and Without running multimapping parameter 95239 reads(Default STAR parameter) Even i checked deduplication percent for both after deduplication is 0.087039 ,0.082774 percents respectively using picard markduplicates I agree my reads increased to verymuch when compared to previous times.My question is that 95239 reads will be fine,because that was very close to 1,12,819 reads which was in the original paper.

IMPORNTANT INFORMATION:Reads mapped within 1kb of the 3′ end of a gene and was in the same orientation as the gene, they assessed it as a transcript-mapping read for that gene.

parichitran commented 2 years ago

Again if i change --outFilterMultimapNmax 50 i got 100684 reads but can only get 58.7% genes covered from the original data Please help me to find how to get same reads and genes as same as them Thanks in advance

TomSmithCGAT commented 8 months ago

I'm closing due to inactivity