CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

Support for chromium sequencing? #86

Closed danshu closed 7 years ago

danshu commented 7 years ago

Hi,

Chromium sequencing also has the problem of PCR duplicates — does UMI-tools support Chromium sequencing? Can I attach the barcode to the read name and then run "dedup"?

Best, Danshu

TomSmithCGAT commented 7 years ago

Hi @danshu. I've not got any experience with Chromium sequencing but UMI-tools should support it. Was there any particular reason you think it might not?

UMI-tools dedup by default expects the UMI to be in the read name. You can use UMI-tools extract to move the UMI there prior to alignment. Alternatively, if you're using an aligner which adds a tag to the read with the UMI, you can tell dedup to use it by setting the option --extract-umi-method=tag and setting the tag name with the --umi-tag option. For example, if your UMIs are encoded in the 'UM' tag, provide the following options: --extract-umi-method=tag --umi-tag=UM
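The two extraction methods can be pictured with a minimal sketch — these are hypothetical helper functions for illustration, not UMI-tools' actual internals:

```python
# Hypothetical sketch of the two places dedup can look for a UMI --
# not the real UMI-tools implementation.

def umi_from_read_name(read_name, separator="_"):
    """UMI appended to the read name, e.g. by `umi_tools extract`."""
    # Everything after the last separator is treated as the UMI.
    return read_name.rsplit(separator, 1)[-1]

def umi_from_tag(tags, umi_tag="UM"):
    """UMI stored in a BAM tag (--extract-umi-method=tag --umi-tag=UM)."""
    return tags[umi_tag]

print(umi_from_read_name("ST-E00129:523:3:1110:6837:51518_ACGT"))  # -> ACGT
print(umi_from_tag({"UM": "ACGT"}, umi_tag="UM"))                  # -> ACGT
```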

Let me know how you get on

danshu commented 7 years ago

Hi @TomSmithCGAT I thought UMI-tools should also support Chromium sequencing, but I didn't find any text mentioning this in the manuscript or in README.rst, so I just wanted to confirm before actually starting to use it.

I have appended barcodes to my read names; the barcodes are not in the read sequences. Here is an example:

@ST-E00129:523:HC2HTALXX:3:1110:6837:51518_AAACACCAGGCATGGTA BX:Z:AAACACCAGGCATGGT-1

So to dedup my reads, after mapping them to the reference, can I run:

$ umi_tools dedup --paired -I infile.bam -S deduped.bam -L dedup.log --extract-umi-method=read_id --umi-separator="_" --output-stats out_stats --method unique

I chose --method unique because my Chromium data have been processed using "Long Ranger BASIC" (https://support.10xgenomics.com/genome-exome/software/pipelines/latest/advanced/other-pipelines), which performs basic read and barcode processing including read trimming, barcode error correction, barcode whitelisting, and attaching barcodes to reads. The difference between --method unique and the other methods, as I understand it, is that the other methods try to deal with errors in UMIs?
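The difference between the methods can be sketched for the UMIs observed at a single mapping position. The merge rule below is a simplified version of the "directional" idea (a rare UMI one mismatch away from a much more abundant one is treated as an error of it); the real implementation builds networks of UMIs and differs in detail:

```python
# Simplified sketch of --method unique vs an error-aware method,
# applied to the UMI counts seen at one mapping position.
# Not the actual UMI-tools algorithm.

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def dedup_unique(umi_counts):
    """Every distinct UMI is kept as a separate molecule."""
    return len(umi_counts)

def dedup_directional(umi_counts):
    """Merge a UMI into a much more abundant neighbour at distance 1."""
    umis = sorted(umi_counts, key=umi_counts.get, reverse=True)
    absorbed = set()
    for i, parent in enumerate(umis):
        for child in umis[i + 1:]:
            if (child not in absorbed
                    and hamming(parent, child) == 1
                    and umi_counts[parent] >= 2 * umi_counts[child] - 1):
                absorbed.add(child)
    return len(umi_counts) - len(absorbed)

counts = {"AAAA": 100, "AAAT": 2, "GGGG": 50}   # AAAT is likely an error
print(dedup_unique(counts))       # -> 3
print(dedup_directional(counts))  # -> 2
```

With --method unique the likely-error UMI AAAT counts as a third molecule; the error-aware rule collapses it into AAAA.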

In Example UMI extraction (https://github.com/CGATOxford/UMI-tools/blob/master/QUICK_START.md#step-5--deduplication) it is mentioned that "UMI is bases 3-4, bases 1-2 and 5-6 are the sample barcode and need to be removed". But actually, bases 1-2 and 5-6 are attached to the read id and treated as part of the UMI?

Another question: can UMI-tools deal with reads of different lengths? I have trimmed adapters from my reads, so some reads may have become shorter.

TomSmithCGAT commented 7 years ago

Hi @danshu. Thanks very much for pointing out the error in QUICK_START, I'll amend that immediately!

From a quick read of the Long Ranger BASIC documentation you've posted, it looks like the BX tag will contain the error-corrected barcode. Unfortunately, as this is contained in the optional comment portion of the sequence identifier, it will be lost when you align the reads with a standard aligner. (Presumably the Long Ranger ALIGN pipeline is designed to retain this tag in the BAM?)

In the example above, the UMI you attach to the sequence identifier appears to be the raw uncorrected UMI (additional final 'A'). You essentially have two choices for the error correction. If you want to use the BASIC pipeline error correction you'll need to reformat the fastq sequence identifier so that the error corrected UMI is appended (AAACACCAGGCATGGT in the example above). You can then use the 'unique' method as you suggest. Alternatively, if you want to use our error correction method you can append the raw uncorrected UMI, and then use the 'directional', 'adjacency' or 'cluster' methods. I'm not sure how the BASIC pipeline error correction works so I'm not sure which would be best.
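The first option — reformatting the fastq sequence identifier so that the error-corrected UMI from the BX comment field replaces the raw one — could be sketched like this. It assumes headers shaped exactly like the example above; the function name is hypothetical:

```python
# Sketch: rewrite a fastq header so the error-corrected barcode from the
# BX:Z comment field replaces the raw UMI appended to the read name.
# Assumes headers like "@name_RAWUMI BX:Z:CORRECTED-1" as in the example.

def use_corrected_umi(header, separator="_"):
    name, _, comment = header.partition(" ")
    base = name.rsplit(separator, 1)[0]               # strip the raw UMI
    corrected = comment.split(":")[-1].split("-")[0]  # BX:Z:<seq>-1 -> <seq>
    return base + separator + corrected

h = ("@ST-E00129:523:HC2HTALXX:3:1110:6837:51518_AAACACCAGGCATGGTA"
     " BX:Z:AAACACCAGGCATGGT-1")
print(use_corrected_umi(h))
# -> @ST-E00129:523:HC2HTALXX:3:1110:6837:51518_AAACACCAGGCATGGT
```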

UMI-tools dedup will work with adapter-trimmed reads OK. We use the alignment start position to identify duplicate reads, so as long as the trimming has only been performed on the 3' end (e.g. adapter read-through) this will be fine.
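The reason 3' trimming is harmless can be seen from a sketch of a position-based grouping key (simplified; real tools also correct the coordinate for soft-clipping): reads are keyed on their 5' alignment coordinate, which 3' trimming leaves unchanged.

```python
# Sketch: duplicates are grouped by (chrom, strand, 5' position, UMI).
# Trimming the 3' end shortens the read but never moves its 5' coordinate.
# Simplified illustration, not the actual UMI-tools grouping code.

def dedup_key(chrom, start, end, strand, umi):
    # The 5' coordinate is the start on the forward strand,
    # the end on the reverse strand.
    pos5 = start if strand == "+" else end
    return (chrom, strand, pos5, umi)

full    = dedup_key("chr1", 1000, 1100, "+", "ACGT")  # untrimmed read
trimmed = dedup_key("chr1", 1000, 1080, "+", "ACGT")  # 3'-trimmed copy
print(full == trimmed)  # -> True
```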

TomSmithCGAT commented 7 years ago

Hi again @danshu. After a deeper read of the Long Ranger documentation, I would ignore my comment above and run the Long Ranger BASIC and ALIGN pipelines. If you believe there is a significant problem with PCR duplicates in the final output from the ALIGN pipeline, you can try using UMI-tools dedup specifying that the UMIs are encoded in a BAM tag (this will be supported in the next version which will be available shortly).

IanSudbery commented 7 years ago

If I might chime in? It sounds like you only want to run the BASIC pipeline and not the ALIGN one?

@TomSmithCGAT is right that in all likelihood whatever aligner you use will remove the error corrected barcode specified on the BX tag. You could post-process the reads to attach the barcode to the read name - e.g.

$ zcat input.fastq.gz | sed 's/ /_/g' | gzip > processed.fastq.gz

You'll need to do this for both read ends.

You could then run

$ umi_tools dedup --paired -I infile.bam -S deduped.bam -L dedup.log --extract-umi-method=read_id --umi-separator=":" --output-stats out_stats --method unique

The choice of method here is interesting. I have no idea how well the longranger error correction works. It would be interesting to see what the contents of the out_stats files look like, as this will tell you whether a different method is needed.

However, --paired and --output-stats together can make for quite slow operation. If it's too slow, try running with --chrom=chr1 to just run on a single chromosome in order to check the stats. We'd love to see the results if you wouldn't mind posting them up here.

danshu commented 7 years ago

@TomSmithCGAT @IanSudbery Thanks for your explanations. Actually the additional final 'A' I attached to the read id is for distinguishing between two Chromium libraries which share some common barcodes. I only ran Long Ranger BASIC and not the ALIGN pipeline. My reference genome is not human and I intend to use bwa or bowtie for mapping. I don't care about errors in barcodes; barcodes with low read counts will be filtered in the next step. I'm glad to share my results after running UMI-tools — what kind of results/statistics do you want? output-stats?

IanSudbery commented 7 years ago

Yeah, the contents of the file specified to --output-stats would be great.

As for low-count UMIs: we find that there is evidence that even UMIs that are not low count might arise as errors, mostly from PCR rather than from sequencing.


danshu commented 7 years ago

@TomSmithCGAT @IanSudbery Here are the --output-stats for the longest contig.

out_stats_per_umi_per_position.tsv:

counts  instances_pre  instances_post
1       1685954        1685954
2       99483          99483
3       6279           6279
4       472            472
5       45             45
6       15             15
7       5              5
8       3              3
9       6              6
13      1              1

out_stats_edit_distance.tsv:

edit_distance  unique   unique_null
Single_UMI     1606296  1606296
0              0        0
1              13585    0
2              7        0
3              20       0
4              18       5
5              38       12
6              94       69
7              195      212
8              504      666
9              1189     1627
10             2462     3320
11             4732     6194
12             8652     10718
13             13806    17131
14             6591     9408
15             2611     4308
16             851      1536
17             128      277
18             0        0

Does this mean that no duplicates were identified?

TomSmithCGAT commented 7 years ago

Thanks @danshu. Could you post the exact command used, and the last 10 lines of the log as well?

danshu commented 7 years ago

# threshold                               : 1
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# umi_sep                                 : _
# umi_tag                                 : RX
# whole_contig                            : False
2017-03-03 15:47:02,196 INFO Searching for mates for 0 unmatched alignments
2017-03-03 15:47:02,220 INFO 0 mates never found
2017-03-03 15:48:09,347 INFO Input Reads: 1930537, Paired Reads: 1930537, Read 2 unmapped: 20823, Read 1 unmapped: 3628
2017-03-03 15:48:09,348 INFO Number of reads out: 1792263
# job finished in 18670 seconds at Fri Mar  3 15:48:09 2017 -- 18583.25 73.00  0.00  0.00 -- 1bb50664-a71b-43f8-b896-edad4b9fbb55

IanSudbery commented 7 years ago

So, if I had to guess, I'd say that what is happening here is that the reads have already been de-duplicated on the basis of the barcode by the Long Ranger pipeline. The pipeline contains a task called "bucket fastq by barcode" — I would say it's doing UMI deduping pre-alignment, simply on the basis of the UMI (and possibly the read sequence). All of the mapped reads entering dedup are being output, so the file contains no duplicates.

There are positions with multiple UMIs; it's just that these UMIs are always different. However, for a very large fraction of these positions the UMIs found at them differ at only a single position, suggesting that these differences are due to errors in the barcode, and a dead giveaway that these reads have been de-duplicated using a unique-UMI scheme that doesn't account for errors.
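This diagnostic — "UMIs at the same position that differ at only a single base" — can be sketched as a minimal pairwise check (illustrative only; the real stats come from dedup's null comparison in out_stats_edit_distance.tsv):

```python
# Sketch: at a position with several UMIs, find the smallest pairwise
# edit (Hamming) distance. An excess of distance-1 pairs points to
# sequencing/PCR errors left behind by a unique-UMI dedup scheme.

from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def min_pairwise_distance(umis):
    return min(hamming(a, b) for a, b in combinations(umis, 2))

position_umis = ["AAACACCAGGCATGGT", "AAACACCAGGCATGGA"]  # one mismatch
print(min_pairwise_distance(position_umis))  # -> 1
```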

danshu commented 7 years ago

Thanks and I will further check that!