CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

group method: not all UMI-containing reads assigned UG tag #635

Open epiliper opened 8 months ago

epiliper commented 8 months ago

Hi! After running group on a sorted bamfile of roughly 3.4 million reads, I noticed that only half (~1.6 million reads) were actually given UG tags.

I checked this by running samtools view -d UG on the grouped file and counting the number of reads meeting these filter conditions.

All reads in the bamfile I used should contain UMIs. Is this expected behavior, and if so, what might cause some reads to not get assigned to a read group?

Thanks in advance for your patience; I'd really appreciate any help/explanation.

I'm a new user so apologies if I'm missing the obvious.

TomSmithCGAT commented 8 months ago

Hi @epiliper - Could you please post the umi_tools group command used. Cheers

epiliper commented 8 months ago

Command used:

umi_tools group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=":" --paired -S UMI3-RSV90-10_S2_umi.bam

Here is the QNAME of a typical read in my input bamfile:

M04202:286:000000000-L5TBR:1:1102:16626:7061:TGTCAGGCTAT

Where TGTCAGGCTAT is the UMI.

When I run the above command, it runs with these options:

UMI-tools version: 1.1.4
 output generated by group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=: --paired -S UMI3-RSV90-10_S2_umi.bam
job started at Thu Mar 21 04:04:31 2024 on MacBeth.users.gowaveg.com -- fb9c6e30-e5bd-4b2f-9b5f-62ccf0d1abd9
pid: 40348, system: Darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:43 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6000 arm64
assigned_tag                            : None
cell_tag                                : None
cell_tag_delim                          : None
cell_tag_split                          : -
chimeric_pairs                          : use
chrom                                   : None
compresslevel                           : 6
detection_method                        : None
filter_umi                              : None
gene_tag                                : None
 gene_transcript_map                     : None
get_umi_method                          : read_id
ignore_tlen                             : False
ignore_umi                              : False
in_sam                                  : False
 log2stderr                              : False
loglevel                                : 1
 mapping_quality                         : 0
method                                  : directional
no_sort_output                          : False
out_sam                                 : False
output_bam                              : True
output_unmapped                         : False
paired                                  : True
per_cell                                : False
per_contig                              : False
per_gene                                : False
random_seed                             : None
read_length                             : False
short_help                              : None
skip_regex                              : ^(__|Unassigned)
soft_clip_threshold                     : 4
spliced                                 : False
stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin                                   : <_io.TextIOWrapper name='UMI3-RSV90-10_S2.aligned.sorted.bam' mode='r' encoding='UTF-8'>
stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
stdout                                  : <_io.TextIOWrapper name='UMI3-RSV90-10_S2_umi.bam' mode='w' encoding='UTF-8'>
subset                                  : None
threshold                               : 1
timeit_file                             : None
timeit_header                           : None
timeit_name                             : all
tmpdir                                  : None
tsv                                     : None
umi_group_tag                           : BX
umi_sep                                 : :
umi_tag                                 : RX
umi_tag_delim                           : None
umi_tag_split                           : None
umi_whitelist                           : None
umi_whitelist_paired                    : None
unmapped_reads                          : discard
unpaired_reads                          : use
whole_contig                            : False

Thanks again.

epiliper commented 8 months ago

Update: I think I know what might be happening...

Looking at part of the definition of get_bundles() in lines 320-327 of sam-methods:

if read.is_read2:
                if self.return_read2:
                    if not read.is_unmapped or (
                            read.is_unmapped and self.return_unmapped):
                        yield read, None, "single_read"
                continue
            else:
                self.read_events['Input Reads'] += 1

If I'm understanding this right, umi-tools doesn't consider read2 in grouping and downstream analysis. That would explain why roughly half of my reads are getting tagged.

Does this sound like a reasonable cause?

I apologize if I missed this in the docs; I guess checking the percentage of reads tagged with "UG" after running group isn't commonly done.

TomSmithCGAT commented 8 months ago

Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉

Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options https://github.com/CGATOxford/UMI-tools/blob/7e799bc120f185128e3983cbc328e180a0b6b263/umi_tools/group.py#L242-L253

Looking over the documentation, I don't think this is stated anywhere. @IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.

IanSudbery commented 8 months ago

I don't see it in the documentation either. I think it should definitely be documented. I also wonder if a tool similar to prepare-for-rsem could add grouping info to the read2s.

Ian Sudbery (He/Him) Senior Lecturer in Bioinformatics, Sheffield Institute for Nucleic Acids, School of Biosciences, The University of Sheffield.

web: www.sudlab.co.uk Tel: 0114 222 2738 Twitter: IanSudbery Show Calendar Availability https://calendar.google.com/calendar/u/0?cid=aS5zdWRiZXJ5QHNoZWZmaWVsZC5hYy51aw

On Tue, 26 Mar 2024 at 12:52, Tom Smith @.***> wrote:

Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉

Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options

https://github.com/CGATOxford/UMI-tools/blob/7e799bc120f185128e3983cbc328e180a0b6b263/umi_tools/group.py#L242-L253

Looking over the documentation, I don't think this is stated anywhere. @IanSudbery https://github.com/IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.

— Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/635#issuecomment-2020346784, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJELDXZMV55M36RQUT3GNDY2FOPDAVCNFSM6AAAAABFANQ64KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGM2DMNZYGQ . You are receiving this because you were mentioned.Message ID: @.***>

TomSmithCGAT commented 8 months ago

OK, I'll put updating the docs regarding this on the to-do.

If we do want to include read tags on read2, I'm torn between allowing this in group and a separate tool to add these afterwards. The former seem cleaner and shouldn't be too hard, but the later will probably be quicker to actually get done! Thoughts?..

epiliper commented 8 months ago

Thank you so much for confirming the cause of half-tagged reads! Peace of mind has been restored.

If you're asking me for my thoughts:

All of this is of course just proposals; Either way, we'd be interested to hear more about this considering it would have a massive impact on how we set up our NGS runs.

Thanks again!