Open epiliper opened 8 months ago
Hi @epiliper - Could you please post the umi_tools group
command used. Cheers
Command used:
umi_tools group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=":" --paired -S UMI3-RSV90-10_S2_umi.bam
Here is the QNAME of a typical read in my input bamfile:
M04202:286:000000000-L5TBR:1:1102:16626:7061:TGTCAGGCTAT
Where TGTCAGGCTAT is the UMI.
When I run the above command, it runs with these options:
UMI-tools version: 1.1.4
output generated by group -I UMI3-RSV90-10_S2.aligned.sorted.bam --output-bam --umi-separator=: --paired -S UMI3-RSV90-10_S2_umi.bam
job started at Thu Mar 21 04:04:31 2024 on MacBeth.users.gowaveg.com -- fb9c6e30-e5bd-4b2f-9b5f-62ccf0d1abd9
pid: 40348, system: Darwin 23.4.0 Darwin Kernel Version 23.4.0: Wed Feb 21 21:44:43 PST 2024; root:xnu-10063.101.15~2/RELEASE_ARM64_T6000 arm64
assigned_tag : None
cell_tag : None
cell_tag_delim : None
cell_tag_split : -
chimeric_pairs : use
chrom : None
compresslevel : 6
detection_method : None
filter_umi : None
gene_tag : None
gene_transcript_map : None
get_umi_method : read_id
ignore_tlen : False
ignore_umi : False
in_sam : False
log2stderr : False
loglevel : 1
mapping_quality : 0
method : directional
no_sort_output : False
out_sam : False
output_bam : True
output_unmapped : False
paired : True
per_cell : False
per_contig : False
per_gene : False
random_seed : None
read_length : False
short_help : None
skip_regex : ^(__|Unassigned)
soft_clip_threshold : 4
spliced : False
stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
stdin : <_io.TextIOWrapper name='UMI3-RSV90-10_S2.aligned.sorted.bam' mode='r' encoding='UTF-8'>
stdlog : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
stdout : <_io.TextIOWrapper name='UMI3-RSV90-10_S2_umi.bam' mode='w' encoding='UTF-8'>
subset : None
threshold : 1
timeit_file : None
timeit_header : None
timeit_name : all
tmpdir : None
tsv : None
umi_group_tag : BX
umi_sep : :
umi_tag : RX
umi_tag_delim : None
umi_tag_split : None
umi_whitelist : None
umi_whitelist_paired : None
unmapped_reads : discard
unpaired_reads : use
whole_contig : False
Thanks again.
Update: I think I know what might be happening...
Looking at part of the definition of get_bundles()
in lines 320-327 of sam-methods
:
if read.is_read2:
if self.return_read2:
if not read.is_unmapped or (
read.is_unmapped and self.return_unmapped):
yield read, None, "single_read"
continue
else:
self.read_events['Input Reads'] += 1
If I'm understanding this right, umi-tools
doesn't consider read2 in grouping and downstream analysis. That would explain why roughly half of my reads are getting tagged.
Does this sound like a reasonable cause?
I apologize if I missed this in the docs; I guess checking the percentage of reads tagged with "UG" after running group
isn't commonly done.
Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉
Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options https://github.com/CGATOxford/UMI-tools/blob/7e799bc120f185128e3983cbc328e180a0b6b263/umi_tools/group.py#L242-L253
Looking over the documentation, I don't think this is stated anywhere. @IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.
I don't see it in the documentation either. I think it should definitely be documented. I also wonder if a tool similar to prepare-for-rsem could add grouping info to the read2s.
Ian Sudbery (He/Him) Senior Lecturer in Bioinformatics, Sheffield Institute for Nucleic Acids, School of Biosciences, The University of Sheffield.
web: www.sudlab.co.uk Tel: 0114 222 2738 Twitter: IanSudbery Show Calendar Availability https://calendar.google.com/calendar/u/0?cid=aS5zdWRiZXJ5QHNoZWZmaWVsZC5hYy51aw
On Tue, 26 Mar 2024 at 12:52, Tom Smith @.***> wrote:
Second time in a week that an issue has been 'solved' by a user before I get around to it. I should be neglectful more often 😉
Yes, you're right that group doesn't add read tags to read2s, which are just written out as soon as they are read in, along with unmapped and/or chimeric reads depending on the options
Looking over the documentation, I don't think this is stated anywhere. @IanSudbery https://github.com/IanSudbery, do you agree this is an oversight, or have I missed it too?! I'm happy to rectify.
— Reply to this email directly, view it on GitHub https://github.com/CGATOxford/UMI-tools/issues/635#issuecomment-2020346784, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABJELDXZMV55M36RQUT3GNDY2FOPDAVCNFSM6AAAAABFANQ64KVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRQGM2DMNZYGQ . You are receiving this because you were mentioned.Message ID: @.***>
OK, I'll put updating the docs regarding this on the to-do.
If we do want to include read tags on read2, I'm torn between allowing this in group
and a separate tool to add these afterwards. The former seem cleaner and shouldn't be too hard, but the later will probably be quicker to actually get done! Thoughts?..
Thank you so much for confirming the cause of half-tagged reads! Peace of mind has been restored.
If you're asking me for my thoughts:
I asked some labmates also involved in UMI NGS, and we think a solution built within UMI-tools group
for R2 processing would be great.
A coworker proposed separating, grouping and deduplicating R1 and R2 in parallel, then aligning them both to the reference in a final bam file.
All of this is of course just proposals; Either way, we'd be interested to hear more about this considering it would have a massive impact on how we set up our NGS runs.
Thanks again!
Hi! After running
group
on a sorted bamfile of roughly 3.4 million reads, I noticed that only half (~1.6 million reads) were actually given UG tags.I checked this by running
samtools view -d UG
on the grouped file and counting the number of reads meeting these filter conditions.All reads in the bamfile I used should contain UMIs. Is this expected behavior, and if so, what might cause some reads to not get assigned to a read group?
Thanks in advance for your patience; I'd really appreciate any help/explanation.
I'm a new user so apologies if I'm missing the obvious.