CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License
493 stars 190 forks source link

Problems with dedup output and RSEM #465

Closed lcabus-flomics closed 8 months ago

lcabus-flomics commented 3 years ago

Hi,

I'm using umi_tools dedup to remove PCR duplicates from an alignment to the transcriptome with STAR. After the deduplication, when I run RSEM, it seems that there are some reads from the pairs that are lost since the program exits with the following error: Read ST-E00114:1178:HFL75CCX2:7:1101:1610:55297_TTGCCATCTC: The adjacent two lines do not represent the two mates of a paired-end read! (RSEM assumes the two mates of a paired-end read should be adjacent)

The ran the command with --paired --multimapping-detection-method=NH --unpaired-reads=discard --chimeric-pairs=discard --unmapped_reads=discard

I have seen that this problem was already discussed in #384, but there is not an option on how to solve this. Do you have any idea on how to solve this issue or a workaround that could work for this case?

Thank you very much

IanSudbery commented 2 years ago

I think I've identified the problem, but I'm still coming up with a fix. Unfortunately (or not depending on your point of view!) I'm off on my Christmas break tomorrow. I'll have a look again when I get back.

avivdemorgan commented 2 years ago
body p { margin-bottom: 0cm; margin-top: 0pt; } 

Hi Ian,

Many thanks for your work on this, and enjoy your Christmas
vacation.

Best wishes,
Aviv.

On 20/12/2021 2:50, Ian Sudbery wrote:

  I think I've identified the problem, but I'm still
    coming up with a fix. Unfortunately (or not depending on your
    point of view!) I'm off on my Christmas break tomorrow. I'll
    have a look again when I get back.
  —
    Reply to this email directly, view it on GitHub, or unsubscribe.
    Triage notifications on the go with GitHub Mobile for iOS or Android.

    You are receiving this because you were mentioned.Message
      ID: ***@***.***>
  [

{ @.": "http://schema.org", @.": "EmailMessage", "potentialAction": { @.": "ViewAction", "target": "https://github.com/CGATOxford/UMI-tools/issues/465#issuecomment-997500081", "url": "https://github.com/CGATOxford/UMI-tools/issues/465#issuecomment-997500081", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { @.": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

-- 

Aviv De Morgan, Ph.D Head of Bioinformatics Pyxis Diagnostics High Tech Village Givat-Ram, POB 39158 Jerusalem 91391 Israel +972-2-6553333 @.***

IanSudbery commented 2 years ago

Okay, running umi_tools prepare-for-rsem on the sorted dedup output the file @ctuni sent me now passes rsem-sam-validator. I've also sorted out the source of the missing mates.

Give the lastest commit on the prepare-for-rsem branch a go.

ctuni commented 2 years ago

Hi! I hope this holiday break was enjoyable for everyone :) I have been able to test the new branch and it seems it is working fine.

Just to check everything is working as expected, this is the output of prepare-for-rsem command when used on the dedup file I sent:

umi_tools prepare-for-rsem -I SRR9113885_dedup_sorted.out.bam --stdout=ready_for_rsem.out.bam
# UMI-tools version: 1.1.2
# output generated by prepare-for-rsem -I SRR9113885_dedup_sorted.out.bam --stdout=ready_for_rsem.out.bam
# job started at Wed Dec 29 14:28:02 2021 on ip-172-31-20-55 -- 1c2dd3b9-cebc-4234-9fdb-0077dd9e10ae
# pid: 2173, system: Linux 5.11.0-1022-aws #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021 x86_64
# compresslevel                           : 6
# log2stderr                              : False
# loglevel                                : 1
# random_seed                             : None
# sam                                     : False
# short_help                              : None
# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>
# stdin                                   : <_io.TextIOWrapper name='SRR9113885_dedup_sorted.out.bam' mode='r' encoding='UTF-8'>
# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
# stdout                                  : <_io.TextIOWrapper name='ready_for_rsem.out.bam' mode='w' encoding='UTF-8'>
# tags                                    : UG,BX
# timeit_file                             : None
# timeit_header                           : None
# timeit_name                             : all
# tmpdir                                  : None
[E::idx_find_and_load] Could not retrieve index file for 'SRR9113885_dedup_sorted.out.bam'
2021-12-29 14:29:50,823 WARNING Alignment SRR9113885.2278629_TTGGCCTTCA 419 ENST00000302182 3 has no mate -- skipped
2021-12-29 14:30:50,506 INFO Total pairs output: 1854064, Pairs skipped - no mates: 1, Pairs skipped - not read1 or 2: 0
# job finished in 167 seconds at Wed Dec 29 14:30:50 2021 -- 169.00  2.47  0.00  0.00 -- 1c2dd3b9-cebc-4234-9fdb-0077dd9e10ae

The output of rsem-calculate-expression is quite long but it gives "success" messages along the way. Looking at the files produced by it, I think it is working properly. Those are the first lines of final.genes.results, and they look correct to me!

gene_id transcript_id(s)        length  effective_length        expected_count  TPM     FPKM
ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008 3768.00 3602.39 1.00    0.14    0.30
ENSG00000000005 ENST00000373031,ENST00000485971 873.50  707.96  0.00    0.00    0.00
ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752 789.00  623.40  1.00    0.80    1.74
ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 1538.00 1372.39 1.00    0.36    0.79
ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 3849.00 3683.39 3.00    0.40    0.88

Thank you very much for everything, I will test the new branch more extensively, but I think it is working correctly now! We can discuss it further if needed, but again, infinite thanks for the work and patience :)

avivdemorgan commented 2 years ago
body p { margin-bottom: 0cm; margin-top: 0pt; } 

Hi Ian,

These are wonderful news!
Thanks for your help and hard work.
I will pull and test.

Best wishes,
Aviv.

On 29/12/2021 1:03, Ian Sudbery wrote:

  Okay, running umi_tools prepare-for-rsem
    on the sorted dedup output the file @ctuni
    sent me now passes rsem-sam-validator. I've also
    sorted out the source of the missing mates.
  Give the lastest commit on the prepare-for-rsem
    branch a go.
  —
    Reply to this email directly, view it on GitHub, or unsubscribe.
    Triage notifications on the go with GitHub Mobile for iOS or Android.

    You are receiving this because you were mentioned.Message
      ID: ***@***.***>
  [

{ @.": "http://schema.org", @.": "EmailMessage", "potentialAction": { @.": "ViewAction", "target": "https://github.com/CGATOxford/UMI-tools/issues/465#issuecomment-1002314402", "url": "https://github.com/CGATOxford/UMI-tools/issues/465#issuecomment-1002314402", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { @.": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

-- 

Aviv De Morgan, Ph.D Head of Bioinformatics Pyxis Diagnostics High Tech Village Givat-Ram, POB 39158 Jerusalem 91391 Israel +972-2-6553333 @.***

avivdemorgan commented 2 years ago
body p { margin-bottom: 0cm; margin-top: 0pt; } 

Hi Cristina (and All),

Thanks for sharing this with us.
I will pull and test as well.

Best wishes,
Aviv.

On 29/12/2021 15:40, Cristina Tuñí i
  Domínguez wrote:

  Hi! I hope this holiday break was enjoyable for
    everyone :) I have been able to test the new branch and it seems
    it is working fine.
  Just to check everything is working as expected,
    this is the output of prepare-for-rsem command
    when used on the dedup file I sent:
  umi_tools prepare-for-rsem -I SRR9113885_dedup_sorted.out.bam --stdout=ready_for_rsem.out.bam

UMI-tools version: 1.1.2

output generated by prepare-for-rsem -I SRR9113885_dedup_sorted.out.bam --stdout=ready_for_rsem.out.bam

job started at Wed Dec 29 14:28:02 2021 on ip-172-31-20-55 -- 1c2dd3b9-cebc-4234-9fdb-0077dd9e10ae

pid: 2173, system: Linux 5.11.0-1022-aws #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021 x86_64

compresslevel : 6

log2stderr : False

loglevel : 1

random_seed : None

sam : False

short_help : None

stderr : <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>

stdin : <_io.TextIOWrapper name='SRR9113885_dedup_sorted.out.bam' mode='r' encoding='UTF-8'>

stdlog : <_io.TextIOWrapper name='' mode='w' encoding='utf-8'>

stdout : <_io.TextIOWrapper name='ready_for_rsem.out.bam' mode='w' encoding='UTF-8'>

tags : UG,BX

timeit_file : None

timeit_header : None

timeit_name : all

tmpdir : None

[E::idx_find_and_load] Could not retrieve index file for 'SRR9113885_dedup_sorted.out.bam' 2021-12-29 14:29:50,823 WARNING Alignment SRR9113885.2278629_TTGGCCTTCA 419 ENST00000302182 3 has no mate -- skipped 2021-12-29 14:30:50,506 INFO Total pairs output: 1854064, Pairs skipped - no mates: 1, Pairs skipped - not read1 or 2: 0

job finished in 167 seconds at Wed Dec 29 14:30:50 2021 -- 169.00 2.47 0.00 0.00 -- 1c2dd3b9-cebc-4234-9fdb-0077dd9e10ae

  The output of rsem-calculate-expression
    is quite long but it gives "success" messages along the way.
    Looking at the files produced by it, I think it is working
    properly. Those are the first lines of final.genes.results,
    and they look correct to me!
  gene_id transcript_id(s)        length  effective_length        expected_count  TPM     FPKM

ENSG00000000003 ENST00000373020,ENST00000494424,ENST00000496771,ENST00000612152,ENST00000614008 3768.00 3602.39 1.00 0.14 0.30 ENSG00000000005 ENST00000373031,ENST00000485971 873.50 707.96 0.00 0.00 0.00 ENSG00000000419 ENST00000371582,ENST00000371584,ENST00000371588,ENST00000413082,ENST00000466152,ENST00000494752 789.00 623.40 1.00 0.80 1.74 ENSG00000000457 ENST00000367770,ENST00000367771,ENST00000367772,ENST00000423670,ENST00000470238 1538.00 1372.39 1.00 0.36 0.79 ENSG00000000460 ENST00000286031,ENST00000359326,ENST00000413811,ENST00000459772,ENST00000466580,ENST00000472795,ENST00000481744,ENST00000496973,ENST00000498289 3849.00 3683.39 3.00 0.40 0.88

  Thank you very much for everything, I will test the
    new branch more extensively, but I think it is working correctly
    now! We can discuss it further if needed, but again, infinite
    thanks for the work and patience :)
  —
    Reply to this email directly, view it on GitHub, or unsubscribe.
    Triage notifications on the go with GitHub Mobile for iOS or Android.

    You are receiving this because you were mentioned.Message
      ID: ***@***.***>
  [

{ @.": "http://schema.org", @.": "EmailMessage", "potentialAction": { @.": "ViewAction", "target": "https://github.com/CGATOxford/UMI-tools/issues/465#issuecomment-1002599490", "url": "https://github.com/CGATOxford/UMI-tools/issues/465#issuecomment-1002599490", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { @.": "Organization", "name": "GitHub", "url": "https://github.com" } } ]

-- 

Aviv De Morgan, Ph.D Head of Bioinformatics Pyxis Diagnostics High Tech Village Givat-Ram, POB 39158 Jerusalem 91391 Israel +972-2-6553333 @.***

epruesse commented 2 years ago

I can confirm that your new script is necessary and sufficient for STAR + Salmon quantification of bulk RNA-seq data with UMIs.

@IanSudbery - could you summarize the issue as you understand it real quick? I think I've gotten lost in this thread. Where exactly do STAR, UMI-tools and RSEM/Salmon differ in their expectations of "proper BAM" contents?

The flow currently seems to be:

The name "prepare-for-rsem" might be misleading, since it's necessary for Salmon, too. (It completes, but gives warnings and bogus results). Perhaps there is a better choice. Ideally, of course, the "dedup" output would not have to be "fixed" for Salmon/RSEM.

Last question - when would you expect the new script or even a real fix to be released? Helps a lot if it can pass through the whole bioconda packaging thing and be available for pipelines without messing with branches. (I know, no time, me neither..., but pretty please perhaps?)

IanSudbery commented 2 years ago

I'm definitely open to suggestions for a more elegant fix, so let me know if you have one.

Salmon/RSEM expect that BAM files are sorted such that each read is immediately followed by its mate. This means that there must be exactly read 2 for every read 1 and they must immediately follow each other in the BAM file. STAR outputs alignments in this manner. In each pair of reads the mpos and mnext fields point to the mate alignment for that read (i.e. the one that immediately follows/precedes it).

UMI-tools requires that the BAM file is position sorted. How else can it be sure that it has seen all the reads mapping to a given location before deduplicating them. However, that means that when it marks a read for retention, it must also find the mate for that read, which it does using the mnext and mpos fields. However, the SAM format specification says that the mpos and mnext field should point to the primary alignment of the mate, not the one that might naturally be assumed to pair with it (i.e. the one that would come next in the STAR output). Different aligners also differ in how many times they will output read2 if it aligns to one location, but read1 aligns to two. STAR will output read2 twice, where as BWA will only output it once. Consider the following two situations:

read1(primary @ pos 100)            read2 (primary @ pos 120)
read1(secondary @ pos 200)

STAR will output

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos120)
**read2(secondary @ pos 120, mate @ pos 200)** <--- this alignment is redundant

where as BWA will output:

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos120)

However, now consider:

read1(primary @ pos 100)            read2 (primary @ pos 120)
read1(secondary @ pos 200)        read2 (secondary @ pos 220)

STAR will output:

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos220) <--- invalid SAM, mpos points to secondary alignment
read2(secondary @ pos 220, mate @ pos 200) 

BWA will output:

read1(primary @ pos 100, mate @ pos 120)
read2(primary @ pos 120, mate @ pos 100)
read1(secondary @ pos 200, mate @ pos120)  <--- This read is now valid SAM, but points to the "wrong" mate
read2(secondary @ pos 220, mate @ pos 200) <--- This read is never reffered to as the mate of anything, and so will never be output by UMI-tools

This is a problem, because when UMI tools decides to output a pair (based on the information in read1), it adds read2 to a set of things to look for, once it has been seen, and output, then it removes it from the set. So each read2 is only ever output once.

It may be possible to add a counter, so that if two read1s point to the same read2, then UMI-tools knows to output it twice. Trouble is, what then happens if the input file is from BWA?

There are other problems with multimapped reads because you can get different answers to the question "is this read unique?" at different mapping locations, even though the read must either be from a unique biological molecule or be a PCR duplicate, and really, you'd like to remove a read if any of the alignments for its are non-unique, but that is not possible in the current UMI-tools framework.

It has been suggested that perhaps we should just disallow multimapping paired end sequencing, and say that it is Salmon/RSEMs job to deal with it properly (Salmon-alevin already has a very clever algorithm to do this in the single cell situation). This would mean we would effectively be preventing UMI-tools from being used with transcript based quantification tools.

IanSudbery commented 2 years ago

As for release schedule: I can't see any reason not to include this script in the next release. We have a set of things we'd like to tidy up before the next release, but its probably not a million miles off.

epruesse commented 2 years ago

Thank you for the great explanation! I wish I had a good, clean idea, but all I can do is commiserate. I remember cursing the SAM/BAM format for not telling me in the primary whether a secondary would be coming up later. You end up keeping everything in memory or doing two passes on the SAM/BAM or both.

Technically, special paths for STAR/BWA aren't an issue. You can parametrize the BAM "flavor" and detect proper settings for known tools from @PG. Ugly, but possible and a classic bane of our business... (try parsing newick trees).

One last question, since I probably still don't fully comprehend the complexity: Would you say the above workflow outline is valid, or does this break assumptions RSEM/Salmon make about the presence of secondary alignments in the BAM?

IanSudbery commented 2 years ago

Parsing from @PG is not something I had considered.....

I think it might be possible to output RSEM/Salmon compatible output from STAR aligned files if we knew at the time that is what we were doing. I'll have to think.

Really the best way to do it would be to align to genome, keep only primary alignments, dedup, and then transfer to transcriptome coordinates.

The trouble with the above work flow:

UMI-tools extract QC/trim STAR map coordinate sort UMI-tools dedup name sort / collate UMI-tools prepare-for-rsem Salmon / RSEM

Is that you might end up with a situation where a read is kept at one location, but removed at another.

Also, if read1 points to the "wrong" read, than prepare-for-rsem will output that, rather than the cognate alignment, and then salmon/RSEM will probably treat it as a split alignment (i.e. read1 and read2 will align to different "contigs")

lkw159159 commented 2 years ago

Dear Ian,

when I tried to run prepare-for-rsem, I got 'frozen importlib error

How to correct it? thanks image

IanSudbery commented 2 years ago

Hi,

prepare-for-rsem is not yet in a release version of UMI-tools. To use it, you will need to download the correct branch from the github repository and install from there:

$ wget https://github.com/CGATOxford/UMI-tools/archive/refs/heads/%7BIS%7D_prepare-for-rsem.zip
$ unzip \{IS\}_prepare-for-rsem.zip
$ cd UMI-tools--IS-_prepare-for-rsem/
$ python setup.py install
drpatelh commented 2 years ago

Hi @IanSudbery ! Hope you are well! I was wondering if it would be possible to include the umi_tools prepare-for-rsem command in an official release. We stumbled on this issue today with STAR/Salmon too and your script seems to do the trick before passing the BAM file to Salmon. We use containers for everything in nf-core/rnaseq so once we have an official release then hopefully we can get it through Bioconda and get a Biocontainer for free.

Thanks in advance!

TomSmithCGAT commented 8 months ago

From v1.1.5, prepare-for-rsem is part of umi_tools - See #609 for PR

EvaMarkovic commented 6 months ago

Hi!

Sorry for re-opening this issue. I'm having a problem similar to the ones mentioned above when I try to prepare my bam file for RSEM.

I tried following all the steps but I'm worried it does not work for me properly.

I mapped my reads to the reference using Bowtie2 and then did the following:

  1. samtools sort -o PaMO_CUT_1.transcript_sorted.bam -O BAM --threads 32 --write-index --reference ../PaMO_transcriptome/PaMO_transcriptome.Trinity.fasta PaMO_CUT_1.transcript.bam
  2. umi_tools dedup -I PaMO_CUT_1.transcript_sorted.bam --stdout=PaMO_CUT_1.transcript_sorted_dedup.bam --paired --chimeric-pairs=discard --unpaired-reads=discard
  3. samtools sort -n -o PaMO_CUT_1.transcript_sorted_dedup_sorted.bam -O BAM --threads 32 --reference ../PaMO_transcriptome/PaMO_transcriptome.Trinity.fasta PaMO_CUT_1.transcript_sorted_dedup.bam
  4. umi_tools prepare-for-rsem -I PaMO_CUT_1.transcript_sorted_dedup_sorted.bam --stdout=PaMO_CUT_1_ready_for_rsem.bamumi_tools prepare-for-rsem -I PaMO_CUT_1.transcript_sorted_dedup_sorted.bam --stdout=PaMO_CUT_1_ready_for_rsem.bam

At the fourth step (umi_tools prepare-for-rsem) I've encountered a lot of warnings. There are thousands? of these warnings (I included only the top few):

`# UMI-tools version: 1.1.5`
`# output generated by prepare-for-rsem -I PaMO_CUT_1.transcript_sorted_dedup_sorted.bam --stdout=PaMO_CUT_1_ready_for_rsem.bam`
`# job started at Wed May 15 17:10:33 2024 on c206-1 -- 0f1a9ab9-2ed7-4081-ac2b-ae23e9feba6f`
`# pid: 396981, system: Linux 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Thu Jun 17 07:52:48 UTC 2021 x86_64`
`# compresslevel                           : 6`
`# log2stderr                              : False`
`# loglevel                                : 1`
`# random_seed                             : None`
`# sam                                     : False`
`# short_help                              : None`
`# stderr                                  : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='utf-8'>`
`# stdin                                   : <_io.TextIOWrapper name='PaMO_CUT_1.transcript_sorted_dedup_sorted.bam' mode='r' encoding='UTF-8'>`
`# stdlog                                  : <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>`
`# stdout                                  : <_io.TextIOWrapper name='PaMO_CUT_1_ready_for_rsem.bam' mode='w' encoding='UTF-8'>`
`# tags                                    : UG,BX`
`# timeit_file                             : None`
`# timeit_header                           : None`
`# timeit_name                             : all`
`# tmpdir                                  : None`

2024-05-15 17:10:37,370 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN1431_c0_g1_i1       795 has no mate -- skipped
2024-05-15 17:10:37,371 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN1431_c0_g1_i2       802 has no mate -- skipped
2024-05-15 17:10:37,371 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN16111_c0_g1_i1      25 has no mate -- skipped
2024-05-15 17:10:37,371 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN16111_c0_g1_i2      17 has no mate -- skipped
2024-05-15 17:10:37,371 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN16111_c0_g1_i2      25 has no mate -- skipped
2024-05-15 17:10:37,371 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN16111_c0_g1_i2      35 has no mate -- skipped
2024-05-15 17:10:37,372 WARNING Alignment A00627:685:HKMJFDSX7:1:1107:13856:10363_CTCTCTCTCTCT  403     TRINITY_DN47877_c0_g1_i1      322 has no mate -- skipped
2024-05-15 17:10:46,534 WARNING Alignment A00627:685:HKMJFDSX7:1:1122:2202:29575_AGTCATAATGCA   147     TRINITY_DN748_c0_g1_i8        163 has no mate -- skipped
2024-05-15 17:10:49,188 WARNING Alignment A00627:685:HKMJFDSX7:1:1126:10104:25160_CTCTCTCTCTCT  403     TRINITY_DN38476_c0_g1_i1      58 has no mate -- skipped
2024-05-15 17:10:49,188 WARNING Alignment A00627:685:HKMJFDSX7:1:1126:10104:25160_CTCTCTCTCTCT  403     TRINITY_DN38476_c0_g1_i1      60 has no mate -- skipped
2024-05-15 17:10:49,188 WARNING Alignment A00627:685:HKMJFDSX7:1:1126:10104:25160_CTCTCTCTCTCT  403     TRINITY_DN38476_c0_g1_i2      56 has no mate -- skipped
...

`2024-05-15 18:02:57,883 INFO Total pairs output: 21675865, Pairs skipped - no mates: 2489, Pairs skipped - not read1 or 2: 0`
`# job finished in 3144 seconds at Wed May 15 18:02:57 2024 -- 3135.76  4.84  0.00  0.00 -- 0f1a9ab9-2ed7-4081-ac2b-ae23e9feba6`

I really have no idea why I'm getting so many warnings since it seems you fixed this issue, so I'm suspecting it could be Bowtie2 specific or the fact that I'm using a transcriptome assembled using Trinity. But since I specified chimeric-pairs=discard --unpaired-reads=discard should this be happening?

Thank you so much in advance!

IanSudbery commented 6 months ago

No, I don't see why, with --chimeric-pairs=discard and --unpaired-reads=discard it shouldn't be happening. However, its only 2,489 pairs out of 21,675,865.

I'm also afraid that 3144 seconds (~50 minutes) isn't particularly a long time for this sort of thing.

IanSudbery commented 6 months ago

Would you be able to share a fragment on the name sorted BAM that included some of the reads that prepare-for-rsem is complaining about?

EvaMarkovic commented 6 months ago

Thank you for such a quick response!

Here is the BAM file: https://drive.google.com/file/d/1hYENNxKXbHOYvThBTkv5Jp-PdNrokDoX/view?usp=sharing

In regards to the speed of the process - I was using 32 cores (240 GB of memory) - could this explain it?

IanSudbery commented 6 months ago

@EvaMarkovic UMI-tools is single threaded, so the number of cores you give it shouldn't make any difference.

IanSudbery commented 6 months ago

I don't think that there is a problem.

Here is what happening: The reads being complained about here are all read2s (the flag is either 403 or 147). UMI-tools dedup decides which pairs to keep when it runs across read1 (using the information about the read2 encoded in the read1 alignment). If it decides to keep the pair, it makes a note of the read2 that read1 says is its mate and outputs that when it comes across it.

The cases here are all cases where read1 says that read2 is its mate, but read2 doesn't say that read1 is its mate, it records a different alignment as its mate.

E.g:

A00627:685:HKMJFDSX7:1:1122:2202:29575_AGTCATAATGCA     355     TRINITY_DN748_c0_g1_i8  164     0       67M     =       164     67      TGTTTTTTTTTTATGTACGGTGGCATGTTCTATATAAATCAATCAATAAGTTTTTTTTATTCTGCAA     :FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFF     AS:i:-5 XS:i:-5 XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:0G0T0G19T43T0      YS:i:-5 YT:Z:CP ZW:f:0
A00627:685:HKMJFDSX7:1:1122:2202:29575_AGTCATAATGCA     147     TRINITY_DN748_c0_g1_i8  164     0       67M     =       21      -210    TGTTTTTTTTTTATGTACGGTGGCATGTTCTATATAAATCAATCAATAAGTTTTTTTTATTCTGCAA     FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF     AS:i:-5 XS:i:-5 XN:i:0  XM:i:5  XO:i:0  XG:i:0  NM:i:5  MD:Z:0G0T0G19T43T0      YS:i:-5 YT:Z:CP ZW:f:0

The first read here is the forward alignment of a pair, aligned at base 164 on transcript TRINITY_DN748_c0_g1_i8. It records that its mate is located at base 164 on the same transcript. However, the reverse alignment at 164 on TRINITY_DN748_c0_g1_i8 says that its mate is at base 21 on that transcript, not base 164 - but not such read exists in the file (dedup will have output the second read as the mate of the first read without looking to see if the mate relationship is reciprocal, because many aligners do not guarantee to output reciprocal mate relationships).

This isn't a problem though - when prepare-for-rsem encounters the first read, it will output the second as its mate. When it encounters the second, it can't find its mate, but it doesn't matter, because it has already been output. I'll see if I can think of a way to only raise this warning if it is a genuine problem.

EvaMarkovic commented 6 months ago

Thank you so much for this explanation, your time, and such a quick response!!!

I really appreciate it, the logic behind how umi-tools works makes total sense, and I'm glad that it's a harmless warning.