CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets

Problems with UMI dedup and time when aligned against transcriptome #539

Closed lcabus-flomics closed 2 years ago

lcabus-flomics commented 2 years ago

Hello,

I'm running a pipeline that uses UMI dedup. However, we see that deduplicating the alignment against the transcriptome takes a huge amount of time (it can go up to more than 24 h). We did not see this problem when aligning against the genome. We have also seen that the program reads a huge amount of data (around 4 TB), which we also didn't see when deduplicating the alignment against the genome, where it read only around 30 GB.

In our case we have a 12 bp UMI at the beginning of the first read (we extracted it with --bc-pattern=NNNNNNNNNNNN), the FASTQ files have 15M paired-end reads, we align with STAR, and the resulting BAM files weigh around 400 MB (around 50% of reads align).
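
The extraction step looks roughly like this (file names here are simplified placeholders, not our actual ones):

umi_tools extract --bc-pattern=NNNNNNNNNNNN -I read1.fastq.gz --read2-in=read2.fastq.gz -S read1.extracted.fastq.gz --read2-out=read2.extracted.fastq.gz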

We prepared the libraries with very low concentrations of RNA and ran many PCR cycles, so we expect the samples to be highly duplicated, which I suppose could increase the running time.

Which parameters could I change to improve the speed of the process?

Thank you very much, Lluc

IanSudbery commented 2 years ago

That's very odd. Can you let me know the full command lines you used for both the genome- and transcriptome-based dedup?

lcabus-flomics commented 2 years ago

Both use the same command line, which is very standard: umi_tools dedup -I $bam -S ${prefix}.bam --paired.

IanSudbery commented 2 years ago

I wonder if it's something to do with pairing. Try setting --chimeric-pairs=discard.
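
For example, added to the command you already use (assuming a UMI-tools version recent enough to have the option):

umi_tools dedup -I $bam -S ${prefix}.bam --paired --chimeric-pairs=discard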

There are three main things that take time/memory in UMItools:

  1. Single locations with a very large number of UMIs. Once you get past about 30% UMI saturation at a single location, lots of the assumptions about the algorithm break down. But with 15M reads and 12bp UMIs, I would not expect this to be the problem. And even if it were, I would not expect it to differ between alignment to genome and alignment to transcriptome.
  2. Generating stats (which you are not doing).
  3. Storing a large number of reads in memory while their mate is found. Alignment to the transcriptome might be expected to produce a lot of chimeric read pairs (pairs where the two mates align to different contigs).
lcabus-flomics commented 2 years ago

Hi Ian,

I have tried it, but it doesn't seem to improve the performance much. I added --chimeric-pairs=discard --unpaired-reads=discard just in case, but I launched the job hours ago and it is still running.

IanSudbery commented 2 years ago

What happens if you run without --paired? This will produce an invalid output, but it might tell us if the pairing is the problem.

lcabus-flomics commented 2 years ago

Okay, I will try

lcabus-flomics commented 2 years ago

Okay, it worked flawlessly! I don't understand why this works, though. I gave STAR 2 FASTQ files for the alignment, so shouldn't I use --paired?

IanSudbery commented 2 years ago

Yes. For a valid output, you need to use --paired.

When UMI-tools encounters a read1 it wants to keep, it adds its details to a buffer. When it encounters a read2, it checks whether its mate is in the buffer and, if so, outputs it. If it gets to the end of the contig without finding the read2, it goes back to the beginning of the contig and scans it again to catch read2s that lie 5' of their read1.

All this buffer stores is the read name, contig and location, so it's not much information, but the buffer must be getting so full that it is using all your memory. That shouldn't happen if you have --chimeric-pairs=discard, as the buffer should be empty after it's finished dealing with a particular contig (it doesn't allow reads whose mate is on a different contig). But it seems that somehow it is.
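
In pseudo-code, the pairing scheme looks roughly like this (a simplified sketch, not the actual UMI-tools code; the function and variable names are invented):

import pysam

def filter_pairs(in_path, out_path, kept_read1_names):
    # Simplified sketch of the mate-finding buffer described above.
    # Assumes a coordinate-sorted, indexed BAM so a contig can be rescanned;
    # kept_read1_names is the set of read names whose read1 survived dedup.
    bam = pysam.AlignmentFile(in_path, "rb")
    out = pysam.AlignmentFile(out_path, "wb", template=bam)
    for contig in bam.references:
        waiting = set()  # read names whose read2 we still need to output
        for read in bam.fetch(contig):
            if read.is_read1 and read.query_name in kept_read1_names:
                out.write(read)
                waiting.add(read.query_name)
            elif read.is_read2 and read.query_name in waiting:
                out.write(read)
                waiting.discard(read.query_name)
        if waiting:
            # Rescan the contig to catch read2s that lie 5' of their read1.
            for read in bam.fetch(contig):
                if read.is_read2 and read.query_name in waiting:
                    out.write(read)
                    waiting.discard(read.query_name)
    out.close()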

Perhaps you have one contig that is really, really densely covered, and pairs tend to be a long way apart? Is it possible that you have a transcript sequence in your transcript reference that is masked out in the genome reference? Repeat sequences or rRNA sequences, for example?
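
One quick way to check (assuming the transcriptome BAM is indexed) would be something like:

samtools idxstats transcriptome.sorted.bam | sort -k3,3nr | head

which lists the contigs carrying the most mapped reads.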

It depends how deep you want to go on this. I can suggest several ways forward:

lcabus-flomics commented 2 years ago

I really want to go deep into this, since all of the samples from my project are like that. I will try using the genome alignment, since it seems the best way to deal with this problem, and I will get back to you. If you want, I can send you the BAM files in case you want to inspect them and figure out what the problem is.

lcabus-flomics commented 2 years ago

Hi Ian,

I'm using mudskipper to convert the genome alignment into a transcriptome alignment. However, this is very difficult to implement, since we are using a public pipeline that is constantly being updated, and it is hard to add this program in the middle of it.

I am trying to work out what the problem could be, and if I find it I will let you know. In the meantime, I will leave here an example BAM file aligned against the transcriptome that is giving issues, and one aligned against the genome that works well, in case you want to analyze them.

https://flomics-public.s3.eu-west-1.amazonaws.com/lluc/umitools_issue/test_sample_5M.Aligned.toGenome.sorted.bam
https://flomics-public.s3.eu-west-1.amazonaws.com/lluc/umitools_issue/test_sample_5M.Aligned.toGenome.sorted.bam.bai
https://flomics-public.s3.eu-west-1.amazonaws.com/lluc/umitools_issue/test_sample_5M.Aligned.toTranscriptome.sorted.bam
https://flomics-public.s3.eu-west-1.amazonaws.com/lluc/umitools_issue/test_sample_5M.Aligned.toTranscriptome.sorted.bam.bai

Thank you very much! Lluc

julienlag commented 2 years ago

Dear Ian, colleague of @lcabus-flomics here. Thank you very much for your help! We're still running tests here to troubleshoot the problem. One thing we've noticed with one of the problematic BAM files is that no contig is particularly complex: the most "read-rich" contig/transcript contains only ~2,500 reads, and it's processed smoothly by umi_tools in a few seconds. Could it be that the long running time is just due to the huge number of contigs we have in the transcriptome BAM file (>140,000)? The issue is that some of our samples contain a high proportion of contaminating genomic reads, which leads to most annotated transcripts being covered by at least a few reads, hence the large number of contigs/transcripts in the input BAM file. My hypothesis is that loading each contig into memory is computationally costly for umi_tools due to some kind of overhead, so the whole process becomes very slow (but linearly so) when fed a very large number of contigs. I hope I'm being clear. What do you think?

IanSudbery commented 2 years ago

I think you are almost there: I think it is the large number of contigs. I ran the files shared with me here. The genome-aligned file took 10 minutes and used 1 GB of RAM; the transcriptome-aligned file took 5 hours and used 400 MB of RAM. I'll profile the run today, but my guess is that the extra time is due not to loading each contig into memory (which shouldn't happen), but rather to the act of seeking backwards in the file to find the start of the contig when we look for mates.

However, I'm not seeing it take >24hrs. I had originally misread your issue and thought that you said it was using 4TB of memory, not reading 4TB of data from the disk.

Can I just check which version you are using?


julienlag commented 2 years ago

Thanks Ian, looking forward to your profiling results! For my tests I've used v1.0.1 (the latest from bioconda). In production we use the version included in the latest release of nf-core/rnaseq, which we don't really have control over, as far as I understand.

IanSudbery commented 2 years ago

Hi Julien

The latest version on bioconda should be v1.1.2 (https://anaconda.org/bioconda/umi_tools). v1.1.0 included a change that might be relevant here. I wonder what is keeping conda from installing that version?

julienlag commented 2 years ago

Hi Ian, we tried to update and reinstall the conda package and whatever we do, we're stuck with umi_tools-1.0.1. I really have no idea what prevents conda from installing the latest version

julienlag commented 2 years ago

I also tried explicitly specifying the version, and hit a system incompatibility:

conda create --name umi-tools-TEST_env -c bioconda umi_tools=1.1.2

Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: - 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed                                                                                                                

UnsatisfiableError: The following specifications were found to be incompatible with your system:

- feature:/linux-64::__glibc==2.31=0
- umi_tools=1.1.2 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17']

Your installed version is: 2.31

Using Ubuntu 20.04.4 LTS

IanSudbery commented 2 years ago

Hi Julien,

There is no reason that UMI-tools should need a particular version of libgcc; this is imposed by bioconda, who build all their packages against a specific glibc version. I think there is a way around it, but I can't remember it; you'll have to ask over on the bioconda support channels. As an aside, I think this means you won't be able to install anything from the latest build of bioconda.

Alternatively, you should be able to download the code from GitHub and run python setup.py install, as long as all the dependencies are installed. Or you could try using pip.
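
Something along these lines should work (untested here; it assumes the dependencies are already available):

git clone https://github.com/CGATOxford/UMI-tools.git
cd UMI-tools
python setup.py install

or, straight from PyPI:

pip install umi_tools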

julienlag commented 2 years ago

Thanks Ian for your help. In any case, @lcabus-flomics has confirmed to me that our production environment (nf-core/rnaseq) uses version 1.1.2, which means the version evidently does not make much difference in terms of running time...

julienlag commented 2 years ago

Hi @IanSudbery, we're looking into a way around this issue. You mentioned mudskipper, which we tried; however, we quickly encountered this bug. It looks to us like mudskipper is not ready for production use. What's your take on this?

Another, more quick-and-dirty option, would be to:

  1. Produce both genome.bam and transcriptome.bam with STAR
  2. Run umi_tools dedup on genome.bam, output to genome.dedup.bam
  3. Extract the list of deduplicated read IDs from genome.dedup.bam, something like: samtools view genome.dedup.bam | cut -f1 | sort | uniq > genome.dedup.readIDs.txt
  4. Use this list of read IDs to select the corresponding records in transcriptome.bam, as in: samtools view transcriptome.bam | grep -F -f genome.dedup.readIDs.txt > transcriptome.dedup.sam (grep may be too slow here, and we may want to use join instead, but you get the idea)

What do you think of this approach? I'm pretty sure there's a flaw in there, especially when dealing with multi-mapped reads...

IanSudbery commented 2 years ago

Okay, I've a couple of things to say.

My profiling revealed that it was indeed the moving backwards in the file to rescan each contig that was slowing the processing down. This was because, in order not to lose its place, pysam opens a new handle on the file every time it does this. I have managed to change the code to avoid this, and the run time for transcriptome-aligned data is now close to that for genome-aligned data. You can find/try the code on the #543 PR.

However, since you are using umi_tools in production, it's going to take some time for this to find its way into a release that nf-core will use.

Looking at your plan above, I think it is the way to go, not only because of the speed, but also because of multiple mapping. When you align to the transcriptome, a read will align to many different transcripts, and each alignment position will be treated independently. Thus it is possible that a given read will be chosen on one transcript but not on another. However, this seems wrong: a read is either a duplicate or it is not. Deduplicating on the genome is therefore definitely superior to deduplicating on the transcriptome, and your approach gives you the best of both worlds.

I have a suggestion for doing it quicker, though. If you name-sort your transcriptome.bam and also sort your genome.dedup.readIDs.txt (so that the two are in the same read-name order), then something like the following should be faster:

import pysam

# Both inputs must be sorted by read name, and in the same order, for this
# streaming join to work.
names_to_keep = open("genome.dedup.readIDs.txt")
transcriptome_bam = pysam.AlignmentFile("transcriptome.bam", "rb")
out_bam = pysam.AlignmentFile("transcriptome.dedup.bam", "wb",
                              template=transcriptome_bam)

current_name = next(names_to_keep, None)
current_read = next(transcriptome_bam, None)

while current_name is not None and current_read is not None:

    name = current_name.strip()

    if current_read.query_name == name:
        # Keep this record, but don't advance the name yet, so that both
        # mates and any multi-mapped alignments sharing this name are kept.
        out_bam.write(current_read)
        current_read = next(transcriptome_bam, None)
    elif current_read.query_name < name:
        # This read was removed by the genome-level dedup: skip it.
        current_read = next(transcriptome_bam, None)
    else:
        # We have passed all records for this name: move to the next one.
        current_name = next(names_to_keep, None)

out_bam.close()

julienlag commented 2 years ago

Sounds great Ian, thank you so much for your remarkable responsiveness. We'll see if we can implement this workaround until nf-core includes your fix in a release.