CGATOxford / UMI-tools

Tools for handling Unique Molecular Identifiers in NGS data sets
MIT License

error raised in group command #130

Closed jsolvason closed 7 years ago

jsolvason commented 7 years ago

It may be the way that I prepped the .bam file... but I am receiving this error when executing the group command:

umi_tools group -I in.srt.indx.bam --paired --group-out="out.tsv" --output-bam -S "out.bam"

Traceback (most recent call last):
  File "/Users/josephsolvason/miniconda3/bin/umi_tools", line 11, in <module>
    sys.exit(main())
  File "/Users/josephsolvason/miniconda3/lib/python3.6/site-packages/umi_tools/umi_tools.py", line 46, in main
    module.main(sys.argv)
  File "/Users/josephsolvason/miniconda3/lib/python3.6/site-packages/umi_tools/group.py", line 520, in main
    outfile.write(read, unique_id, top_umi)
TypeError: write() takes exactly one argument (3 given)

Cliffnotes on the data prep up to this point...

  1. cutadapt (quality + adapter trim)
  2. umi_tools extract
  3. cutadapt (removes constant region of a mutagenized library)
  4. bowtie2 (aligns to reference library)
  5. samtools (filter => keep only the alignments that have a mate (FLAG=1))
  6. samtools (sort and index)
  7. umi_tools group
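
Roughly, the commands were along these lines (a sketch from memory: the adapter/constant-region sequences, UMI length, and file names below are placeholders, not my exact calls):

# illustrative sketch -- adapters, UMI pattern and file names are placeholders
cutadapt -q 20 -a ADAPTER_R1 -A ADAPTER_R2 -o r1.trim.fq.gz -p r2.trim.fq.gz r1.fq.gz r2.fq.gz
umi_tools extract --bc-pattern=NNNNNNNNNN -I r1.trim.fq.gz --read2-in=r2.trim.fq.gz -S r1.umi.fq.gz --read2-out=r2.umi.fq.gz
cutadapt -g CONSTANT_REGION -o r1.var.fq.gz -p r2.var.fq.gz r1.umi.fq.gz r2.umi.fq.gz
bowtie2 -x library_index -1 r1.var.fq.gz -2 r2.var.fq.gz -S aligned.sam
samtools view -b -f 1 aligned.sam > paired.bam
samtools sort -o in.srt.indx.bam paired.bam
samtools index in.srt.indx.bam
umi_tools group -I in.srt.indx.bam --paired --group-out="out.tsv" --output-bam -S "out.bam"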

Please let me know if you have any thoughts on this issue. I'd love to get some feedback.

You guys are awesome!

Joe

IanSudbery commented 7 years ago

I think I can see the problem here, but I need to check with @TomSmithCGAT, who wrote the group code.

Tom: Is this call to TwoPassWriter.write rather than AlignmentFile.write left over from before you refactored group?

TomSmithCGAT commented 7 years ago

Correct. Sorry, this is a bug from the refactoring. Paired-end reads are written out immediately, so we don't use TwoPassWriter any more. We need to make the following change; I'll do this on a branch now.

Current:

for read in reads:
    if outfile:
        if options.paired:
            # if paired, we need to supply the tags to
            # add to the paired read
            outfile.write(read, unique_id, top_umi)

        else:
            # Add the 'UG' tag to the read
            read.tags += [('UG', unique_id)]
            read.tags += [(options.umi_group_tag, top_umi)]
            outfile.write(read)

Patch:

for read in reads:
    # Add the 'UG' tag to the read
    read.tags += [('UG', unique_id)]
    read.tags += [(options.umi_group_tag, top_umi)]
    outfile.write(read)

TomSmithCGAT commented 7 years ago

Hi @jsolvason. I've pushed the changes above to the {TS}-Group_patch_write branch. Would you mind installing from this branch to check whether it resolves your error? I can give you instructions for how to install from the branch if you'd like.

jsolvason commented 7 years ago

Thanks for the swift response! Yes let me know how to install the branch and I'll try it out.


TomSmithCGAT commented 7 years ago

If you've got git installed, you can clone this directory:

git clone git@github.com:CGATOxford/UMI-tools.git

checkout the branch:

git fetch
git checkout {TS}-Group_patch_write

and then install with:

python setup.py install

If you don't have git, you should be able to install with pip using:

pip install git+https://github.com/CGATOxford/UMI-tools.git@{TS}-Group_patch_write

jsolvason commented 7 years ago

Not sure if this is common, but I had to uninstall umi_tools in order to install the branch. I am running the program now with no error raised so far. I'll let you know how it goes!

Thanks guys,

Joe

jsolvason commented 7 years ago

Side question... How long (ballpark estimate/range) do you expect the typical dedup or group command to take?

IanSudbery commented 7 years ago

We have benchmark iCLIP data (5nt UMIs, very high duplication levels) and single-cell data (10nt UMIs, less highly duplicated) that take about 1.5 minutes and 2 minutes per chromosome respectively.

We do keep collecting pathological cases from people that take much longer though. Is it taking loads longer for you? We do have a speed improvement in the works.


jsolvason commented 7 years ago

I suspect something is going wrong. It takes on the order of a day, and many times when I come back to the computer it says "Killed: 9". I'm not sure what may be different or odd about this data set that is making it take so long.

I performed UMI analysis on the exercise data successfully, so I do not suspect it is an issue with the way it is installed or my computer.

IanSudbery commented 7 years ago

Yeah, definitely something funny going on.

The best way to troubleshoot this would be for us to have a look at some of the data. I don't know if this is something you would consider? The data could be anonymised if that helps (replace sequences and qualities with *, and add a number that only you know to the mapping positions).

We can profile the code running on your data and see what the issue is.

jsolvason commented 7 years ago

Yes, definitely! Thank you for the help. What is the best way to send you the data? I'd prefer to do it over email or some other direct communication.

IanSudbery commented 7 years ago

I have looked at your file.

It contains around 1.9 million reads, which correspond to 500k unique UMIs. 490k of these are mapped to a single location on a single contig (that's almost 50% of the total possible UMI space for 10mer UMIs).

In order to do the network building, the algorithm has to compare every UMI at one position with every other UMI at that position. This is 2.4*10^11 comparisons, which is why the dedup takes so long. The good news is that we are working on an improvement to the algorithm that might take this down to ~1 million comparisons.
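
To give a feel for why this blows up, here is a rough sketch of that all-against-all comparison in pure Python (illustrative only; the names and details are mine, not the actual UMI-tools implementation):

# Illustrative sketch only -- not the actual UMI-tools code.
# Naive network building: every UMI at a position is compared against
# every other UMI at that position to find neighbours within 1 mismatch.
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def build_adjacency(umis, threshold=1):
    adj = {u: [] for u in umis}
    for i, u in enumerate(umis):
        for v in umis[i + 1:]:   # ~n^2/2 pairs for n UMIs at one position
            if hamming(u, v) <= threshold:
                adj[u].append(v)
                adj[v].append(u)
    return adj

With ~490k UMIs piled up at one position, that inner loop amounts to on the order of 10^11 distance calculations, and every edge it finds has to be held in memory.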

The bad news is that a network that large doesn't just take a lot of time to compute, it takes a lot of memory to store. A fully saturated 10nt network needs to store 1,048,576 nodes with 40 neighbours each. So the speed increase just means that on a 16GB laptop/desktop you are going to run out of memory quicker!

I'm currently running your file on a 100GB cluster session to see how much memory it actually uses.

IanSudbery commented 7 years ago

Hmmm... it ran out of memory even with 100GB, although I'm not quite sure why. My calculations suggest that it should not need that much memory. Unless storing the reads themselves is taking a lot of memory...

jsolvason commented 7 years ago

You all have been such a great help! I was in the middle of writing this email when I got the email that the run had failed.

If the reads were to map to more contigs, would that require less space? This was a test run of a single-member library to validate our sequencing pipeline and data analysis. The subsequent analyses will be with full libraries.

Any other suggestions? When do you think the improved algorithm might be finished?

Best,

Joe


IanSudbery commented 7 years ago

Absolutely, if those same reads were split over all of the contigs you have in that file, the run would be very quick. For example, you have a contig in there with 1,700 reads. It took 12 seconds and used so little memory I couldn't measure it.

jsolvason commented 7 years ago

Ah so umi_tools only makes comparisons of UMIs in reads that map to the same contig?

Therefore, if we had redundant UMIs (say, AAAA) in three individual reads (AAAA-read1, AAAA-read2, AAAA-read3) that map to three different contigs (contig1, contig2, contig3), umi_tools group would output 3 groups even though the reads had the same UMI?

Joe


IanSudbery commented 7 years ago

In fact, unless --per-contig or --per-gene were specified on the command line, reads that mapped to different bases within the same contig would be assigned to different groups even if they had the same UMI.
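
If what you want is one group per contig (i.e. per library member), the per-contig mode should do that for you. Something like this should work for your file (a sketch; the output name is a placeholder, and it's worth checking --help for the options available in your installed version):

# sketch -- output file name is a placeholder
umi_tools dedup -I in.srt.indx.bam --paired --per-contig -S deduped.bam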

Doing an overall group/dedup is not something we currently offer. However, something like this might be offered in the framework that @TomSmithCGAT is working on for demultiplexing reads from different cells in single-cell sequencing.

TomSmithCGAT commented 7 years ago

@jsolvason - When you say 'single member library', do you mean the library preparation was done from a sample with only one sequence? What will the 'full libraries' contain?

jsolvason commented 7 years ago

The full libraries will contain a total of 7,776 or 15,552 members.

jsolvason commented 7 years ago

Hello again,

The full 7,776-member library has been sequenced and I'm analyzing it now with umi_tools dedup. It has taken 1.5 hours so far.

Might you have a prediction for run time if there are 7776 members in the library, 7776 reference sequences, 10nt UMIs, and 1M reads?

Best,

Joe