Closed — jsolvason closed this issue 7 years ago
I think I can see the problem here, but I need to check with @TomSmithCGAT who wrote the group code.
Tom: Is this call to TwoPassWriter.write rather than AlignmentFile.write left over from before you refactored group?
Correct. Sorry, this is a bug from the refactoring. Paired end reads are written out immediately so we don't use TwoPassWriter any more. We need to make the following change. I'll do this on a branch now
Current:

```python
for read in reads:
    if outfile:
        if options.paired:
            # if paired, we need to supply the tags to
            # add to the paired read
            outfile.write(read, unique_id, top_umi)
        else:
            # Add the 'UG' tag to the read
            read.tags += [('UG', unique_id)]
            read.tags += [(options.umi_group_tag, top_umi)]
            outfile.write(read)
```
Patch:

```python
for read in reads:
    # Add the 'UG' tag to the read
    read.tags += [('UG', unique_id)]
    read.tags += [(options.umi_group_tag, top_umi)]
    outfile.write(read)
```
Hi @jsolvason. I've pushed the changes above to the {TS}-Group_patch_write branch. Would you mind installing from this branch to check whether it resolves your error? I can give you instructions for how to install from the branch if you would like.
Thanks for the swift response! Yes let me know how to install the branch and I'll try it out.
Joseph Solvason Chimera Bioengineering
If you've got git installed, you can clone the repository and move into it:

```shell
git clone git@github.com:CGATOxford/UMI-tools.git
cd UMI-tools
```

check out the branch:

```shell
git fetch
git checkout {TS}-Group_patch_write
```

and then install with:

```shell
python setup.py install
```

If you don't have git, you should be able to install with pip using:

```shell
pip install git+https://github.com/CGATOxford/UMI-tools.git@{TS}-Group_patch_write
```
Not sure if this is common but I had to uninstall umi_tools in order to install the branch. I am running the program now with no error raised so far. I'll let you know how it goes!
Thanks guys,
Joe
Side question... How long (ballpark estimate/range) do you expect the typical dedup or group command to take?
We have benchmark iCLIP (5nt UMIs, very high duplication levels) and single-cell (10nt UMIs, less highly duplicated) datasets that take about 1.5 minutes and 2 minutes per chromosome respectively.
We do keep collecting pathological cases from people that take much longer, though. Is it taking loads longer for you? We do have a speed improvement in the works.
I suspect something is going wrong. It will take on the order of a day and many times when I come back to the computer it will say "Killed: 9". I'm not sure what may be different or odd about this data set that is making it take so long.
I performed umi analysis on the exercise data successfully, so I do not suspect it is an issue with the way it is installed or my computer.
Yeah, definitely something funny going on.
The best way to troubleshoot this would be for us to have a look at some of the data. I don't know if this is something you would consider? The data could be anonymised if that helps (replace sequences and qualities with *, and add a number that only you know to the mapping positions).
We can profile the code running on your data and see what the issue is.
Yes, definitely! Thank you for the help. What is the best way to send you the data? I'd prefer to do it over email or some other direct communication.
I have looked at your file.
It contains around 1.9 million reads, which correspond to 500k unique UMIs. 490k of these are mapped to a single location on a single contig (that's almost 50% of the total possible UMI space for 10mer UMIs).
In order to do the network building the algorithm has to compare every UMI at one position with every other UMI at that position. This is 2.4*10^11 comparisons. This is why the dedup takes so long. The good news is that we are working on an improvement to the algorithm that might take this down to ~1 million comparisons.
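To illustrate the scale, here is a back-of-the-envelope sketch (not the actual UMI-tools implementation) of what an all-vs-all network build over the UMIs at one position involves: every UMI is compared against every other by edit distance, so the number of comparisons grows quadratically with the UMI count.

```python
from itertools import combinations

def hamming(a, b):
    """Number of mismatched positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def build_edges(umis, threshold=1):
    """Naive all-vs-all network build: connect any two UMIs within the
    edit-distance threshold. O(n^2) comparisons over n UMIs."""
    return [(a, b) for a, b in combinations(umis, 2)
            if hamming(a, b) <= threshold]

# Toy example: only the first pair is within one mismatch
edges = build_edges(["AAAAAAAAAA", "AAAAAAAAAT", "GGGGGGGGGG"])
print(len(edges))  # → 1

# 490,000 UMIs at a single position, comparing each against each:
n = 490_000
comparisons = n * n
print(f"{comparisons:.2e}")  # → 2.40e+11, the figure quoted above
```

Comparing each unordered pair only once would halve that count, but it stays on the order of 10^11 either way, which is why a single saturated position dominates the runtime.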
The bad news is that a network that large doesn't just take a lot of time to compute, it takes a lot of memory to store. A fully saturated 10nt network needs to store 1,048,576 nodes with 40 neighbours each. So the speed increase just means that on a 16GB laptop/desktop, you are going to run out of memory quicker!
I'm currently running your file on a 100GB cluster session to see how much memory it actually uses.
Hmmm ... it ran out of memory even with 100GB. Although I'm not quite sure why. My calculations suggest that it should not need that much memory. Unless storing the reads themselves is taking a lot of memory....
You all have been such a great help! I was in the middle of writing this email when I got the email that the run had failed.
If the reads were to map to more contigs, would that require less space? This was a test run of a single member library to validate our sequencing pipeline and data analysis. The subsequent analyses will be with full libraries.
Any other suggestions? When do you think the improved algorithm might be finished?
Best,
Joe
Absolutely, if those same reads were split over all of the contigs you have in that file, the run would be very quick. For example, you have a contig in there with 1,700 reads. It took 12 seconds and used so little memory I couldn't measure it.
Ah so umi_tools only makes comparisons of UMIs in reads that map to the same contig?
Therefore, if we had redundant UMIs (say, AAAA) in three individual reads (AAAA-read1, AAAA-read2, AAAA-read3) that map to three different contigs (contig1, contig2, contig3), umi_tools group would output 3 groups even though the reads had the same UMI?
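A minimal sketch of that behaviour (hypothetical data, not the UMI-tools internals): if grouping is keyed on the mapping coordinates as well as the UMI, the same UMI sequence on different contigs lands in different groups.

```python
from collections import defaultdict

# (contig, position, UMI) for three reads sharing the UMI 'AAAA'
reads = [
    ("contig1", 100, "AAAA"),  # AAAA-read1
    ("contig2", 100, "AAAA"),  # AAAA-read2
    ("contig3", 100, "AAAA"),  # AAAA-read3
]

# Group per (contig, position, UMI): identical UMIs on different
# contigs never end up in the same group
groups = defaultdict(list)
for contig, pos, umi in reads:
    groups[(contig, pos, umi)].append((contig, pos, umi))

print(len(groups))  # → 3
```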
Joe
In fact, unless --per-contig or --per-gene were specified on the command line, reads that mapped to different bases within the same contig would be assigned to different groups even if they had the same UMI.
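For example (command sketches; check `umi_tools group --help` for the exact options available in your version):

```shell
# Default: reads must share a mapping position (and UMI) to be grouped
umi_tools group -I in.bam --group-out=groups.tsv --output-bam -S out.bam

# Collapse per contig instead of per base position
umi_tools group -I in.bam --per-contig --group-out=groups.tsv --output-bam -S out.bam

# Collapse per gene (requires gene assignments, e.g. via a gene tag)
umi_tools group -I in.bam --per-gene --gene-tag=XT --group-out=groups.tsv --output-bam -S out.bam
```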
Doing an overall group/dedup is not something we currently offer. However, something like this might be offered in the framework that @TomSmithCGAT is working on for demultiplexing reads from different cells in single cell seq.
@jsolvason - When you say 'single member library', do you mean the library preparation was done from a sample with only one sequence? What will the 'full libraries' contain?
The full libraries will contain a total of 7,776 or 15,552 members
Hello again,
The full 7776-member library has been sequenced and I'm analyzing it now with umi_tools dedup. It has taken 1.5 hours so far.
Might you have a prediction for run time if there are 7776 members in the library, 7776 reference sequences, 10nt UMIs, and 1M reads?
Best,
Joe
It may be the way that I prepped the .bam file... but I am receiving this error when executing the group command:

```shell
umi_tools group -I in.srt.indx.bam --paired --group-out="out.tsv" --output-bam -S "out.bam"
```

Cliff notes on the data prep up to this point...
Please let me know if you have any thoughts on this issue. I'd love to get some feedback.
You guys are awesome!
Joe