Tigmint performance on large genome

bcgsc / tigmint

⛓ Correct misassemblies using linked AND long reads

https://bcgsc.github.io/tigmint/

GNU General Public License v3.0

Tigmint performance on large genome #33

Closed andreaswallberg closed 4 years ago

andreaswallberg commented 4 years ago

Dear developers,

I am trying to analyze a very large (genome size=18Gbp) and fragmented assembly (~500k contigs, N50=50kbp, assembly size=22Gbp, due to excess of haplotigs) with tigmint.

We have four libraries of 10x Chromium linked reads (from a single individual), totaling ~30x coverage across this genome, and have preprocessed the data with longranger and tigmint-molecule. I ran tigmint-cut on this data on 14 cores for one week at full CPU usage, but it never printed anything to disk or to the screen beyond the "Finding breakpoints..." message.

Admittedly, this use case and assembly are somewhat extreme, but I'd still appreciate some feedback. Is it possible that the program stalled for some technical reason (e.g. waiting for a missing dependency program or expected output)? To your knowledge, has it been used for very large and/or repetitive genomes before? If so, how did it perform?

Is it possible to add debug messages that may reveal a performance bottleneck or a particular step in the process that may need some tweaking? Any other constructive tips?

lcoombe commented 4 years ago

Hi @andreaswallberg,

I have run Tigmint on conifer (~20Gbp) genome assemblies of similar contiguity in the past, so we know that Tigmint does scale to large and repetitive genomes. I find that the tigmint-cut stage is relatively fast (taking ~2-3h), so I wonder if something has perhaps gone wrong in the pipeline.

First of all, what version of Tigmint are you using? Did you concatenate all the linked reads from the different libraries together into a single file, or did you run the alignments in parallel? In addition, I would update the read group of reads from different libraries so that Tigmint recognizes them as unique (i.e. leave library 1 as is, change library 2 to BX:Z:<barcode>-2, etc.).
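To illustrate the barcode relabeling suggested above, here is a minimal Python sketch. It assumes each read line (a FASTQ header or a SAM record) carries the barcode as `BX:Z:<barcode>` with an optional numeric suffix such as `-1`. The function name `tag_library` and the pattern are hypothetical and are not part of Tigmint or longranger.

```python
import re

# Assumption: barcodes consist of A/C/G/T/N and may already end in "-<digits>".
BX_TAG = re.compile(r"(BX:Z:[ACGTN]+)(?:-\d+)?")

def tag_library(line, library_index):
    """Replace (or add) the numeric suffix of the BX:Z: tag with library_index.

    Lines that contain no BX:Z: tag are returned unchanged.
    """
    return BX_TAG.sub(lambda m: f"{m.group(1)}-{library_index}", line)

if __name__ == "__main__":
    header = "@read1 BX:Z:ACGTACGTACGTACGT-1"
    # Relabel a read from the second library so its barcode is distinct.
    print(tag_library(header, 2))
```

With suffixes applied per library, identical barcode sequences from different libraries no longer collide once the files are concatenated.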

Are you sure that the tigmint-molecule BED file is sorted properly? (ie. by chromosome, then start, then end) I've seen tigmint-cut stall or be quite slow in the past when that sorting wasn't correct.
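A quick way to test the sort order described above (chromosome, then start, then end) is sketched below in Python. It assumes a tab-separated BED file whose first three columns are name, start and end, and it compares sequence names as plain strings. The helper names are hypothetical and not part of Tigmint.

```python
def bed_key(line):
    """Return the (chrom, start, end) sort key for one tab-separated BED line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], int(fields[1]), int(fields[2])

def first_unsorted_line(lines):
    """Return the 1-based number of the first record that breaks the order, or None."""
    previous = None
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        key = bed_key(line)
        if previous is not None and key < previous:
            return number
        previous = key
    return None

def sort_bed(lines):
    """Return the records sorted by chromosome, then start, then end (in memory)."""
    records = [l for l in lines if l.strip() and not l.startswith("#")]
    return sorted(records, key=bed_key)

if __name__ == "__main__":
    example = [
        "ctg2\t100\t200\tACGT-1\t5\n",
        "ctg1\t500\t900\tTTGA-2\t3\n",
        "ctg1\t50\t400\tGGCA-1\t7\n",
    ]
    print(first_unsorted_line(example))            # line number of the first violation
    print(first_unsorted_line(sort_bed(example)))  # None once sorted
```

The check streams through the file, so it suits very large inputs; `sort_bed` holds every record in memory, so an external sort is likely a better fit for a molecule file of this size.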

Thank you for your interest in Tigmint! Lauren

andreaswallberg commented 4 years ago

Hi @lcoombe,

Thanks for the feedback. After I annotated the barcodes (-1, -2, -3, -4) I concatenated everything together and ran tigmint-molecule and tigmint-cut on everything.

Actually, you were right about sorting the BED output from tigmint-molecule. I had not done it properly, and after reading the manual and sorting it exactly as specified, the job completed in 2hrs16min, with a nice mix of split and non-split contigs in the BED file.

You can close the issue.

lcoombe commented 4 years ago

Great - I'm glad you got it working!