combogenomics / medusa

A draft genome scaffolder that uses multiple reference genomes in a graph-based approach.
http://combo.dbe.unifi.it/medusa/
GNU General Public License v3.0

Building the network... forever #9

Closed carlos88morais closed 7 years ago

carlos88morais commented 7 years ago

Hi,

I am running medusa to scaffold 15 genomes, with sizes from 5Mb to 12Mb. It went fine for 14 of them, taking a few hours each, but one is stuck in the "Building the network..." phase: it has been running for 555 hours by now, with low memory and CPU usage.

Is there any way to check progress in more detail than the normal output?

Regards, Carlos

EBosi commented 7 years ago

Dear Carlos, we are aware of that issue. Can you provide some information about the assembly you are working on?


carlos88morais commented 7 years ago

Dear Emanuele,

The data is from Methylobacterium populi, strain TC3-6, sequenced with Illumina MiSeq. The assembly is a combination of the results from many assemblers, merged with Metassembler (https://sourceforge.net/projects/metassembler/). The references are 10 of the best hits from an NCBI BLAST search. The other 14 assemblies, also Methylobacterium (different strains, same process), finished in a few hours.

Regards, Carlos

EBosi commented 7 years ago

Hi Carlos, I just need to know the number of contigs, N50, etc.


carlos88morais commented 7 years ago

# contigs                      889
# contigs (>= 0 bp)            889
# contigs (>= 1000 bp)         888
# contigs (>= 5000 bp)         583
# contigs (>= 10000 bp)        350
# contigs (>= 25000 bp)        117
# contigs (>= 50000 bp)        35
Largest contig                 236565
Total length                   11696099
Total length (>= 0 bp)         11696099
Total length (>= 1000 bp)      11695439
Total length (>= 5000 bp)      10787615
Total length (>= 10000 bp)     9107237
Total length (>= 25000 bp)     5347029
Total length (>= 50000 bp)     2669255
N50                            22810
N75                            11212
L50                            138
L75                            319
GC (%)                         70.010
Mismatches
# N's                          1935
# N's per 100 kbp              16.54
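
For readers less familiar with the N50 and L50 figures in the report above, here is a minimal sketch of how these two statistics are commonly defined, computed from a plain list of contig lengths. The function name `n50_l50` and the toy lengths are illustrative assumptions, not part of medusa or of the tool that produced the report.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths.

    N50 is the length of the contig at which the running total of the
    contigs, sorted from longest to shortest, first reaches half of the
    total assembly length. L50 is how many contigs that takes.
    """
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


# Toy example (made-up lengths, not the assembly discussed here).
toy_n50, toy_l50 = n50_l50([80, 70, 50, 40, 30, 20])
print(toy_n50, toy_l50)
```

Read this way, the report says that the 138 longest contigs (L50) cover half of the 11.7 Mb assembly, and the shortest of those is 22810 bp long (N50).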

EBosi commented 7 years ago

OK, the problem is that the algorithm scales badly with the number of contigs. You might want to either obtain a better assembly or filter out short contigs before using medusa... let me know if that works out.
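
Filtering out short contigs, as suggested above, could be done with a few lines of plain Python. This is only a sketch under stated assumptions: the function names (`read_fasta`, `filter_by_length`, `write_fasta`) are hypothetical helpers, not part of medusa, and the right length cutoff depends on the assembly.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


def write_fasta(records, handle, width=60):
    """Write (header, sequence) pairs as FASTA, wrapping lines at width."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

A typical use would open the draft assembly, pass `read_fasta(...)` through `filter_by_length(..., 5000)` (the cutoff is an arbitrary example), and write the result to a new file that is then given to medusa as input.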


carlos88morais commented 7 years ago

Dear Emanuele,

As a test, I deleted the smallest contigs until my assembly had only 700, because there are 2 other genomes with more than 750 contigs which ran OK. It has been running since Friday and is also stuck in building the network. That makes me wonder: is it just the number of contigs which scales badly? Or does the genome size also have a considerable effect on processing time? Even with only 700 contigs, the assembly still has around 11M bases.

Best Regards, Carlos Morais
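
The test described above, trimming an assembly down to a fixed number of contigs by dropping the smallest ones, might look like the following sketch. The helper name `keep_longest` is hypothetical, and records are assumed to be `(header, sequence)` tuples already loaded from the FASTA file.

```python
def keep_longest(records, n):
    """Return the n longest (header, sequence) records, in input order."""
    # Rank record indices by sequence length, longest first.
    ranked = sorted(
        range(len(records)),
        key=lambda i: len(records[i][1]),
        reverse=True,
    )
    keep = set(ranked[:n])
    # Preserve the original order of the contigs that survive.
    return [rec for i, rec in enumerate(records) if i in keep]


# Toy example with made-up sequences.
toy = [("a", "AAAA"), ("b", "AA"), ("c", "AAAAAA"), ("d", "A")]
print([header for header, _ in keep_longest(toy, 2)])
```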

EBosi commented 7 years ago

Dear Carlos, I can see it being a bit tedious but, at the moment, there's no easy workaround.

> Is it just the number of contigs which scales badly? Or maybe the genome size also have a considerable effect on processing time?

More than genome size, it is network size. Genomes with a large number of contigs correspond to dense networks, which take more time to analyse. However, there is no strict correlation between these two variables, so a genome with more than 750 contigs might run better than this one.

In conclusion... persevere! I would keep this run going and, in the meantime, reduce the number of contigs even further and start another run. Another option would be to reduce your set of reference genomes. This might lead to a less tangled network, so it could be worth a try.

I hope I've been useful! If you need more advice, don't hesitate to ask. Good luck,
Emanuele
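
Reducing the set of reference genomes can be tried without touching the original files by copying a subset into a fresh directory. The sketch below assumes the references are stored as separate files in one directory; the function name `copy_reference_subset` is hypothetical, and picking the first files by name is only a placeholder for whatever ranking is preferred (for example, BLAST hit order).

```python
import shutil
from pathlib import Path


def copy_reference_subset(src_dir, dst_dir, keep):
    """Copy the first `keep` reference files (sorted by name) to dst_dir.

    Returns the names of the copied files so the choice can be logged.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in src.iterdir() if p.is_file())
    chosen = files[:keep]
    for path in chosen:
        shutil.copy2(path, dst / path.name)
    return [p.name for p in chosen]
```

The new directory can then be used as the reference set for another run, which makes it easy to compare several subsets side by side.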


carlos88morais commented 7 years ago

That worked! Scaffolding the original assembly (with 889 contigs) with the number of references reduced by half took just 10 minutes. Thanks a lot.

EBosi commented 7 years ago

Dear Carlos, I'm happy it did work. Just consider that the obtained result might be suboptimal with respect to a complete reference set. Best Emanuele


carlos88morais commented 7 years ago

Yes, I'm trying with different sets of references to figure out the best result within reasonable time. Thanks again.