Shamir-Lab / SCAPP

SCAPP is a plasmid assembly tool. This tool is described in our paper: https://microbiomejournal.biomedcentral.com/articles/10.1186/s40168-021-01068-z
MIT License

Very large number of nodes remain in component #24

Open · jnwoodhouse opened this issue 3 years ago

jnwoodhouse commented 3 years ago

I am analysing some freshwater metagenomes and we really have no idea what to expect when it comes to plasmids (i.e. their sizes).

I am somewhat stupidly stuck on the "first" sample, which takes a very long time to run. What is obvious is that as it goes through what I'm guessing is the "peeling off" process, the time taken to consider each component scales with the number of nodes in the graph.

For instance, initially it reports (quite quickly):

252 nodes remain in component
243 nodes remain in component
217 nodes remain in component
201 nodes remain in component
194 nodes remain in component
182 nodes remain in component
182 nodes remain in component

but then much later (around 48h in) it's displaying:

2355264 nodes remain in component
2355264 nodes remain in component
2355264 nodes remain in component
2355264 nodes remain in component
2355264 nodes remain in component
2355262 nodes remain in component
2355262 nodes remain in component
2355262 nodes remain in component

A single "peeling" process takes around 5 h and you see the number of nodes remaining often is not reduced, only 2 nodes in total.

I'm wondering: is this something that should be discarded (a super high number of nodes)? Is there a setting I'm missing that would stop such graph components from being profiled? Or should I be getting super excited and ready to write a Nature paper??

dpellow commented 3 years ago

Yes, you are correct: SCAPP processes the assembly graph one component at a time, and the processing time for each component increases with the number of nodes it contains.

There is no direct way to get around this (although you could increase the number of threads if you have the compute resources). However, more than 2 million nodes in a single component sounds very large; maybe there is a way to construct a smaller graph (one that better assembles the sequence into fewer but longer nodes). How many reads are in the sample? What assembler did you use to assemble it? What was the value of k?
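For intuition, here is a minimal sketch of this processing pattern (not SCAPP's actual code; the graph construction and per-component work are placeholders). Extra processes only help while several components are still pending, so one huge component ties up a single worker for its entire runtime:

```python
import networkx as nx
from multiprocessing import Pool

def process_component(nodes):
    # Placeholder for the real per-component work (e.g. iterative cycle
    # "peeling"); its cost grows quickly with len(nodes).
    return len(nodes)

def run(graph, processes=4):
    # One task per connected component of the assembly graph.
    components = [list(c) for c in nx.connected_components(graph)]
    with Pool(processes) as pool:
        # A single 2.3M-node component still runs on one worker,
        # no matter how many processes are available.
        return pool.map(process_component, components)
```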

jnwoodhouse commented 3 years ago

The sample has around 100M reads (14.6B bases).

We ran SPAdes in meta mode; the max k should be 77. We have over 1000 samples as part of this project and have already done the assembly and binning for all of them. I was optimistic that we could also look through the samples for plasmids, but obviously if it takes 5+ days per sample we might have a small problem. Currently I am running on a 1.5TB cluster with the option for up to 96 threads.

Related to this, I wonder if you can specify how SCAPP handles memory use. I noticed, when running on a smaller system, that each process takes around 20GB of memory, so if you have more than 10 threads running you start to run out of memory. I worked backwards from this on the larger cluster and limited myself to 65 threads (65 * 20GB = 1.3TB) so that there would be a bit of leeway.
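Written out, that sizing calculation looks like this (assuming the ~20GB-per-process figure I observed, plus some headroom for the OS and the parent process):

```python
total_ram_gb = 1500   # cluster RAM
per_process_gb = 20   # observed peak memory per SCAPP worker
headroom_gb = 200     # slack for the OS and the parent process

max_workers = (total_ram_gb - headroom_gb) // per_process_gb
print(max_workers)    # -> 65
```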

dpellow commented 3 years ago

OK, it sounds like the samples are pretty big and the metagenomes are probably diverse, so the graph will be very large. On large samples we have had success using larger values of maximum k, such as 91 or 127, to construct less fragmented assemblies with longer contigs and fewer nodes in the assembly graph (this depends on the read length as well: longer reads work better with larger k). That might be something you're interested in looking into.
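For illustration, a metaSPAdes run with an expanded k list might look something like this (file names are placeholders; 127 is the largest k SPAdes accepts, and the bigger values generally need reads of ~150bp or longer):

```
spades.py --meta -1 reads_1.fastq.gz -2 reads_2.fastq.gz \
    -k 21,33,55,77,99,127 -o assembly_out
```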

SCAPP is written in Python, and Python's multiprocessing support is admittedly not great... Indeed, some shared data structures are duplicated in each process, so memory usage grows with the number of processes (a 200GB server seems a bit low for working with such large metagenomes).

If using larger k to simplify the fragmented graph isn't enough, then I think the right fix is to modify SCAPP to take a max component size parameter, with any component larger than this broken up into smaller components. You would lose plasmids that cross the boundaries between slices of the original component, but it would allow extremely large components to be processed in a reasonable amount of time, so there is a trade-off.
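As a rough sketch of what I have in mind (hypothetical code, not the final implementation; it uses networkx's community detection, available as louvain_communities since networkx 2.8, and communities that still exceed the cap could be split recursively):

```python
import networkx as nx

def split_large_components(graph, max_size=100_000):
    # Components at or under the cap are kept whole; larger ones are
    # partitioned into communities, each handled as its own subgraph.
    # Plasmid cycles that cross a community boundary are lost -- the
    # trade-off described above.
    pieces = []
    for comp in nx.connected_components(graph):
        if len(comp) <= max_size:
            pieces.append(graph.subgraph(comp).copy())
            continue
        communities = nx.community.louvain_communities(graph.subgraph(comp))
        pieces.extend(graph.subgraph(c).copy() for c in communities)
    return pieces
```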

It would take me a bit of time to get to implementing, testing, and releasing this sort of modification, so whether it will help you depends on how time-sensitive your plasmid assembly is.

jnwoodhouse commented 3 years ago

Hey David,

We are not specifically time-sensitive at this stage. The job is still running and has managed to peel off 14 nodes. I think I will hit my time limit soon and it will end. On this note, while I'm making demands and creating work for you (completely lacking any knowledge of Python): maybe there would be an option to add checkpoints to the workflow so that it can continue using already generated files, e.g. plasclass.out?

I think our options for going back and redoing the assemblies with a higher k value are limited; we are relying on the charity of others. We are not strictly targeting mobile elements with our study, but we have a nice time series, so we were interested in seeing what we could pull out. Like I said, I'm stuck on the first sample, so maybe it's time to work through the other 1000+ samples and see how often this pops up as a problem. Maybe it is not so common.

dpellow commented 3 years ago

OK, I will try to implement this, but it would probably only get done in ~2-3 weeks...

There are options to use the intermediate files; if you need to run again on the same input sample, I would definitely recommend using them to save a lot of time: -b <bam file> instead of the read file options -r1 and -r2, and -pc <plasclass file>. At the stage your run has currently reached, both of these files should be in the intermediate_files subdirectory of your output directory.
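For example, a rerun picking up from those files could look like this (treat the paths as placeholders and check your intermediate_files directory for the actual file names):

```
scapp -g assembly_graph.fastg -o output_dir \
    -b output_dir/intermediate_files/<alignments>.bam \
    -pc output_dir/intermediate_files/plasclass.out
```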

I'd guess that if the other samples are all of similar read depths from similar environments they will run into the same issue.

(Note that SPAdes also has the --restart-from option, which would let you continue running with an expanded set of values for k rather than re-running all the assemblies from the beginning, if that is an option.)
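Something like the following, mirroring the restart example in the SPAdes manual (the original output directory with its intermediate files must still be present):

```
spades.py -o assembly_out --restart-from k77 -k 21,33,55,77,99,127
```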

jnwoodhouse commented 3 years ago

Classic, of course I missed the -pc flag. Sorry.

Regarding the SPAdes restart: unfortunately we had to remove the intermediate files to save disk space.

dpellow commented 2 years ago

@jnwoodhouse I am starting to work on modifying SCAPP to first break up any extremely large graph components into smaller chunks (using the Louvain algorithm).
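The kind of tuning I mean, sketched with networkx's Louvain implementation (hypothetical code; higher resolution values produce more, smaller communities, so sweeping it shows how the chunk sizes respond):

```python
import networkx as nx

def chunk_sizes(component_graph, resolution):
    # Partition one large component and report community sizes,
    # largest first.
    communities = nx.community.louvain_communities(
        component_graph, resolution=resolution, seed=0)
    return sorted((len(c) for c in communities), reverse=True)

# Sweep the resolution and inspect the ten largest chunks each time:
# for r in (0.5, 1.0, 2.0, 4.0):
#     print(r, chunk_sizes(big_component, r)[:10])
```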

I don't have any assembly graphs of the same order of magnitude as the sample you are running SCAPP on. Would you be willing to share the fastg file so I could use it to tune the parameters of the algorithm and get chunks with the right size distribution?