brettChapman opened this issue 4 years ago
Hi Brett,
What -n did you use, and how many cores did you give to edyeet? (I typically run -n less than the number of input genomes.) That seems longer than I would have expected, although I would expect it to take a while there. I need to add output that estimates time to completion.
How much memory do you have on your system? I've never run out of memory for PAF reading and I have worked with 25 human assemblies at once.
As for breaking the problem apart, that's a very good idea. There are several ways, including graph to graph alignment. I'll be exploring them.
On Mon, Oct 5, 2020, 04:53 Brett Chapman notifications@github.com wrote:
Hi Erik
I've been running Seqwish with the output from minimap2 and edyeet for comparison, across multiple chromosomes and also testing with multiple whole genomes.
I've found the PAF files from minimap2 to be around twice the size of the PAF files from edyeet. I've tried processing the PAF file from minimap2, which is a 19GB PAF file (from an alignment of chromosome 5H from 4 genomes). However, it ends with a bad_alloc error due to inadequate memory. I'm going to try running on a larger machine to see if I can get it to complete. The PAF file from edyeet, on the other hand, can be processed with no problems.
I have also found that running edyeet with -s 50k across 4 whole genomes results in an error (likely a memory issue, as the job was outright killed), but when running with -s 100k it seems to continue the alignment with no problems, thus far (it's still running after 17 days). Regardless, I fear the resulting PAF file will still be too large to convert into a GFA file using the nodes from my cluster.
In regards to processing large PAF files, would it be conceivable to perform the alignment in parts (like we have discussed before), but instead of merging each PAF file and running the concatenated PAF file through Seqwish, we instead run Seqwish on each smaller PAF file to produce several GFA files, run each of those through, say, smoothxg, then merge them all with a tool like GFAKluge (https://github.com/edawson/gfakluge), and finally run smoothxg on the merged GFA at the end. Would breaking up the process into parts right up to producing the final GFA be a good way of reducing memory overhead? I know smoothxg should probably be run on the whole GFA, but would iteratively running smoothxg on several GFAs, and then running it on the final merged smoothed GFA, ultimately reduce the resource overhead for each individual job? Or would this approach add more issues than it resolves? I see that you're one of the developers of GFAKluge, but it hasn't been updated in a while. Is GFAKluge something worth looking into to add to my pangenome workflow, to help reduce memory overhead? Thanks.
Hi Erik
I'm using -n 16, which is all the cores on one of my nodes. Do you suggest I change to -n 2 or -n 4? I'm currently testing with 4 genomes. I do plan to go all the way up to 20 genomes; however, I doubt it will scale well on a single node, hence why I'm looking at different ways to break the problem down and do as much processing across multiple nodes as I possibly can.
On each node I have 64GB RAM.
If you think it would be a good idea, I'll try working with GFAKluge and see how the final output looks. I'll try and rerun edyeet/minimap2 and seqwish with only a few cores and see if I can get them to complete that way first.
A graph to graph alignment sounds very interesting!
Hi Brett,
-t 16 would be 16 threads of processing. -n is the number of secondary alignments to keep. By default, edyeet will keep one best mapping (as in mashmap2 mapping), but setting -n tells it to keep up to that many more. If you have four genomes, you could set -n to 2, which would give 3 mappings for each sequence. If everything were collinear and similar, this would mean that you basically map each sequence to each of the others.
Leaving -t at 1 and running -n 16 might be why you've seen such a long processing time. Is that possible?
As for merging GFAs, gfakluge won't do that. We need a process to map GFAs together; thus far, there is nothing available, but there is interest in making one. I need to review what you wrote to understand it better. I suggest improving the scalability of the mapping step by reducing the number of secondary mappings.
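Concretely, for your four-genome run that might look something like the following (a sketch only; the segment length and identity values are just illustrative):

# 16 threads of processing (-t), up to 2 secondary mappings per segment (-n)
edyeet -X -s 100000 -p 75 -a 75 -n 2 -t 16 all_genomes.fasta all_genomes.fasta > pangenome.paf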
Hi Erik
Sorry, I said -n 16. I meant to say -t 16.
For the job which is taking over 2 weeks, I ran with these parameters:
edyeet -X -s 100000 -p 75 -n 20 -a 75 -k 11 -t 16 all_genomes.fasta all_genomes.fasta > pangenome.paf
Would you suggest I use -t 2 and -n 3 instead? Perhaps the -n value could be a function of the number of genomes (number of genomes minus 1), with -t handled similarly but minus 2, taking the choice of these values away from the user and applying an optimal number automatically?
In regards to GFAKluge, I saw that one of the functions listed on its GitHub page was to merge GFAs, unless the GFAs Seqwish generates are a different version or format?
I think the -k 11 is a potential problem. I would definitely leave that at the default (16). Also, -n 20 seems far too high for just 4 genomes. I would keep that on the order of the number of genomes.
To understand the way things scale, you could run -n 0 for each pair of genomes, getting only the best mapping. Ideally, you'd work your way up from a single pair of chromosomes.
Setting -s higher tends to reduce runtime in the mashmap2 step. There will be fewer potential mappings. The cost of scanning over the target doesn't change much with larger segment sizes.
Did this edyeet job start to write its final output? I definitely need to add logging with expected time to completion. It is relatively easy to do.
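A sketch of that pairwise scheme (genome file names are placeholders):

# best mapping only (-n 0) for each ordered pair of genomes
for A in genome1 genome2 genome3 genome4; do
  for B in genome1 genome2 genome3 genome4; do
    [ "$A" = "$B" ] && continue
    edyeet -X -s 100000 -n 0 -t 16 "$A.fasta" "$B.fasta" > "${A}_vs_${B}.paf"
  done
done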
Thanks for the explanation. I'll keep -n at the total genome count and -k at 16 and see how it runs again. I'll leave the current edyeet job running just to see how long it takes.
I'll also try a larger -s value later to see how the run time changes.
In regards to the -s value when aligning smaller regions, what would be an appropriate value for multiple gene sequence regions? I tried running edyeet on an extracted gene region +/- 1000bp either side, and found that -s larger than the longest sequence length kills the job. I then ran -s with the length of the shortest sequence among all the gene sequences, and it ran fine. However, there were a few gaps in the alignment compared to a minimap2 job I ran as well. Perhaps I could run -s at half the length of the shortest sequence? Since -s 100k for whole genomes/chromosomes is only a fraction of the roughly 600Mbp chromosome length, a much shorter -s value (perhaps in the range of 500 to 1000) may be more appropriate for aligning gene regions (for my example gene I ran -s with a value of around 5000).
The currently running edyeet job still hasn't started writing any final output.
A log with expected time to completion would definitely help a lot with planning and testing out parameters quickly.
You can use edyeet -M and -s set to your smallest input sequence size (or some number lower than that) to get merged mappings for your genes.
If you're aligning shorter sequences, you might also want to look at wfmash. That has some memory limitations for sequences >50kb (things grow quadratically wrt the score of the alignment) but might give you better alignment quality because it's affine gapped.
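For the gene-region case, a sketch (assumes samtools is available; genes.fasta is a stand-in name):

# set -s to the length of the shortest input sequence, and merge mappings with -M
samtools faidx genes.fasta
S=$(cut -f2 genes.fasta.fai | sort -n | head -n 1)
edyeet -X -M -s "$S" -t 4 genes.fasta genes.fasta > genes.paf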
Thanks. I'll try edyeet with -M and also give wfmash a go and see how the results compare.
They both now write estimated time to completion during mapping and alignment.
Thanks! I'll try out the new builds.
In regards to running edyeet with a lower -n value and sticking with -k 16, I just ran edyeet overnight and it failed with 4 genomes:
I ran this:
edyeet -X -s 100000 -p 75 -n 3 -a 75 -k 16 -t 16 all_genomes.fasta all_genomes.fasta > pangenome.paf
The job received this error:
srun: error: node-5: task 0: Killed
srun: Force Terminated job step 2119.1
The job I started over 2 weeks ago using -n 20 and -k 11 is still running. I imagine it will eventually get killed off as well.
I'm guessing it's a memory problem. Could I try any other parameters, perhaps a much larger -s value, or do you think at this stage, with more genomes, the problem is simply inadequate memory, and the only way forward is to split the alignment into 2 genomes per job and merge the PAF files? I've concluded as much with my minimap2 runs. I'm currently testing multiple alignment jobs and concatenation of PAF files with minimap2 now. Each genome is roughly 4Gbp.
It looks like a memory problem.
Try to break it down as you've suggested. First look at memory consumption for a pair of chromosomes that you expect to be homologous. If that's doable, you can write a script to get the PAF for all homologous chromosome pairs. That'd be fine as input unless you want to see translocations.
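A sketch of that per-chromosome scheme (assumes samtools and consistent chromosome naming across assemblies; all names are placeholders):

# build one FASTA per chromosome, prefixing sequence names with the genome of origin
for CHR in chr1H chr2H chr3H chr4H chr5H chr6H chr7H; do
  for G in genome1 genome2 genome3 genome4; do
    samtools faidx "$G.fasta" "$CHR" | sed "s/^>/>${G}./" >> "$CHR.fasta"
  done
  edyeet -X -s 100000 -n 3 -t 16 "$CHR.fasta" "$CHR.fasta" > "$CHR.paf"
done
# seqwish accepts the concatenated sequences and alignments
cat chr*.fasta > all.fasta
cat chr*.paf > all.paf
seqwish -s all.fasta -p all.paf -g all.gfa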
I would also start backing off on the parameters. What level of divergence do you expect between the sequences? Set -p and -a a little below that. Using a shorter segment length is probably not a bad idea.
Ok, thanks, I'll try with lower parameter values. Perhaps -a 50 -p 50 and -s 50000 or -s 75000.
The divergence between varieties is quite variable. Some varieties diverged recently, having been crossed by breeders over the last decade or two, or perhaps more recently, while others are wild-type varieties and may have diverged <10K years ago.
50% divergence is too high. It actually maxes out at 70%, and that's the limit for what we can meaningfully align with this strategy.
If these are all varieties of the same species, then I would actually keep -p and -a higher. Try setting them close to your expected pairwise identity; in this case 90 or 95 might be good. This will make the alignment much faster and dramatically less memory intensive.
Ok, so perhaps -a 90 -p 90 -s 50000 ? I'll submit another job with these values.
Yeah, that or even higher. I would start high and go down. Start at 95 or even 99 (if your sequences are really similar).
I'm running jobs now on chromosome 2, across 4 genomes. p99/a99/s50k, p95/a95/s50k, p90/a90/s50k, p90/a90/s100k.
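For the record, the sweep is roughly (chr2H.fasta standing in for the 4-genome chromosome 2H FASTA):

for P in 99 95 90; do
  edyeet -X -s 50000 -p "$P" -a "$P" -n 3 -k 16 -t 16 chr2H.fasta chr2H.fasta > "chr2H.p${P}a${P}.s50k.paf"
done
edyeet -X -s 100000 -p 90 -a 90 -n 3 -k 16 -t 16 chr2H.fasta chr2H.fasta > chr2H.p90a90.s100k.paf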
I have a feeling 90% will be the sweet spot. I've just been looking closely at some of the gene regions I've aligned, and some are weird because I pulled in too many genes from other regions in the MMseqs2 step, which had a minimum identity of 70% and coverage of 80%. Looking at the results, if I filter at 90% identity, the number of mismatches drops from around 65 right down to the 10s, and the gene numbers are now 1 per chromosome, although sometimes I do get duplicate genes on the same chromosome in some cases, even with high percent id thresholds. I'm now running my pangene pipeline again with 90% identity and 80% coverage. I'll check the gene alignments again and see if they all look better. Looking at the gene regions of several genes is probably a good indicator of what thresholds to apply at the genome level.
Hi Erik
I now have some results from the chromosome 2H alignment across 4 genomes. I ran smoothxg with spoa (because abPOA is buggy). I then ran odgi stats, view, etc.
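Per alignment, the steps were roughly (a sketch; file names are placeholders, I'm not showing the spoa/abPOA switch, and the exact smoothxg invocation may differ a bit):

seqwish -t 16 -s chr2H.fasta -p chr2H.paf -g chr2H.gfa
smoothxg -t 16 -g chr2H.gfa > chr2H.smooth.gfa
odgi build -g chr2H.smooth.gfa -o chr2H.smooth.og
odgi stats -i chr2H.smooth.og -S -C -l -s -P > chr2H.smooth.stats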
Here is a summary of the stats (sum in node space and sum in nucleotide space, obtained from odgi stats):
p99/a99/s50k alignment:
p95/a95/s50k alignment:
p90/a90/s50k alignment:
p90/a90/s100k alignment:
Here is a tree of the 20 genomes, produced from mashtree. You can see that Morex is closer to Igri than the two others, which is also apparent from the ODGI stats based on the small differences in node/nucleotide space between them.
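(For reference, the tree came from something like the following; the thread count and file names are placeholders.)

mashtree --numcpus 16 *.fasta > tree.dnd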
Based on these results, I'm leaning towards p90/a90/s50K, as I'd expect the nucleotide/node space to be more similar between Morex and Igri. The alignments from p90/a90/s50K also look less fragmented when compared to the other alignments.
What are your thoughts? Do you think that is convincing evidence?
Thanks.
Hi Brett,
thank you for your tests and your always detailed posts.
You are using odgi stats -l/-s to generate those tables, right? It is interesting how you are using them to look at the genomes' relationships. However, those metrics were designed to evaluate the goodness of the graph sorting (in 1D; we are also developing them for 2D evaluation), so they may not be the best way to judge what you are after. I am thinking of the case where the genomes are related, but for some stochastic reason their sortings have gone slightly different ways.
Hi Andrea
Thanks.
Yeah, I used -l and -s with odgi stats:
odgi stats -t 16 -i output.og -S -C -l -s -P > output.stats
How would you suggest I determine the most appropriate edyeet alignment parameters to use? All my varieties are of the same species, some having diverged recently, over the last 10, 20, or 50 years (developed by breeders), and other wild-type varieties having diverged <10K years ago, around the start of agriculture.
Hard to say, but I would have followed your same strategy, that is, to evaluate which result best reflects the evolutionary relationships that are already known a priori, or that are expected to be observed. A metric would therefore be needed to assess path similarity, such as shared nodes, or a more refined phylogenetic analysis. However, keeping an eye on those metrics (odgi stats -l and -s) is a good thing, and you can also improve them by sorting the obtained graphs downstream with odgi sort.
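A minimal sketch of that (basic flags only; odgi sort has more specialized options not shown here):

odgi sort -t 16 -i output.og -o output.sorted.og
odgi stats -t 16 -i output.sorted.og -S -C -l -s -P > output.sorted.stats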
Thanks for the advice. I've now run it through odgi sort:
p99/a99/s50k alignment:
p95/a95/s50k alignment:
p90/a90/s50k alignment:
p90/a90/s100k alignment:
I also redid the odgi stats, and the node space and nucleotide space no longer correlate with the evolutionary relationships between varieties. How can I make good use of odgi stats? Or is it mainly useful for testing optimal parameters for graph building?
I saw that odgi view can convert back to GFA again. Is that something worth doing after sorting? Would it improve the graph analysis and visualisation downstream with VG?
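The conversion I have in mind would be something like (file names are placeholders):

odgi view -i output.sorted.og -g > output.sorted.gfa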
In regards to the best edyeet parameters, I think I'll just have to go on how good the alignment looks. I've tested with smaller gene regions as well, and for genes without any huge indels, using -a 90 and -p 90 works well. As for the segment length for small gene regions of several hundred to a few thousand bp, I'm trialling -s 500. For whole chromosomes, I'm thinking -s 50k, as the alignments using -s 100k tend to be more fragmented. I'll try -s 10k and see how it compares; however, barley is highly repetitive, so going lower than 50k may not be such a good idea.
I've settled on these alignments:
s100k, p90, a90:
s100k, p90, a85:
s100k, p85, a90:
I chose a segment length of 100Kbp because these are large chromosomes and the repeat content can be quite high, around 80%. I chose p85/90 and a85/90 as this is around the expected percent identity given the divergence, and the alignments don't appear to be too fragmented. I'm undecided on whether -a should be near, lower than, or the same as -p. On Erik's pggb README page he suggests -a should be near or lower than -p, to ensure a base-level alignment for the mapping.
What is typically used? I'd like to get a good base-level mapping and alignment. Thanks.