fgvieira / ngsLD

Calculation of pairwise Linkage Disequilibrium (LD) under a probabilistic framework
GNU General Public License v2.0

Ideas for speeding up prune_graph.pl #39

Open jtoy7 opened 2 years ago

jtoy7 commented 2 years ago

I'm wondering if you have recommendations for speeding up LD pruning with prune_graph.pl. I have been running the script for almost 3 days now, and it is progressing quite slowly. I have a large number of SNPs, but the max_kb_dist I set is pretty small (10 kb). At this point it has only pruned 1,000 of 4.1 million SNPs, so this doesn't seem like a feasible rate. I assume this step would not be helped much by parallelization because it is iterative in nature? I'm running it on a 44-core node with 256 GB of RAM, and memory usage seems fine. The most recent output is below. Thanks for any thoughts!

### Parsed a total of 1538686765 edges (6865812 kept) between 9020425 nodes
### Found 4865432 unlinked nodes; these will be printed and removed from graph now!
### Pruning graph (6865812 edges between 4154993 nodes)
# 1000 nodes pruned; 4153974 linked nodes remaining ...
jtoy7 commented 2 years ago

Full run command is here:

perl $NGSLD/../scripts/prune_graph.pl \
--in_file $BASEDIR/results/ngsld/ALL_maxd_1sd.ld \
--max_kb_dist 10 \
--min_weight 0.5 \
--out $BASEDIR/results/ngsld/ALL_maxd_1sd_unlinked.id
fgvieira commented 2 years ago

I have recently merged #31 adding another pruning script in python (prune_ngsLD.py). Can you give it a try?

jtoy7 commented 2 years ago

Thanks Filipe. I've got the new python script running now. I'm trying it on a large contig, and while it seems to be running faster than the perl version, it has been running for 3 hours now and has only pruned 3,500 nodes, with 2,814,000 remaining. Here's the output so far:

Checking if file is gzipped...
Reading in data...
Filtering edges by distance...
Filtering edges by weight...
Beginning pruning by dropping heaviest position...
Starting with 2817418 positions with 2147648 edges between them...
2817000 positions remaining with 2129850 edges between them...
2816000 positions remaining with 2100857 edges between them...
2815000 positions remaining with 2077131 edges between them...
2814000 positions remaining with 2055787 edges between them...
fgvieira commented 2 years ago

If you want to speed it up further, then you can split your input file by chromosome, and launch each separately.
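
One way to do that split, as a hedged example: a small, hypothetical Python helper (not part of ngsLD) that writes one file per chromosome, assuming the LD output is tab-separated and its first two columns are positions written as chr:pos; adjust the parsing if your position labels differ.

```python
#!/usr/bin/env python3
# Hypothetical helper: split an ngsLD output file into one file per chromosome
# so each chunk can be pruned as a separate job. Assumes tab-separated input
# (optionally gzipped) with positions in columns 1-2 formatted as "chr:pos".
import gzip
import sys

def open_maybe_gzip(path):
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)

def split_by_chrom(in_path, out_prefix):
    handles = {}
    with open_maybe_gzip(in_path) as fh:
        for line in fh:
            pos1, pos2 = line.split("\t", 2)[:2]
            chrom1 = pos1.rsplit(":", 1)[0]
            chrom2 = pos2.rsplit(":", 1)[0]
            if chrom1 != chrom2:
                continue  # skip inter-chromosomal pairs, if any
            if chrom1 not in handles:
                handles[chrom1] = open(f"{out_prefix}.{chrom1}.ld", "w")
            handles[chrom1].write(line)
    for out in handles.values():
        out.close()

if __name__ == "__main__":
    split_by_chrom(sys.argv[1], sys.argv[2])
```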

jtoy7 commented 2 years ago

I have actually been taking this approach to split by linkage group. It still seems to take a long time for the longest chromosomes, however. For example, for my largest chromosome, the script has been running for over 2 days and has pruned ~90,000 SNPs, but there is still a long way to go. The last output was: 2721000 positions remaining with 1184903 edges between them

fgvieira commented 2 years ago

In theory, the script could be parallelized but it would require some code refactoring. As of now, the script iterates over nodes, and my guess would be that the slowest step on each iteration is finding the heaviest node: https://github.com/fgvieira/ngsLD/blob/ecee582ecf92afea91b4c30cde27df47b6b57625/scripts/prune_ngsLD.py#L171-L174

We could iterate over "clusters" of linked nodes and, for each cluster, find the heaviest node. This way, we could thread the cluster part (and each thread would find the heaviest node on that cluster).
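
For readers following along, a minimal sketch of this per-cluster idea (illustrative only, not the actual prune_ngsLD.py code) might look like the following. It uses networkx and concurrent.futures purely for illustration, assumes edges carry their r2 in a "weight" attribute, and takes a node's weight as the sum of its incident edge weights.

```python
# Illustrative sketch of per-cluster pruning: on each pass, find the heaviest
# node of every connected component ("cluster") in parallel and drop it.
import networkx as nx
from concurrent.futures import ProcessPoolExecutor

def heaviest_node(cluster):
    # Node whose incident edges carry the largest total weight.
    return max(cluster.nodes(), key=lambda n: cluster.degree(n, weight="weight"))

def prune(graph, workers=4):
    graph = graph.copy()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        while graph.number_of_edges() > 0:
            # Clusters of size 1 are already unlinked and need no pruning.
            clusters = [graph.subgraph(c).copy()
                        for c in nx.connected_components(graph) if len(c) > 1]
            # Removing a node only affects its own cluster, so each cluster's
            # heaviest node can be searched for independently.
            graph.remove_nodes_from(pool.map(heaviest_node, clusters))
    return list(graph.nodes())  # remaining, unlinked positions to keep
```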

@zjnolen, what do you think? Do you see a better/easier/faster way?

jtoy7 commented 2 years ago

Hi all. I wanted to post an update. It turns out that some chromosomes were taking an inordinate amount of time to run because of a silly parsing bug in my code. I have since fixed the issue and ran an array job across LGs (12 LGs running at a time) using prune_ngsLD.py. It completed the run, trimming 9.02 million SNPs down to 6.70 million unlinked sites in a little over 5 hours. Thanks for all the help!

The parallelization you describe above still sounds like a cool idea!

zjnolen commented 2 years ago

Great to hear that you got it working for you, @jtoy7! 😄

@fgvieira, the way you describe sounds great to me! I think it should be quite possible to implement - there's a function to pull out the largest disconnected component from the graph, and it seems to identify these rather quickly. Each component could then be pulled from the graph and pruned in its own thread from the start. The only problem I see here would be that if it starts out with no disconnected components, then it will essentially be the same as no parallelization. But a regular check could be done within the threads, and if multiple components are found, they are pulled out and added to a thread queue.

I think this can be done with the multiprocessing module in python, but I've not really used it before, so my understanding of it is rather vague. I can give it a try at some point when I have some time, especially if the current version ends up still slow for some datasets.
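
A rough sketch of this work-queue variant (again with networkx and multiprocessing purely for illustration, and the same hypothetical "weight" edge attribute): each worker prunes one component until it either runs out of edges or splits apart, and any newly disconnected pieces go back on the queue.

```python
# Illustrative sketch of pruning disconnected components in parallel with a
# work queue; not an actual implementation from this repository.
import multiprocessing as mp
import networkx as nx

def prune_component(comp):
    # Drop the heaviest node until the component has no edges left or has
    # split into several pieces; return (kept_nodes, new_components).
    while comp.number_of_edges() > 0:
        pieces = list(nx.connected_components(comp))
        if len(pieces) > 1:
            return [], [comp.subgraph(p).copy() for p in pieces]
        drop = max(comp.nodes(), key=lambda n: comp.degree(n, weight="weight"))
        comp.remove_node(drop)
    return list(comp.nodes()), []  # no edges left: keep these positions

def prune_parallel(graph, workers=4):
    kept = []
    queue = [graph.subgraph(c).copy() for c in nx.connected_components(graph)]
    with mp.Pool(workers) as pool:
        while queue:
            results = pool.map(prune_component, queue)
            queue = []
            for nodes, new_comps in results:
                kept.extend(nodes)
                queue.extend(new_comps)
    return kept
```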

fgvieira commented 1 year ago

I just made a more efficient alternative to the pruning scripts (prune_graph). It is threaded, but I haven't seen it use more than 3 or 4 threads. Do you think you could give it a try and see how it goes?

zjnolen commented 1 year ago

Hi @fgvieira,

I've just gotten around to trying out prune_graph as I've gotten back into some analyses where I'm pruning again.

I gave one of the larger chromosomes a try using both prune_graph and prune_ngsLD.py. Your new prune_graph method has definitely been much more memory efficient, using less than 10 GB of RAM, while prune_ngsLD.py was using ~55 GB for this chromosome.

I do find that prune_graph has been slower than the python script, though. The python script finished this chromosome (403147 nodes with 304191755 edges; 86869875 edges after filtering to <50 kb and >0.1 r2) in 2 days 6 hours, while prune_graph is about to reach 4 days with 140000 nodes still remaining (the python run pruned down to ~4000 nodes before all edges were removed).

One thing I've noticed is that the threading hasn't really been working for the chromosomes I've tried, which might explain why it has been slower. I've been using -n 4 to give it 4 threads, but it never really uses more than 1 through the whole run. Could this be driven by some specific characteristics of the graph, maybe? If the threading you've implemented works as you described earlier here, then maybe the graph I get from these chromosomes is essentially one cluster throughout the run.

fgvieira commented 1 year ago

Thanks for giving it a try and comparing prune_ngsLD.py and prune_graph.

Glad to hear that it is more memory efficient but I'd also expect it to be faster. Can you let the prune_graph run finish and compare results? There is another issue (fgvieira/prune_graph#2) where a user reports different results between prune_graph and prune_graph.pl and I am trying to figure out if there is a bug in prune_graph; on that note, do you think you could also run prune_graph.pl on that chromosome and compare all three outputs?

As for the threads, I implemented what I described above, but it only really makes a difference if you have a lot of large clusters (where it takes a lot of time to find the heaviest node). Since you have a single chromosome, it could be that (as you said) you have a single cluster throughout the run. You can give it a try with several chromosomes and see if it uses more than 1 thread, but most of the time it never uses more than 2 or 3 threads due to few or small clusters.

zjnolen commented 1 year ago

For sure, I'll give comparing all three a go on this chromosome. I usually break up my analyses into roughly equal-sized 'chunks' that are around the size of the largest chromosome, so I also have some LD files for chunks containing multiple chromosomes but with a similar total size; those should be able to take advantage of the threads. I'll compare the three prunings on one of those as well and let you know how things go for each.

zjnolen commented 11 months ago

This took a little longer than expected! Our HPC has been having a rough month, so the perl runs kept getting killed before finishing.

I ran this on two beagle files, one made up of one chromosome and one made up of two, with both beagles totaling roughly 225000 SNPs. In both cases there were about 15000000 edges to prune after filtering. The three methods had slightly different numbers of edges after filtering (r2 >= 0.1 & maxkb <= 50000), which I suppose might be related to the precision issues mentioned in the other issue? I was a bit surprised, as we had tested that the perl and python scripts should at least arrive at the same outputs, but maybe this larger dataset had more room for precision differences to show up.

While the three scripts didn't return the same sites for this reason, I did get roughly similar outputs, so I don't think there are any bugs there.

Memory-wise, prune_graph really improves things.

For runtime, the multi-threading in prune_graph helped when multiple linkage groups were in the beagle, but for a single linkage group in particular it was slower than the python script.

I think the run time differences might end up being quite dataset- and filter-setting-specific. The python script might still be useful for datasets where linkage groups are large and SNP density is high, as I seem to have here. Outside of these situations, though, I would think prune_graph would still be the most efficient in terms of both memory and runtime.

fgvieira commented 11 months ago

Thanks @zjnolen for the thorough comparison! Good to know that results are similar (differences probably due to precision), but it is surprising that the python script is 3x faster on some datasets! I'd expect it to be, at most, as fast as the rust version, since it is not threaded.

Do you think you can share the first dataset? I'd like to take a closer look where the bottleneck might be.

zjnolen commented 11 months ago

Yes, I expected the same, so I was surprised as well! I'll attach the beagle file for the test data here. Let me know if there's any info about the dataset or study organism I can share with you that might be useful.

testdata_chr2.beagle.gz

fgvieira commented 11 months ago

I was referring to the LD output file. Can you also send it?

zjnolen commented 11 months ago

Sure thing. It's too large to attach here, I think, so here's a link:

https://www.transfernow.net/dl/20231201rTURet1q

fgvieira commented 11 months ago

Did some tests and, just like you found with the python script, the current parallelization is actually slower than no parallelization! I think it is because, right now, on each iteration the graph is split into (unconnected) sub-graphs and each thread finds the heaviest node on one of them. However, finding the sub-graphs is slower than just traversing the whole graph (like in the python script), since to find the sub-graphs it needs to traverse the whole graph twice!

Another approach would be to parallelize the calculation of the weights.
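
For illustration, a structural sketch of this alternative in Python (prune_graph itself is written in Rust, where the weight computation can run on real threads; in pure Python the GIL would limit the gain, so this only shows the shape of the approach). Node weights are again taken as the sum of incident edge weights on a hypothetical "weight" attribute.

```python
# Illustrative sketch: keep a single global pruning loop that removes one node
# per iteration, but compute the per-node weights in parallel chunks.
import networkx as nx
from concurrent.futures import ThreadPoolExecutor

def chunk_weights(graph, nodes):
    # Weighted degree (sum of incident edge weights) for one chunk of nodes.
    return [(n, graph.degree(n, weight="weight")) for n in nodes]

def prune_weight_parallel(graph, workers=4, chunk=10000):
    graph = graph.copy()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while graph.number_of_edges() > 0:
            nodes = list(graph.nodes())
            chunks = [nodes[i:i + chunk] for i in range(0, len(nodes), chunk)]
            # The expensive part (computing every node's weight) is split
            # across workers; only the final max and the removal are serial.
            weights = []
            for part in pool.map(lambda c: chunk_weights(graph, c), chunks):
                weights.extend(part)
            heaviest = max(weights, key=lambda nw: nw[1])[0]
            graph.remove_node(heaviest)
    return list(graph.nodes())
```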

fgvieira commented 10 months ago

Made a new version that drops the component splitting and instead parallelizes the calculation of the node weights, and it seems to be faster. Here are the numbers I got on 10M edges:

Can you run your tests again and see if you see the same trends?

zjnolen commented 10 months ago

Ah ha, that would explain why it was slower, then!

Gave it a try yesterday for both the datasets I previously tested with:

Run times before for comparison:

perl: 90h & 131h
python: 11h & 14h
rust: 34h & 16.6h

And now:

rust: 8.6h & 11.4h

So it seems to be looking much faster on my end as well! I set it to 4 threads here, though I don't see it using more than 2 or 3 on average, so I think that, as with your timings, the speed improvement will diminish after a certain number of threads. I can also send off a few jobs with other thread counts to get an idea of how it scales on datasets of this size.