fgvieira / ngsLD

Calculation of pairwise Linkage Disequilibrium (LD) under a probabilistic framework
GNU General Public License v2.0

Does this pruning method seem appropriate? Plus, uncertainty about pruning beagle format #25

Closed by zjnolen 2 years ago

zjnolen commented 2 years ago

Hello!

Thanks for this great tool! It seems super useful for the genotype likelihood toolbox :)

This is not an issue, more a question to make sure I am thinking correctly about a couple of things while using it, so feel free to close or delete if it isn't appropriate here. I have been trying to use ngsLD to linkage prune some low coverage data for use with PCANGSD and NGSadmix. I have run into long run times for the pruning script, as I've seen mentioned on here before, and understand it's a proof of concept and not meant for production. I tried putting a similar one together with a different library that might be faster, but wasn't sure I had gotten all the steps right and thought I'd check with you. Here are the main steps of the script:

import argparse
import pandas as pd
from graph_tool.all import *

# minimal argument parsing so args.input below is defined
parser = argparse.ArgumentParser()
parser.add_argument("input", help="ngsLD output file")
args = parser.parse_args()

# reads in ngsLD output
df = pd.read_table(args.input, header = None)
df.columns = ['site1','site2','dist','r2pear','D','Dp','r2']
# drops unused columns
df = df.drop(columns=['r2pear','D','Dp'])
# adds a column 'drop' flagging pairs to exclude: distance greater than the cutoff (here 50 kb) or r2 below the cutoff (here 0.1)
df['drop'] = (df['dist'] > 50000) | (df['r2'] < 0.1)

# sets up graph with properties for distance, r2, and drop on the edges, and a cumulative weight on the vertices
G = Graph(directed=False)
dist = G.new_edge_property("int")
r2 = G.new_edge_property("double")
drop = G.new_edge_property("bool")
weight = G.new_vertex_property("double")

# adds edges from dataframe
name = G.add_edge_list(df.values, hashed=True, eprops=[dist,r2,drop])
# filters out edges where drop = 1
G.set_edge_filter(drop,inverted=True)

# loop over vertices, dropping neighbors of the heaviest vertex in each iteration
while True:
    edges = G.num_edges()
    if edges == 0:
        break
    # calculate cumulative weight (sum of r2 of all adjacent edges) for each vertex
    incident_edges_op(G, "out", "sum", r2, weight)
    # calculate maximum cumulative weight
    max_weight = weight.a.max()
    # find the heaviest vertex (find_vertex returns a list of matching vertices)
    heavy = find_vertex(G, weight, max_weight)
    # find and remove all neighboring vertices of heaviest node
    heavy_neighbors = G.get_out_neighbors(heavy[0])
    G.remove_vertex(heavy_neighbors, fast = True)
# collects the names of all remaining nodes (the sites retained after pruning) into a dataframe
pruned_df = pd.DataFrame([name[v] for v in G.get_vertices()])
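
(To use the result downstream, the remaining names could then simply be written out, e.g. with something like the line below; the output path is just a placeholder.)

pruned_df.to_csv("pruned_sites.txt", header=False, index=False)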

Do these steps match the steps intended in the pruning script here? If so, it seems it can be a significant speed-up, mostly due to faster read-in and calculation times for the graph objects, since it's still centered around the same style of loop. On some test data I put together, I could prune a ~250 MB LD file in about half an hour, whereas the original script had not finished after 24 hours. I have also now tested it with a whole chromosome that made for a 16 GB LD file, which took one hour to prune. So if this seems to match the pruning method properly, I will move forward with it on my data, and maybe it can be of use to others.

My other question is about beagle file 'etiquette'. When it comes to pruning, is there any reason why one should re-run angsd with the -sites option rather than using something like grep to pull the pruned sites' lines from the beagle file? I imagine grep would be faster than re-running, but I wasn't sure if something changes when limiting the analyses to specific sites compared to just filtering the file. Similarly, if making beagle files per chromosome using the same bamlist, is there any reason against merging them with cat at the end to get a genome-wide file? Maybe the multiple headers?

Thanks for any insights!

fgvieira commented 2 years ago

Hi @zjnolen and thanks for your interest in ngsLD!

ngsLD has several ways to calculate a node's weight (sum of (a)bsolute edge weights [default], sum of (e)dge weights, or (n)umber of connections), and it allows you to either keep or discard the heaviest nodes. In your code it seems you calculate the weight as a plain sum (not absolute) and keep the heaviest node. I am not very familiar with the python library graph_tool but, if you are comparing with ngsLD --weight_type e --keep_heavy, it seems correct. If you end up making a script with this new method, I'd gladly include it in ngsLD; just attach it here or make a PR.

Not sure why one should re-run angsd with the -sites option after LD pruning; you can just grep the sites from the initial beagle file. Just be careful about site names (a pattern like chr1_1234 also matches chr1_12345). I guess the main reason might be that, for LD pruning, you only need the beagle file, but afterwards you might want to do other analyses (e.g. SAF) where you need to re-run angsd anyway.
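
For example, instead of grepping patterns you could match the whole marker column exactly, which sidesteps the partial-match problem. A rough Python sketch (untested, and the file names are just placeholders):

import gzip

# pruned_sites.txt: one retained site name per line, as output by the pruning step
pruned = set(line.strip() for line in open("pruned_sites.txt"))

with gzip.open("chr1.beagle.gz", "rt") as fin, \
     gzip.open("chr1.pruned.beagle.gz", "wt") as fout:
    fout.write(fin.readline())  # keep the beagle header
    for line in fin:
        # exact match on the first (marker) column, so chr1_1234 does not also pull in chr1_12345
        if line.split("\t", 1)[0] in pruned:
            fout.write(line)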

About merging, the problem is the headers, so if you deal with them properly you can just cat the beagle files. :smile:
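
Something along these lines would do it in Python (again untested, file names are placeholders): keep the header from the first file only and append the data lines from the rest.

import gzip

chrom_beagles = ["chr1.beagle.gz", "chr2.beagle.gz"]  # per-chromosome beagle files

with gzip.open("genome.beagle.gz", "wt") as fout:
    for i, path in enumerate(chrom_beagles):
        with gzip.open(path, "rt") as fin:
            header = fin.readline()
            if i == 0:
                fout.write(header)  # write the header only once
            for line in fin:
                fout.write(line)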

zjnolen commented 2 years ago

Hi Filipe,

Ah, yes, I missed the absolute value part; that should be fairly straightforward to add in, I think (something along the lines of the sketch below). Once I put it together into a proper script I'll be sure to make a PR with it so you can include it here, thanks for the interest!
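
Building on the objects in the snippet above (an untested sketch), the absolute edge weights could be built once after add_edge_list and then summed in place of r2 inside the loop. This only changes anything if a signed measure such as D were used as the weight, since r2 is already non-negative:

import numpy as np

# absolute value of the LD column, built once right after add_edge_list
abs_ld = G.new_edge_property("double")
abs_ld.a = np.abs(r2.a)

# then, inside the while loop, summed in place of r2
incident_edges_op(G, "out", "sum", abs_ld, weight)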

Those both make sense with the beagle file, so I'll keep them in mind when pruning and merging. Thank you for your help!

g-pacheco commented 2 years ago

Hello @zjnolen -- may I ask if you have had the time to finish this implementation? It would be amazing if you could share what you have developed to cope with this issue.

Many thanks in advance, George.

zjnolen commented 2 years ago

Hey @g-pacheco, I had gotten it mostly working, but was still missing a couple of features from the original I wanted to add before submitting. I added those in now and have opened a PR (#31), so you can find the script there!

Best, Zach

g-pacheco commented 2 years ago

Hello @zjnolen,

Thanks a lot for sharing your version in Python. I have tested it quickly on a panel of 100K SNPs and 70 individuals and it is indeed much faster (23min in my case).

Thanks a lot, George.