OpenMendel / SnpArrays.jl

Compressed storage for SNP data
https://openmendel.github.io/SnpArrays.jl/latest
Other
44 stars 9 forks source link

Prune a kinship matrix to eliminate high relatedness #58

Closed Hua-Zhou closed 4 years ago

Hua-Zhou commented 4 years ago

It's a common task in the genetic analysis workflow to prune kinship matrix to eliminate highly related individuals according to a threshold of kinship values. GCTA recommends a default threshold of 0.025 (2nd degree cousin, translated to 0.0125 in our notation).

The function interface can be simple as

keepidx = prune(Φ::AbstractMatrix; kinship_threshold::Number = 0.0125)

The output is a BitVector such that Φ[keepidx, keepidx] has no off-diagonal entries above kinship_threshold.

A simple heuristic algorithm can be:

  1. Initialize keepidx to be all trues.
  2. Calculate the number of (off-diagonal) entries above kinship_thresold per column.
  3. Find the column with maximum number of entries above kinship_threshold and set the corresponding keepidx entry to false.
  4. Update the counts from Step 1 for remaining columns. Repeat Steps 3-4 until no columns have counts > 0.

Any reference to how current GCTA and PLINK do it will be helpful.

kose-y commented 4 years ago

Link to relevant GCTA code: https://github.com/cooljeanius/gcta/blob/e2a2bfb116ed018ba6a9fe733035bca01055eeca/grm.cpp#L450

kose-y commented 4 years ago

and PLINK code: https://github.com/chrchang/plink-ng/blob/1eb693891de475a180d04806e75cf734b7ded3ac/2.0/plink2_matrix_calc.cc#L290 (whew, this thing is full of goto statements)

kose-y commented 4 years ago

https://www.cog-genomics.org/plink/2.0/distance

https://www.cell.com/ajhg/pdf/S0002-9297(10)00598-7.pdf (around Eqn. 3)

kose-y commented 4 years ago

Here is what GCTA does, in graph-theoretic terminology (node=sample, edge=off-diagonal entry higher than the threshold, degree=number of off-diagonal entries higher than the threshold):

  1. Count degree of each node
  2. For each edge, add the connected node with a higher degree to the "deletion list."
  3. Remove each unique node in the deletion list.
kose-y commented 4 years ago

Here is what PLINK does (from line 174 of the linked file):

  1. count the degree of each remaining sample
  2. as long as edges remain, a. remove partner of first degree-one node, if such a node exists b. otherwise, remove first maximal-degree node
kose-y commented 4 years ago

The goal here is to keep as many nodes as possible while removing all edges: equivalent to maximum independent set problem. There seem to be a couple of better algorithms that can be implemented in Julia in the Wikipedia link. Still, considering what GCTA and PLINK are doing, I don't think we need that complicated and probably slower algorithms.

Hua-Zhou commented 4 years ago

Thanks for the nice summary, @kose-y! It's worth trying both Plink and GCTA strategies to see which one gives better results (keep more individuals). A related question is, if a user asks for a specific number of individuals to keep, the function finds an optimal (or nearly optimal) way to remove individuals such that the maximal entry of the remaining matrix is minimized. It seems a harder task.

We may open another issue regarding LD pruning of SNPs so that remaining SNPs are not very highly related. A literature review of how Plink does it is helpful.

On Sun, Jul 26, 2020 at 5:22 AM Seyoon Ko notifications@github.com wrote:

The goal here is to keep as many nodes as possible while removing all edges: equivalent to maximal independent set https://en.wikipedia.org/wiki/Maximal_independent_set problem. There seem to be a couple of better algorithms that can easily be implemented in Julia in the Wikipedia link. Still, considering what GCTA and PLINK are doing, I don't think we need that complicated algorithms.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenMendel/SnpArrays.jl/issues/58#issuecomment-663981264, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGAPMPC6DOKV63IXHGSQS3R5QNXBANCNFSM4PHEQ2VQ .

kose-y commented 4 years ago

A related question is, if a user asks for a specific number of individuals to keep, the function finds an optimal (or nearly optimal) way to remove individuals such that the maximal entry of the remaining matrix is minimized. It seems a harder task.

It sounds like your heuristic above fits well with this problem. There also is a bottom-up approach, adding the lowest-degree node and eliminating its neighbors every step.

kose-y commented 4 years ago

We are considering creating a sparse graph while computing GRM, and pruning will be performed on the graph. Maybe by using LightGraphs.jl.

Hua-Zhou commented 4 years ago

I would let the pruning function take either an AbstractMatrix or AbstractString (GRM file of GCTA format) input. For now let's just implement something quick-and-dirty that assumes the input matrix can be held in memory. It covers most use cases.

On Fri, Jul 31, 2020 at 10:25 PM Seyoon Ko notifications@github.com wrote:

We are considering creating a sparse graph while computing GRM, and pruning will be performed on the graph. Maybe by using LightGraphs.jl.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenMendel/SnpArrays.jl/issues/58#issuecomment-667473610, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGAPMPWNXQ3SZOS6HKJFFTR6ORLXANCNFSM4PHEQ2VQ .