Closed Hua-Zhou closed 4 years ago
Link to relevant GCTA code: https://github.com/cooljeanius/gcta/blob/e2a2bfb116ed018ba6a9fe733035bca01055eeca/grm.cpp#L450
and PLINK code: https://github.com/chrchang/plink-ng/blob/1eb693891de475a180d04806e75cf734b7ded3ac/2.0/plink2_matrix_calc.cc#L290 (whew, this thing is full of goto statements)
Here is what GCTA does, in graph-theoretic terminology (node=sample, edge=off-diagonal entry higher than the threshold, degree=number of off-diagonal entries higher than the threshold):
Here is what PLINK does (from line 174 of the linked file):
The goal here is to keep as many nodes as possible while removing all edges: equivalent to maximum independent set problem. There seem to be a couple of better algorithms that can be implemented in Julia in the Wikipedia link. Still, considering what GCTA and PLINK are doing, I don't think we need that complicated and probably slower algorithms.
Thanks for the nice summary, @kose-y! It's worth trying both Plink and GCTA strategies to see which one gives better results (keep more individuals). A related question is, if a user asks for a specific number of individuals to keep, the function finds an optimal (or nearly optimal) way to remove individuals such that the maximal entry of the remaining matrix is minimized. It seems a harder task.
We may open another issue regarding LD pruning of SNPs so that remaining SNPs are not very highly related. A literature review of how Plink does it is helpful.
On Sun, Jul 26, 2020 at 5:22 AM Seyoon Ko notifications@github.com wrote:
The goal here is to keep as many nodes as possible while removing all edges: equivalent to maximal independent set https://en.wikipedia.org/wiki/Maximal_independent_set problem. There seem to be a couple of better algorithms that can easily be implemented in Julia in the Wikipedia link. Still, considering what GCTA and PLINK are doing, I don't think we need that complicated algorithms.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenMendel/SnpArrays.jl/issues/58#issuecomment-663981264, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGAPMPC6DOKV63IXHGSQS3R5QNXBANCNFSM4PHEQ2VQ .
A related question is, if a user asks for a specific number of individuals to keep, the function finds an optimal (or nearly optimal) way to remove individuals such that the maximal entry of the remaining matrix is minimized. It seems a harder task.
It sounds like your heuristic above fits well with this problem. There also is a bottom-up approach, adding the lowest-degree node and eliminating its neighbors every step.
We are considering creating a sparse graph while computing GRM, and pruning will be performed on the graph. Maybe by using LightGraphs.jl.
I would let the pruning function take either an AbstractMatrix
or
AbstractString
(GRM file of GCTA format) input. For now let's just
implement something quick-and-dirty that assumes the input matrix can be
held in memory. It covers most use cases.
On Fri, Jul 31, 2020 at 10:25 PM Seyoon Ko notifications@github.com wrote:
We are considering creating a sparse graph while computing GRM, and pruning will be performed on the graph. Maybe by using LightGraphs.jl.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/OpenMendel/SnpArrays.jl/issues/58#issuecomment-667473610, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABGAPMPWNXQ3SZOS6HKJFFTR6ORLXANCNFSM4PHEQ2VQ .
It's a common task in the genetic analysis workflow to prune kinship matrix to eliminate highly related individuals according to a threshold of kinship values. GCTA recommends a default threshold of 0.025 (2nd degree cousin, translated to 0.0125 in our notation).
The function interface can be simple as
The output is a
BitVector
such thatΦ[keepidx, keepidx]
has no off-diagonal entries abovekinship_threshold
.A simple heuristic algorithm can be:
keepidx
to be alltrue
s.kinship_thresold
per column.kinship_threshold
and set the correspondingkeepidx
entry tofalse
.Any reference to how current GCTA and PLINK do it will be helpful.