SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Issues with Functionality of "Condition-Specific Subgraphs" #118

Closed JohnHadish closed 4 years ago

JohnHadish commented 4 years ago

I have a few issues with the functionality of the new KINC and its Condition-Specific Subgraphs.

Which edges are included is controlled by the --filter-pvalue argument, but it is not very well documented. Currently the documentation indicates that you should use a numeric p-value:

kinc run extract \
  --emx "rice_heat_drought.GEM.FPKM.filtered.emx" \
  --ccm "rice_heat_drought.GEM.FPKM.filtered.paf.ccm" \
  --cmx "rice_heat_drought.GEM.FPKM.filtered.paf.cmx" \
  --csm "rice_heat_drought.GEM.FPKM.filtered.paf.csm" \
  --format "text" \
  --output "rice_heat_drought.GEM.FPKM.filtered.th0.5.cs1e-3.gcn.txt" \
  --mincorr 0.80 \
  --maxcorr 1 \
  --filter-pvalue "1e-3" \
  --filter-rsquare "0.3"

But it appears that you can actually pass a three-part argument that lets you extract edges based on a column and a number.

Also, --filter-pvalue appears to have a default p-value, but this is never mentioned. I feel that if you do not include a value, it should default to extracting everything rather than filtering at an arbitrary threshold.

In addition, it appears that you cannot extract edges based on "greater than x for condition y, and less than x for condition z". This is a common protocol we would use in KINC.r. For example, "extract all edges that are significant for heat, and not significant for control".

spficklin commented 4 years ago

The comments for both the r-squared and p-value filters have been updated to what appears below. I left the defaults for both because otherwise the resulting network could be astronomically huge! The PR with the fix will be submitted and merged shortly.

--filter-pvalue <value>
Value Type: String
Default Value: 1e-3
This is only used if a Condition-Specific Matrix is provided using the --csm
argument and applies to categorical, quantitative and ordinal tests. This
filters the network such that only edges (clusters) with p-values below (or
above) the given values are kept. To filter all conditions, provide a single
p-value threshold. Additionally, you can specify different p-values for
different conditions. For example, suppose you were testing a categorical
condition named 'Subspecies' with a label of 'Japonica' and you wanted edges
with a p-value < 1e-3, you would input "Subspecies,Japonica,1e-3". You can
specify to filter edges with a p-value greater or less using the "gt" or "lt"
tokens respectively (e.g. "Subspecies,Japonica,lt,1e-3"). You can provide any
number of filters but they must be separated using two colons: "::". The result
of providing multiple tests is a logical "and" (i.e. all tests for a condition
must pass for the edge to be included). If no value is provided for this
argument it defaults to "lt,1e-3".
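
To make the filter syntax above concrete, here is a minimal Python sketch of how such a string could be broken into individual tests. It is not KINC's implementation: `PValueFilter` and `parse_pvalue_filters` are hypothetical names, and the handling of malformed input is an assumption.

```python
from typing import List, NamedTuple, Optional


class PValueFilter(NamedTuple):
    condition: Optional[str]  # None means the filter applies to all conditions
    label: Optional[str]      # category label, e.g. 'Japonica'
    op: str                   # "lt" or "gt"
    threshold: float


def parse_pvalue_filters(spec: str = "lt,1e-3") -> List[PValueFilter]:
    """Split a --filter-pvalue style string into its individual tests.

    Hypothetical helper that follows the help text only; the default
    mirrors the documented default of "lt,1e-3".
    """
    filters = []
    for part in spec.split("::"):  # filters are separated by two colons
        fields = [f.strip() for f in part.split(",")]
        # Pull off an optional leading "condition,label" pair.
        condition = label = None
        if len(fields) >= 3:
            condition, label = fields[0], fields[1]
            fields = fields[2:]
        # What remains is either "threshold" or "op,threshold".
        if len(fields) == 1:
            op, value = "lt", fields[0]  # documented default direction
        elif len(fields) == 2 and fields[0] in ("lt", "gt"):
            op, value = fields
        else:
            # Assumption: anything else is rejected outright.
            raise ValueError(f"unrecognized filter: {part!r}")
        filters.append(PValueFilter(condition, label, op, float(value)))
    return filters
```

Under this reading, the "significant for heat, and not significant for control" request from the original post would become two tests joined by "::". The condition name 'Treatment' and the labels 'Heat' and 'Control' are made up for illustration:

```python
spec = "Treatment,Heat,lt,1e-3::Treatment,Control,gt,1e-3"
for f in parse_pvalue_filters(spec):
    print(f)
```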

--filter-rsquare <value>
Value Type: String
Default Value: 0.3
This is only used if a Condition-Specific Matrix is provided using the --csm
argument and applies to quantitative and ordinal tests. This filters the network
such that only edges (clusters) with r-squared values derived from linear
regression testing that are above (or below) the given values are kept. To
filter all conditions, provide a single r-squared value. By default any
r-squared value above the value will be kept. Additionally, you can specify
different r-squared values for different conditions. For example, suppose you
were testing a quantitative condition named 'Weight' and you wanted edges with an
r-squared value > 0.5, you would input "Weight,0.5". You can specify to filter
edges greater or less using the "gt" or "lt" tokens respectively (e.g.
"Weight,gt,0.5"). You can provide any number of filters but they must be
separated using two colons: "::". The result of providing multiple tests is a
logical "and" (i.e. all tests for a condition must pass for the edge to be
included). If no value is provided for this argument it defaults to "gt,0.3".
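
The logical "and" described above can be sketched in Python as follows. This is an illustration only, not how KINC evaluates edges: `parse_rsquare_filters` and `edge_passes` are hypothetical names. Two behaviours are assumptions, since the help text does not state them: a filter without a condition name is checked against every condition on the edge, and an edge with no r-squared value for a named condition is dropped.

```python
from typing import Dict, List, Optional, Tuple

# (condition name or None for "all conditions", "gt"/"lt", threshold)
RSquareFilter = Tuple[Optional[str], str, float]


def parse_rsquare_filters(spec: str = "gt,0.3") -> List[RSquareFilter]:
    """Split a --filter-rsquare style string into its individual tests."""
    filters = []
    for part in spec.split("::"):
        fields = [f.strip() for f in part.split(",")]
        # A leading field that is not "gt"/"lt" and is followed by more
        # fields is taken to be the condition name.
        condition = None
        if len(fields) >= 2 and fields[0] not in ("gt", "lt"):
            condition, fields = fields[0], fields[1:]
        if len(fields) == 1:
            op, value = "gt", fields[0]  # documented default direction
        elif len(fields) == 2 and fields[0] in ("gt", "lt"):
            op, value = fields
        else:
            raise ValueError(f"unrecognized filter: {part!r}")
        filters.append((condition, op, float(value)))
    return filters


def edge_passes(rsquares: Dict[str, float], filters: List[RSquareFilter]) -> bool:
    """Return True only if every test passes (logical "and")."""
    for condition, op, threshold in filters:
        # Assumption: a filter with no condition is checked against
        # every condition present on the edge.
        names = [condition] if condition is not None else list(rsquares)
        for name in names:
            value = rsquares.get(name)
            if value is None:
                return False  # assumption: a missing test counts as a failure
            if op == "gt" and not value > threshold:
                return False
            if op == "lt" and not value < threshold:
                return False
    return True
```

For example, with 'Weight' from the help text and a made-up second condition 'Height':

```python
filters = parse_rsquare_filters("Weight,gt,0.5::Height,lt,0.2")
print(edge_passes({"Weight": 0.7, "Height": 0.1}, filters))
```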