SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Need Inequalities for the `run extract -filter-pvalue` #121

Closed JohnHadish closed 4 years ago

JohnHadish commented 4 years ago

The documentation for the --filter-pvalue option of run extract is currently this:

--filter-pvalue <value>
Value Type: String
Default Value: 1e-3
This is only used if a Condition-Specific Matrix is provided above and applies
to categorical, quantitative and ordinal tests. This filters the network such
that only edges (clusters) with p-values below the given values are kept.
Provide a single p-value to filter all features with the same value. However,
you can specify different p-values for different features. For example, suppose
you were testing a categorical feature named 'Subspecies' with a category of
'Japonica' and you wanted edges with a p-value < 1e-3, you would input
"Subspecies,Japonica,1e-3". You can provide any number of filters but they must
be separated using two colons: "::".

It appears that there is no way for a user to select "greater than value x for condition y".

This matters when a user wants edges that are significant in one condition but not in another. One case is "categorical" columns, where an edge can be significant in multiple categories. The user may be interested in edges that are significant in one of these categories but not in another.

Example: Wheat grown in heat or in drought conditions. The user wants to select edges that are correlated with heat and not with drought.

Suggested implementation: add inequalities to the command:

Subspecies,Japonica,<,1e-3
Subspecies,Japonica,>,1e-3
Subspecies,Japonica,>=,1e-3
Subspecies,Japonica,<=,1e-3
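
To make the request concrete, the selection described above amounts to roughly the following check. This is only an illustrative Python sketch, not KINC code: the edge names, the Heat and Drought p-values, and the 1e-3 threshold are all invented for the example.

```python
# Hypothetical per-edge p-values for two categories of one condition.
# Every name and number here is invented for illustration.
edges = {
    "geneA-geneB": {"Heat": 1e-5, "Drought": 0.2},
    "geneC-geneD": {"Heat": 1e-6, "Drought": 1e-4},
    "geneE-geneF": {"Heat": 0.5, "Drought": 1e-7},
}

threshold = 1e-3

# Keep edges that are significant for Heat (p < threshold)
# but not significant for Drought (p > threshold).
heat_only = [
    name
    for name, p in edges.items()
    if p["Heat"] < threshold and p["Drought"] > threshold
]

print(heat_only)
```

The second comparison is the "greater than" case that a single less-than threshold cannot express.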

spficklin commented 4 years ago

These exist now. They were added in a previous PR, but I don't remember which. When filtering by p-value or r-squared value you can specify gt or lt for greater-than or less-than respectively. They are not really documented yet, though, so we'll have to fix that. There is no gte or lte, but I'm not sure we need those. I'm leaving this open until the command-line documentation is fixed to document this.

spficklin commented 4 years ago

Okay, the command-line help has been updated to:

Value Type: String
Default Value: 1e-3
This is only used if a Condition-Specific Matrix is provided using the --csm
argument and applies to categorical, quantitative and ordinal tests. This
filters the network such that only edges (clusters) with p-values below (or
above) the given values are kept. To filter all conditions, provide a single
p-value threshold. Additionally, you can specify different p-values for
different conditions. For example, suppose you were testing a categorical
condition named 'Subspecies' with a label of 'Japonica' and you wanted edges
with a p-value < 1e-3, you would input "Subspecies,Japonica,1e-3". You can
filter for edges with a p-value greater than or less than the threshold using
the "gt" or "lt" tokens respectively (e.g. "Subspecies,Japonica,lt,1e-3"). You
can provide any number of filters but they must be separated using two colons:
"::". The result of providing multiple tests is a logical "and" (i.e. all tests
for a condition must pass for the edge to be included). If no value is provided
for this argument it defaults to "lt,1e-3".
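
As a rough illustration of the syntax described in this help text, here is a small Python sketch that parses a filter string and applies the logical "and". It is not how KINC implements the option: `parse_filters`, `edge_passes`, and the `Treatment` condition with `Heat` and `Drought` labels are hypothetical names made up for the example, and it only handles the per-condition forms, not the bare-threshold form.

```python
import operator

# Comparison tokens named in the help text.
OPS = {"lt": operator.lt, "gt": operator.gt}


def parse_filters(spec):
    """Split a filter string into (condition, label, op, threshold) tuples."""
    filters = []
    for item in spec.split("::"):
        parts = [p.strip() for p in item.split(",")]
        if len(parts) == 4:
            condition, label, op, value = parts
        elif len(parts) == 3:
            # No token given: assume "lt", matching the documented default.
            condition, label, value = parts
            op = "lt"
        else:
            raise ValueError(f"unrecognized filter: {item!r}")
        if op not in OPS:
            raise ValueError(f"unknown comparison token: {op!r}")
        filters.append((condition, label, op, float(value)))
    return filters


def edge_passes(pvalues, filters):
    """Return True only if every filter passes (logical "and")."""
    return all(
        OPS[op](pvalues[(condition, label)], threshold)
        for condition, label, op, threshold in filters
    )


# Hypothetical usage: significant for Heat, not significant for Drought.
filters = parse_filters("Treatment,Heat,lt,1e-3::Treatment,Drought,gt,1e-3")
edge = {("Treatment", "Heat"): 1e-5, ("Treatment", "Drought"): 0.2}
print(edge_passes(edge, filters))
```

With the invented p-values above, both tests pass, so the edge would be kept; if the Drought p-value dropped below 1e-3, the "gt" test would fail and the edge would be excluded.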

I believe this resolves the issue. A PR will be merged shortly with this fix.