SystemsGenetics / KINC

Knowledge Independent Network Construction
MIT License

Filtering by rank, top_n argument is confusing #177

Closed spficklin closed 4 years ago

spficklin commented 4 years ago

Issue

This fixes issue #155 raised by @JohnHadish. Previously, the top_n argument of the kinc-filter-rank.R script retrieved all edges with a rank less than or equal to the specified top_n value. Because edges can share the same rank, you would sometimes get many more edges than you asked for, and the resulting network was harder to work with when it turned out bigger than expected.

The Fix

This PR adjusts the script so that the actual number of edges specified by top_n is returned. For example, if you set top_n to 20000, then the top 20,000 edges for each condition will be returned.
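To make the difference concrete, here is a small shell sketch on a made-up edge list. It is not the script's actual R code, only an illustration of the two filtering strategies: the edge names, ranks, and the helper functions old_filter and new_filter are all hypothetical. With tied ranks, a "rank <= top_n" filter can return more than top_n rows, whereas taking the first top_n rows per condition after sorting by rank returns at most top_n.

```shell
# Toy edge list (hypothetical): condition, edge name, rank.
# Three edges tie at rank 2.
edges='condA e1 1
condA e2 2
condA e3 2
condA e4 2
condA e5 5'

# Old behavior (illustrative): keep every edge whose rank is <= top_n.
old_filter() {
    printf '%s\n' "$edges" | awk -v n="$1" '$3 <= n'
}

# New behavior (illustrative): sort by rank, then keep at most
# top_n edges for each condition (column 1).
new_filter() {
    printf '%s\n' "$edges" | sort -k3,3n | awk -v n="$1" '++seen[$1] <= n'
}

old_count=$(old_filter 2 | wc -l | tr -d ' ')
new_count=$(new_filter 2 | wc -l | tr -d ' ')
echo "rank <= 2 keeps $old_count edges; top 2 keeps $new_count edges"
```

With this toy data the rank filter keeps 4 edges and the top-2 filter keeps 2. Which of the tied edges survive in the sketch depends on sort order; it makes no claim about how the script itself breaks ties.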

Also, the kinc-filter-rank.R script gets a nice update with this PR: it now saves the ranked network in an .RData file, so it no longer has to recalculate the ranks each time you run it. That saves a lot of time.

You need dplyr >= 1.0.0, or you might get an error about slice_head not being found. If you get the error, just update dplyr in R.

How to Test

To test, run the example data using the kinc-gmm-run.sh script. Then go into the results directory and run the following to retrieve all edges:

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --top_n 20000

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --save_condition_networks \
    --top_n 20000

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --save_condition_networks --unique_filter "label" \
    --top_n 20000

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --save_condition_networks --unique_filter "class" \
    --top_n 20000

Then run the following to get just 20 edges:

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --top_n 20

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --save_condition_networks \
    --top_n 20

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --save_condition_networks --unique_filter "label" \
    --top_n 20

kinc-filter-rank.R \
    --net "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered.GCN.txt" \
    --out_prefix "PRJNA301554.slim.GEM.log2.paf-th0.00-p1e-3-rsqr0.30-filtered" \
    --save_condition_networks --unique_filter "class" \
    --top_n 20

Look at the resulting files. In the second set of runs you should see only 20 edges per condition (at most; fewer if a condition had fewer than 20 edges).
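One quick way to check the counts is to count the data rows in each output file. The sketch below assumes the output is a text file with a single header line and one edge per row. That layout, the count_edges helper, and the column names in the throwaway demo file are assumptions for illustration, not something this PR specifies.

```shell
# Count data rows (edges) in a network file, assuming one header line.
count_edges() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Demo on a throwaway file standing in for a real output file
# (the column names here are made up).
demo=$(mktemp)
printf 'Source\tTarget\tRank\n' > "$demo"
printf 'g1\tg2\t1\ng1\tg3\t2\ng2\tg3\t3\n' >> "$demo"

n_edges=$(count_edges "$demo")
echo "$n_edges edges"
rm -f "$demo"
```

Pointing count_edges at each per-condition file from the second set of runs should report 20 or fewer, if the files follow the assumed layout.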

spficklin commented 4 years ago

@JohnHadish can you please review this since you posted the issue?