Open edawson opened 5 years ago
I’d suggest a command gfak subgraph that can apply various filtering options specified at the command line, like -l for minimum segment length, or --segment to specify a list of segment names to keep (at the command line, or in a file).
Also useful to me would be an option to keep all graph components with at least one segment that passes the filtering criteria.
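To make the "keep whole components" request concrete, here is a minimal Python sketch, not gfak code. It assumes a GFA1 file with tab-separated S and L lines, treats links as undirected, and uses hypothetical function names (parse_gfa, components_to_keep). A component is kept when at least one of its segments meets the minimum length.

```python
from collections import defaultdict


def segment_length(fields):
    """Length of an S line: inline sequence, else the LN:i: tag, else 0."""
    if fields[2] != "*":
        return len(fields[2])
    for tag in fields[3:]:
        if tag.startswith("LN:i:"):
            return int(tag[5:])
    return 0


def parse_gfa(text):
    """Return (segment lengths, undirected adjacency) from GFA1 text."""
    lengths = {}
    adj = defaultdict(set)
    for line in text.splitlines():
        f = line.split("\t")
        if f[0] == "S":
            lengths[f[1]] = segment_length(f)
            adj.setdefault(f[1], set())  # isolated segments still get an entry
        elif f[0] == "L":
            adj[f[1]].add(f[3])
            adj[f[3]].add(f[1])
    return lengths, adj


def components_to_keep(lengths, adj, min_len):
    """Names of all segments in components with >= 1 segment of min_len or more."""
    seen, keep = set(), set()
    for start in lengths:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:  # depth-first walk of one connected component
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        if any(lengths.get(n, 0) >= min_len for n in comp):
            keep |= comp
    return keep
```

With segments a (10 bp) and b (2 bp) linked and c (3 bp) isolated, a threshold of 5 keeps both a and b, because b shares a component with a segment that passes.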
For related tools, see Bandage reduce
https://github.com/rrwick/Bandage/wiki/Command-line#bandage-reduce
and gfaview -s
https://github.com/lh3/gfa1
Usage: gfaview [options] <in.gfa>
Options:
General:
-v INT verbose level [2]
-1 only output CIGAR-M operators (for compatibility)
-u generate unitig graph (unambiguous merge)
Subgraph:
-s EXPR list of segment names to extract []
-S INT include neighbors in a radius [0]
-d EXPR list of segment names to delete []
Graph simplification:
-r transitive reduction
-R INT fuzzy length for -r [1000]
-t trim tips
-T INT tip length for -t [4]
-b pop bubbles
-B INT max bubble dist for -b [50000]
-o drop shorter overlaps
-O FLOAT dropped/longest<FLOAT, for -o [0.7]
-m misc trimming
Note: the order of options matters; one option may be applied >1 times.
❯❯❯ Bandage reduce
Bandage reduce takes an input graph and saves a reduced subgraph using the
graph scope settings. The saved graph will be in GFA format.
If a graph scope is not specified, then the 'entire' scope will be used, in
which case this will simply convert the input graph to GFA format.
Usage: Bandage reduce <inputgraph> <outputgraph> [options]
Positional parameters:
<inputgraph> A graph file of any type supported by Bandage
<outputgraph> The filename for the GFA graph to be made (if it
does not end in '.gfa', that extension will be
added)
Options: --help View this help message
--helpall View all command line settings
--version View Bandage version number
Settings: --scope <scope> Graph scope, from one of the following options:
entire, aroundnodes, aroundblast, depthrange
(default: entire)
--nodes <list> A comma-separated list of starting nodes for the
aroundnodes scope (default: none)
--partial Use partial node name matching (default: exact
node name matching)
--distance <int> The number of node steps away to draw for the
aroundnodes and aroundblast scopes (0 to 100,
default: 0)
--mindepth <float> The minimum allowed depth for the depthrange
scope (0 to 1e+6, default: 10)
--maxdepth <float> The maximum allowed depth for the depthrange
scope (0 to 1e+6, default: 100)
Online Bandage help: https://github.com/rrwick/Bandage/wiki
Both tools allow selecting a radius of nodes around specified seeds: gfaview -S and Bandage reduce --distance. Setting that threshold to a large value implements the feature of selecting components that have at least one segment passing the criteria.
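The radius idea can be sketched as a breadth-first search from the seed segments. This is only an illustration in Python with a hypothetical function name (within_radius) and an undirected adjacency dict as input; the exact neighbor semantics of gfaview -S and Bandage reduce --distance (for example, how segment orientation is handled) may differ.

```python
from collections import deque


def within_radius(adj, seeds, radius):
    """Return the seeds plus every segment at most `radius` link steps away.

    adj maps each segment name to the set of its neighbors (undirected).
    """
    dist = {s: 0 for s in seeds if s in adj}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == radius:
            continue  # do not expand past the requested radius
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return set(dist)


# Chain a-b-c-d plus an isolated segment e.
adj = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c"},
    "e": set(),
}
near = within_radius(adj, ["a"], 1)     # a and its direct neighbor b
whole = within_radius(adj, ["a"], 100)  # the entire component containing a
```

Once the radius is at least as large as the longest shortest path in the component, the search returns the seed's whole connected component, which is why a large threshold behaves like component selection.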
I can add the radius feature - right now, only large nodes get preserved in output.
Thanks, Eric!
@sjackman wants a tool to remove segments with length below a threshold. This should be pretty easy to implement. The right way would be to allow filtering on graph ingestion to avoid having to allocate structs for too-short segments, but for now we can just delete them, their edges, and their path entries from the internal containers.
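As a rough illustration of the "just delete them" approach, the Python sketch below works directly on GFA1 text rather than on gfak's internal containers, which are not shown in this thread. The function name drop_short_segments is hypothetical, and the handling of paths is an assumption: steps on deleted segments are removed and the path's overlap field is reset to "*" because the original overlaps no longer line up.

```python
def seg_len(fields):
    """Length of an S line: inline sequence, else the LN:i: tag, else 0."""
    if fields[2] != "*":
        return len(fields[2])
    for tag in fields[3:]:
        if tag.startswith("LN:i:"):
            return int(tag[5:])
    return 0


def drop_short_segments(gfa_text, min_len):
    """Remove segments shorter than min_len, plus their edges and path steps."""
    records = [line.split("\t") for line in gfa_text.splitlines() if line]
    short = {f[1] for f in records if f[0] == "S" and seg_len(f) < min_len}
    out = []
    for f in records:
        if f[0] == "S" and f[1] in short:
            continue  # the segment itself
        if f[0] in ("L", "C") and (f[1] in short or f[3] in short):
            continue  # links and containments touching a deleted segment
        if f[0] == "P":
            # Each step is a segment name followed by an orientation sign.
            steps = [s for s in f[2].split(",") if s[:-1] not in short]
            if not steps:
                continue  # nothing left of this path
            f = [f[0], f[1], ",".join(steps), "*"] + f[4:]
        out.append("\t".join(f))
    return "\n".join(out)


example = "\n".join([
    "H\tVN:Z:1.0",
    "S\ta\tACGTACGT",
    "S\tb\tAC",
    "S\tc\tACGTAC",
    "L\ta\t+\tb\t+\t0M",
    "L\ta\t+\tc\t+\t0M",
    "P\tp1\ta+,b+,c+\t0M,0M",
])
filtered = drop_short_segments(example, 5)
```

Filtering during ingestion, as suggested above, would avoid building records for the short segments at all; this sketch makes two passes over records that are already in memory.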