edawson / gfakluge

A C++ library and utilities for manipulating the Graphical Fragment Assembly format.
http://edawson.github.io/gfakluge/
MIT License

[Feature request] GFAK trim - trim a graph by segment length #43

Open edawson opened 5 years ago

edawson commented 5 years ago

@sjackman wants a tool to remove segments with length below a threshold. This should be pretty easy to implement. The right way would be to allow filtering on graph ingestion to avoid having to allocate structs for too-short segments, but for now we can just delete them, their edges, and their path entries from the internal containers.
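The "delete them, their edges, and their path entries" approach could be sketched roughly as follows. This is only an illustration under stated assumptions: `Segment`, `Edge`, `Path`, `Graph`, and `trim_by_length` are hypothetical stand-ins, not gfakluge's actual types or API, and segment length is taken from the stored sequence (a segment with a `*` sequence and a length tag would need different handling).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-ins for a parsed GFA graph.
// These are NOT gfakluge's real containers.
struct Segment { std::string name; std::string sequence; };
struct Edge    { std::string source; std::string sink; };
struct Path    { std::string name; std::vector<std::string> segment_names; };

struct Graph {
    std::vector<Segment> segments;
    std::vector<Edge> edges;
    std::vector<Path> paths;
};

// Remove every segment shorter than min_length, together with any
// edge or path step that refers to a removed segment.
void trim_by_length(Graph& g, std::size_t min_length) {
    // First pass: collect the names of the too-short segments.
    std::unordered_set<std::string> dropped;
    for (const Segment& s : g.segments) {
        if (s.sequence.size() < min_length) dropped.insert(s.name);
    }
    auto is_dropped = [&](const std::string& name) {
        return dropped.count(name) > 0;
    };

    // Erase the segments themselves.
    g.segments.erase(
        std::remove_if(g.segments.begin(), g.segments.end(),
                       [&](const Segment& s) { return is_dropped(s.name); }),
        g.segments.end());

    // Erase edges with at least one dropped endpoint.
    g.edges.erase(
        std::remove_if(g.edges.begin(), g.edges.end(),
                       [&](const Edge& e) {
                           return is_dropped(e.source) || is_dropped(e.sink);
                       }),
        g.edges.end());

    // Erase path steps that visit a dropped segment.
    for (Path& p : g.paths) {
        p.segment_names.erase(
            std::remove_if(p.segment_names.begin(), p.segment_names.end(),
                           is_dropped),
            p.segment_names.end());
    }
}
```

Calling `trim_by_length(g, 500)` on such a graph would keep only segments of at least 500 bases. One caveat of this simple approach: a path that loses an interior step may no longer be contiguous in the trimmed graph, so a real implementation might prefer to split or drop such paths.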

sjackman commented 5 years ago

I’d suggest a command, gfak subgraph, that can apply various filtering options specified at the command line, like -l for minimum segment length, or --segment to specify a list of segment names to keep (at the command line, or in a file).

Also useful to me would be an option to keep all graph components with at least one segment that passes the filtering criteria.
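Keeping whole components that contain at least one passing segment can be sketched as a graph traversal seeded from every passing segment. The names below (`Lengths`, `Links`, `keep_components`) are hypothetical and chosen for illustration; they are not part of gfakluge. Links are treated as undirected, and orientation is ignored.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical inputs: segment name -> length, and links as name pairs.
using Lengths = std::map<std::string, std::size_t>;
using Links = std::vector<std::pair<std::string, std::string>>;

// Return the names of all segments that share a connected component
// with at least one segment of length >= min_length.
std::set<std::string> keep_components(const Lengths& lengths,
                                      const Links& links,
                                      std::size_t min_length) {
    // Build an undirected adjacency list from the links.
    std::map<std::string, std::vector<std::string>> adjacent;
    for (const auto& link : links) {
        adjacent[link.first].push_back(link.second);
        adjacent[link.second].push_back(link.first);
    }

    // Seed the traversal with every segment that passes the filter.
    std::set<std::string> kept;
    std::vector<std::string> stack;
    for (const auto& entry : lengths) {
        if (entry.second >= min_length && kept.insert(entry.first).second) {
            stack.push_back(entry.first);
        }
    }

    // Everything reachable from a seed lies in a component to keep.
    while (!stack.empty()) {
        const std::string current = stack.back();
        stack.pop_back();
        const auto it = adjacent.find(current);
        if (it == adjacent.end()) continue;
        for (const std::string& next : it->second) {
            // Skip links that name a segment missing from the graph.
            if (lengths.count(next) > 0 && kept.insert(next).second) {
                stack.push_back(next);
            }
        }
    }
    return kept;
}
```

Because every passing segment is a seed, a component with no passing segment is never visited and so is dropped as a whole.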

For related tools, see Bandage reduce https://github.com/rrwick/Bandage/wiki/Command-line#bandage-reduce and gfaview -s https://github.com/lh3/gfa1

sjackman commented 5 years ago

Usage: gfaview [options] <in.gfa>
Options:
  General:
    -v INT      verbose level [2]
    -1          only output CIGAR-M operators (for compatibility)
    -u          generate unitig graph (unambiguous merge)
  Subgraph:
    -s EXPR     list of segment names to extract []
    -S INT      include neighbors in a radius [0]
    -d EXPR     list of segment names to delete []
  Graph simplification:
    -r          transitive reduction
    -R INT      fuzzy length for -r [1000]
    -t          trim tips
    -T INT      tip length for -t [4]
    -b          pop bubbles
    -B INT      max bubble dist for -b [50000]
    -o          drop shorter overlaps
    -O FLOAT    dropped/longest<FLOAT, for -o [0.7]
    -m          misc trimming
Note: the order of options matters; one option may be applied >1 times.

sjackman commented 5 years ago

❯❯❯ Bandage reduce

Bandage reduce takes an input graph and saves a reduced subgraph using the
graph scope settings. The saved graph will be in GFA format.

If a graph scope is not specified, then the 'entire' scope will be used, in
which case this will simply convert the input graph to GFA format.

Usage:    Bandage reduce <inputgraph> <outputgraph> [options]

Positional parameters:
          <inputgraph>        A graph file of any type supported by Bandage
          <outputgraph>       The filename for the GFA graph to be made (if it
                              does not end in '.gfa', that extension will be
                              added)

Options:  --help              View this help message
          --helpall           View all command line settings
          --version           View Bandage version number

Settings: --scope <scope>     Graph scope, from one of the following options:
                              entire, aroundnodes, aroundblast, depthrange
                              (default: entire)
          --nodes <list>      A comma-separated list of starting nodes for the
                              aroundnodes scope (default: none)
          --partial           Use partial node name matching (default: exact
                              node name matching)
          --distance <int>    The number of node steps away to draw for the
                              aroundnodes and aroundblast scopes (0 to 100,
                              default: 0)
          --mindepth <float>  The minimum allowed depth for the depthrange
                              scope (0 to 1e+6, default: 10)
          --maxdepth <float>  The maximum allowed depth for the depthrange
                              scope (0 to 1e+6, default: 100)

Online Bandage help: https://github.com/rrwick/Bandage/wiki

https://github.com/rrwick/Bandage/wiki/Command-line#bandage-reduce

sjackman commented 5 years ago

Both tools allow selecting a radius of nodes around specified seeds: gfaview -S and Bandage reduce --distance. Setting that radius to a large value implements the feature of selecting components that have at least one segment passing the criteria.
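The radius idea amounts to a breadth-first search from the seed segments that stops expanding after a fixed number of steps. Here is a minimal sketch, assuming an undirected adjacency map in which each link is listed from both of its endpoints; `Adjacency` and `within_radius` are hypothetical names, not taken from gfakluge, gfaview, or Bandage.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical adjacency map: segment name -> names of linked segments.
// Assumed undirected, i.e. each link appears under both endpoints.
using Adjacency = std::map<std::string, std::vector<std::string>>;

// Return the seeds plus every segment within `radius` steps of a seed.
std::set<std::string> within_radius(const Adjacency& adjacent,
                                    const std::vector<std::string>& seeds,
                                    std::size_t radius) {
    std::map<std::string, std::size_t> distance;
    std::queue<std::string> pending;
    for (const std::string& seed : seeds) {
        if (distance.emplace(seed, std::size_t{0}).second) pending.push(seed);
    }

    // Standard BFS: the first visit to a segment records its shortest
    // step count from any seed.
    while (!pending.empty()) {
        const std::string current = pending.front();
        pending.pop();
        const std::size_t steps = distance[current];
        if (steps == radius) continue;  // do not expand past the radius
        const auto it = adjacent.find(current);
        if (it == adjacent.end()) continue;
        for (const std::string& next : it->second) {
            if (distance.emplace(next, steps + 1).second) pending.push(next);
        }
    }

    std::set<std::string> selected;
    for (const auto& entry : distance) selected.insert(entry.first);
    return selected;
}
```

With a radius of 0 only the seeds are returned. With a radius at least as large as the graph's diameter, the result is every component containing a seed, which is the behaviour described above. Going by the help text quoted earlier, Bandage limits --distance to the range 0 to 100, so "large" is bounded in that tool.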

edawson commented 5 years ago

I can add the radius feature - right now, only large nodes get preserved in output.

sjackman commented 5 years ago

Thanks, Eric!