arq5x / bedtools

A powerful toolset for genome arithmetic.
http://code.google.com/p/bedtools/
GNU General Public License v2.0
139 stars 86 forks source link

Online documentation not up-to-date #121

Closed blaiseli closed 7 years ago

blaiseli commented 7 years ago

I was trying to understand how to use bedtools groupby based on what looks like an official documentation page for the latest version (v2.26.0): http://bedtools.readthedocs.io/en/latest/content/tools/groupby.html

It appears that there are more options to the command, as I can infer from the comments and from the output of the command itself when some options are missing:

$ head -6 /tmp/bed_with_gene_ids.bed | bedtools groupby -g 4
***** ERROR: -opCols parameter requires a value.

Tool:    bedtools groupby 
Version: v2.26.0
Summary: Summarizes a dataset column based upon
     common column groupings. Akin to the SQL "group by" command.

Usage:   bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops] 
     cat [FILE] | bedtools groupby -g [group_column(s)] -c [op_column(s)] -o [ops] 

Options: 
    -i      Input file. Assumes "stdin" if omitted.

    -g -grp     Specify the columns (1-based) for the grouping.
            The columns must be comma separated.
            - Default: 1,2,3

    -c -opCols  Specify the column (1-based) that should be summarized.
            - Required.

    -o -ops     Specify the operation that should be applied to opCol.
            Valid operations:
                sum, count, count_distinct, min, max,
                mean, median, mode, antimode,
                stdev, sstdev (sample standard dev.),
                collapse (i.e., print a comma separated list (duplicates allowed)), 
                distinct (i.e., print a comma separated list (NO duplicates allowed)), 
                distinct_sort_num (as distinct, but sorted numerically, ascending), 
                distinct_sort_num_desc (as distinct, but sorted numerically, descending), 
                concat   (i.e., merge values into a single, non-delimited string), 
                freqdesc (i.e., print desc. list of values:freq)
                freqasc (i.e., print asc. list of values:freq)
                first (i.e., print first value)
                last (i.e., print last value)
            - Default: sum

        If there is only column, but multiple operations, all operations will be
        applied on that column. Likewise, if there is only one operation, but
        multiple columns, that operation will be applied to all columns.
        Otherwise, the number of columns must match the the number of operations,
        and will be applied in respective order.
        E.g., "-c 5,4,6 -o sum,mean,count" will give the sum of column 5,
        the mean of column 4, and the count of column 6.
        The order of output columns will match the ordering given in the command.

    -full       Print all columns from input file.  The first line in the group is used.
            Default: print only grouped columns.

    -inheader   Input file has a header line - the first line will be ignored.

    -outheader  Print header line in the output, detailing the column names. 
            If the input file has headers (-inheader), the output file
            will use the input's column names.
            If the input file has no headers, the output file
            will use "col_1", "col_2", etc. as the column names.

    -header     same as '-inheader -outheader'

    -ignorecase Group values regardless of upper/lower case.

    -prec   Sets the decimal precision for output (Default: 5)

    -delim  Specify a custom delimiter for the collapse operations.
        - Example: -delim "|"
        - Default: ",".

Examples: 
    $ cat ex1.out
    chr1 10  20  A   chr1    15  25  B.1 1000    ATAT
    chr1 10  20  A   chr1    25  35  B.2 10000   CGCG

    $ groupBy -i ex1.out -g 1,2,3,4 -c 9 -o sum
    chr1 10  20  A   11000

    $ groupBy -i ex1.out -grp 1,2,3,4 -opCols 9,9 -ops sum,max
    chr1 10  20  A   11000   10000

    $ groupBy -i ex1.out -g 1,2,3,4 -c 8,9 -o collapse,mean
    chr1 10  20  A   B.1,B.2,    5500

    $ cat ex1.out | groupBy -g 1,2,3,4 -c 8,9 -o collapse,mean
    chr1 10  20  A   B.1,B.2,    5500

    $ cat ex1.out | groupBy -g 1,2,3,4 -c 10 -o concat
    chr1 10  20  A   ATATCGCG

Notes: 
    (1)  The input file/stream should be sorted/grouped by the -grp. columns
    (2)  If -i is unspecified, input is assumed to come from stdin.

By the way, the comments on the documentation page seem to report bugs. I wasn't able to figure out whether it was just due to them not using the correct options or whether it was genuine bugs.

arq5x commented 7 years ago

You need to specify at least one column for the -c option.

Bugs have been fixed in the github repo and will be part of the next release.