I want to follow Minigraph-Cactus paper to make Figure 4 and Supplementary Figure 15

BlackSlipper commented 10 months ago

I want to reproduce Supplementary Figure 15 from the Minigraph-Cactus paper.

However, I couldn't find any details on how to count the non-reference nodes in a Minigraph-Cactus pangenome.

Did you count the nodes from the vg format or from the GFA format?

I tried using cactus-hal2maf to convert to MAF, but the HAL file produced by the MC pipeline only let me find nodes that include the reference.

I was wondering how you were able to make Fig. 4a and Supplementary Fig. 15.

I would be very grateful if you could explain the method.

Thank you in advance!

AndreaGuarracino commented 10 months ago

@BlackSlipper, I'm not sure whether the same functionality is available in vg, but in odgi I've implemented ways to get the non-reference node IDs and the non-reference ranges.

odgi paths -i graph.gfa --non-reference-nodes reference_paths.txt > non-reference-node-ids.txt
odgi paths -i graph.gfa --non-reference-ranges reference_paths.txt > non-reference-ranges.bed

In reference_paths.txt you have to put the names of the paths that constitute your reference, one name for each line (for example, "grch38#1#chr1" if you have a human chromosome 1 pangenome graph and you use grch38 as reference).
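For readers without a recent odgi build, the idea behind these two commands (collect every node that no reference path visits) can be sketched in plain Python against a GFA file. This is not how odgi implements it. The function name `non_reference_nodes` and the `ref_sample` parameter are hypothetical, and the sketch assumes reference paths appear either as P-lines named `sample#haplotype#contig` (as in the "grch38#1#chr1" example) or as GFA 1.1 W-lines whose sample field equals the reference name.

```python
import re
from typing import Iterable, List, Set, Tuple


def _segment_length(fields: List[str]) -> int:
    """Length of an S-line: the sequence itself, or the LN:i: tag if it is '*'."""
    if fields[2] != "*":
        return len(fields[2])
    for tag in fields[3:]:
        if tag.startswith("LN:i:"):
            return int(tag[5:])
    return 0


def non_reference_nodes(gfa_lines: Iterable[str], ref_sample: str) -> Tuple[Set[str], int]:
    """Return (IDs, total bp) of segments that no reference path visits.

    Assumption: a path is 'reference' if it is a P-line named ref_sample or
    ref_sample#..., or a W-line whose sample field equals ref_sample.
    """
    seg_len = {}               # segment ID -> length in bp
    ref_nodes: Set[str] = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        kind = fields[0]
        if kind == "S":
            seg_len[fields[1]] = _segment_length(fields)
        elif kind == "P":
            name = fields[1]
            if name == ref_sample or name.startswith(ref_sample + "#"):
                # P-line steps look like "1+,2-,3+"; drop the orientation sign.
                ref_nodes.update(s[:-1] for s in fields[2].split(",") if s)
        elif kind == "W" and fields[1] == ref_sample:
            # W-line walks look like ">1<2>3".
            ref_nodes.update(re.findall(r"[<>]([^<>]+)", fields[6]))
    non_ref = set(seg_len) - ref_nodes
    return non_ref, sum(seg_len[n] for n in non_ref)
```

Called as `non_reference_nodes(open("graph.gfa"), "grch38")`, it returns the set of non-reference segment IDs and their summed length. Everything is held in memory, so for a whole-genome graph the odgi commands are likely the more practical route.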

BlackSlipper commented 10 months ago

Thank you @AndreaGuarracino for the quick reply.

I am using Cactus v2.7.0 (currently the latest version) through Docker.

However, the odgi paths command in the Docker image doesn't seem to have the "--non-reference-nodes" and "--non-reference-ranges" options.

`odgi paths {OPTIONS}

Interrogate the embedded paths of a graph. Does not print anything to stdout
by default!

OPTIONS:

  [ MANDATORY ARGUMENTS ]
    -i[FILE], --idx=[FILE]            Load the succinct variation graph in
                                      ODGI format from this *FILE*. The file
                                      name usually ends with *.og*. It also
                                      accepts GFAv1, but the on-the-fly
                                      conversion to the ODGI format requires
                                      additional time!
  [ Path Investigation Options ]
    -O[FILE], --overlaps=[FILE]       Read in the path grouping *FILE* to
                                      generate the overlap statistics from.
                                      The file must be tab-delimited. The
                                      first column lists a grouping and the
                                      second the path itself. Each line has
                                      one path entry. For each group the
                                      pairwise overlap statistics for each
                                      pairing will be calculated and printed
                                      to stdout.
    -L, --list-paths                  Print the paths in the graph to
                                      stdout. Each path is printed in its
                                      own line.
    -l, --list-path-start-end         If -L,--list-paths was specified, this
                                      additionally prints the start and end
                                      positions of each path in additional,
                                      tab-delimited coloumns.
    -f, --fasta                       Print paths in FASTA format to stdout.
                                      One line for the FASTA header, another
                                      line for the whole sequence.
    -H, --haplotypes                  Print to stdout the paths in a path
                                      coverage haplotype matrix based on the
                                      graph’s sort order. The output is
                                      tab-delimited: *path.name*,
                                      *path.length*, *path.step.count*,
                                      *node.1*, *node.2*, *node.n*. Each
                                      path entry is printed in its own line.
    -N, --scale-by-node-len           Scale the haplotype matrix cells by
                                      node length.
    -D[CHAR], --delim=[CHAR]          The part of each path name before this
                                      delimiter CHAR is a group identifier.
                                      For use with -H, --haplotypes**: it
                                      prints an additional, first column
                                      **group.name** to stdout.
    -p[N], --delim-pos=[N]            Consider the N-th occurrence of the
                                      delimiter specified with **-D,
                                      --delim** to obtain the group
                                      identifier. Specify 1 for the 1st
                                      occurrence (default).
  [ Path Modification Options ]
    -K[FILE], --keep-paths=[FILE]     Keep paths listed (by line) in *FILE*.
    -X[FILE], --drop-paths=[FILE]     Drop paths listed (by line) in *FILE*.
    -o[FILE], --out=[FILE]            Write the dynamic succinct variation
                                      graph to this file (e.g. *.og*).
  [ Threading ]
    -t[N], --threads=[N]              Number of threads to use for parallel
                                      operations.
  [ Processing Information ]
    -P, --progress                    Write the current progress to stderr.
  [ Program Information ]
    -h, --help                        Print a help message for odgi paths.`

Is there an option I can use with the odgi in the current Docker image?
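A quick way to check whether a given odgi build offers these options is to search its help text for the flag names, as done by eye above. The sketch below assumes only that `odgi paths --help` prints the option list (to stdout or stderr); the helper names `odgi_paths_help` and `supports_flags` are hypothetical.

```python
import subprocess
from typing import Dict, Iterable


def odgi_paths_help(odgi: str = "odgi") -> str:
    """Return the help text of `odgi paths`, whichever stream it is printed to."""
    result = subprocess.run(
        [odgi, "paths", "--help"], capture_output=True, text=True, check=False
    )
    return result.stdout + result.stderr


def supports_flags(help_text: str, flags: Iterable[str]) -> Dict[str, bool]:
    """Map each flag to True if its name appears anywhere in the help text."""
    return {flag: flag in help_text for flag in flags}
```

For example, `supports_flags(odgi_paths_help(), ["--non-reference-nodes", "--non-reference-ranges"])` reports which of the two options the installed binary advertises.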

glennhickey commented 10 months ago

Thanks @AndreaGuarracino !! This looks like useful functionality that at least a few people have been after lately.

The Cactus docker contains the latest odgi release. I guess I can switch this to the current master. @AndreaGuarracino any plans on making a new ODGI release soon? It looks like 0.8.3 may be pretty stale at this point?

@BlackSlipper In the meantime you can probably find a newer odgi in the pggb Docker image. One thing to be careful of is that, by default, MC will make ODGI versions of the .full (unclipped) graphs, which contain unaligned centromeres. These may throw off your numbers unless you explicitly account for them.

You can make odgi versions of the clipped graphs using --odgi clip or --chrom-og clip, but keep in mind:

- odgi doesn't scale well with large numbers of paths (as arise due to clipped fragments)
- odgi may not properly support subpaths. This may not be an issue here, but it is something to test.

But again, it should still be possible to use odgi on the full graph, then postprocess the results to mask out clipped regions. This could be an interesting tutorial to add to the MC documentation.
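One possible shape for that postprocessing step is to subtract a BED file of clipped regions from the non-reference ranges BED and total what remains. This is only a sketch: it assumes both files are plain three-column BED (name, start, end) that use identical path names, and that a BED of the clipped regions is available from somewhere. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Interval = Tuple[int, int]


def merge(intervals: Iterable[Interval]) -> List[Interval]:
    """Sort intervals and fuse any that overlap or touch."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def read_bed(lines: Iterable[str]) -> Dict[str, List[Interval]]:
    """Parse BED lines into {path name: merged half-open intervals}."""
    by_name = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        name, start, end = line.split("\t")[:3]
        by_name[name].append((int(start), int(end)))
    return {name: merge(iv) for name, iv in by_name.items()}


def subtract(intervals: List[Interval], mask: List[Interval]) -> List[Interval]:
    """Return the parts of `intervals` not covered by `mask` (both merged)."""
    out: List[Interval] = []
    for start, end in intervals:
        cur = start
        for m_start, m_end in mask:
            if m_end <= cur:
                continue            # mask interval lies entirely to the left
            if m_start >= end:
                break               # mask interval lies entirely to the right
            if m_start > cur:
                out.append((cur, m_start))
            cur = max(cur, m_end)
            if cur >= end:
                break
        if cur < end:
            out.append((cur, end))
    return out


def unmasked_bp(ranges_bed: Iterable[str], mask_bed: Iterable[str]) -> int:
    """Total bp of non-reference ranges that fall outside the masked regions."""
    ranges, mask = read_bed(ranges_bed), read_bed(mask_bed)
    return sum(
        end - start
        for name, iv in ranges.items()
        for start, end in subtract(iv, mask.get(name, []))
    )
```

With real data this would be called as `unmasked_bp(open("non-reference-ranges.bed"), open("clipped.bed"))`, where `clipped.bed` is a placeholder name. The inner loop rescans the mask for every range, which is fine for a sketch but slow for very large inputs.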

AndreaGuarracino commented 10 months ago

Oops! I've just made a new ODGI release! https://github.com/pangenome/odgi/releases/tag/v0.8.4
