COMBINE-lab / cuttlefish

Building the compacted de Bruijn graph efficiently from references or reads.
BSD 3-Clause "New" or "Revised" License

Complemental k-mers are not handled correctly #35

sebschmi closed this issue 1 year ago

sebschmi commented 1 year ago

input (the first k-mer is the reverse complement of the second):

>two_complemental_kmers
AAAAGCCTGAGAAGATATCTTCTCAGGCTTTT
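
The claim about this input can be checked with a few lines of Python. This is only an illustrative sketch, not part of Cuttlefish; the helper name `reverse_complement` is made up for the example.

```python
# Sketch: check that the two 31-mers of the input are reverse complements.
# `reverse_complement` is a hypothetical helper, not a Cuttlefish function.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(_COMPLEMENT)[::-1]


seq = "AAAAGCCTGAGAAGATATCTTCTCAGGCTTTT"
k = 31

# A 32-base sequence contains exactly two overlapping 31-mers.
kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
first, second = kmers

print(reverse_complement(first) == second)  # True
print(reverse_complement(seq) == seq)       # True: the 32-mer is its own reverse complement
```

Because the whole 32-base sequence equals its own reverse complement, both 31-mers share one canonical form.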

command:

$ cuttlefish build -s test.fa -k 31 -t 8  -o test.cuttlefish.fa -w cuttlefish_tmp --ref -c 1

Constructing the compacted reference de Bruijn graph for k = 31.

Enumerating the edges of the de Bruijn graph.
**

Structural information for the de Bruijn graph is written to test.cuttlefish.fa.json.
Error: Cannot open temporary file ./kmc_01021.bin

Usage :
[...]

No output file produced. Expected output file e.g.:

>0
AAAAGCCTGAGAAGATATCTTCTCAGGCTTT

version:

$ cuttlefish --version
cuttlefish 2.2.0
Supported commands: `build`, `help`, `version`.
Usage:
    cuttlefish build [options]

Apparently, complemental k-mers are not treated correctly.

rob-p commented 1 year ago

It seems there was an error opening the output file?

Error: Cannot open temporary file ./kmc_01021.bin

rob-p commented 1 year ago

Complementary k-mers and self-complementary loops should be handled properly within Cuttlefish. When I run this example I get the following outputs (which seem correct).

test.cuttlefish.fa.fa

>0
AAAAGCCTGAGAAGATATCTTCTCAGGCTTT

test.cuttlefish.fa.json

{
    "parameters info": {
        "input": "complementary.fa",
        "k": 31,
        "output prefix": "test.cuttlefish.fa"
    },
    "basic info": {
        "vertex count": 1,
        "edge count": 1
    },
    "contigs info": {
        "maximal unitig count": 1,
        "vertex count in the maximal unitigs": 1,
        "shortest maximal unitig length": 31,
        "longest maximal unitig length": 31,
        "sum maximal unitig length": 31,
        "avg. maximal unitig length": 31,
        "_comment": "lengths are in bases"
    },
    "detached chordless cycles (DCC) info": {
        "DCC count": 0
    }
}
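
As a rough cross-check of the counts reported above, one can count distinct canonical k-mers and (k+1)-mers of the input. The sketch assumes that vertices correspond to canonical k-mers and edges to canonical (k+1)-mers, with "canonical" meaning the lexicographically smaller of a string and its reverse complement. It is a simplified model for illustration, not Cuttlefish's implementation, and the helper names are hypothetical.

```python
# Sketch under stated assumptions: vertices = canonical k-mers,
# edges = canonical (k+1)-mers. Not Cuttlefish's actual code.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(s):
    """Lexicographically smaller of s and its reverse complement."""
    rc = s.translate(_COMPLEMENT)[::-1]
    return min(s, rc)


def distinct_canonical(seq, length):
    """Set of distinct canonical substrings of the given length."""
    return {canonical(seq[i:i + length]) for i in range(len(seq) - length + 1)}


seq = "AAAAGCCTGAGAAGATATCTTCTCAGGCTTTT"
k = 31

vertices = distinct_canonical(seq, k)
edges = distinct_canonical(seq, k + 1)

print(len(vertices), len(edges))  # 1 1
print(vertices)                   # {'AAAAGCCTGAGAAGATATCTTCTCAGGCTTT'}
```

Under these assumptions the result agrees with "vertex count": 1 and "edge count": 1, and the single canonical 31-mer is the sequence in the output FASTA.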

jamshed commented 1 year ago

Hi @sebschmi:

It seems like a problem with opening enough temporary files in your platform. Please see the solution here: https://github.com/COMBINE-lab/cuttlefish#note.

Thanks.
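
For readers who launch the tool from a script, the open-file limit can also be inspected and raised from Python on Unix-like systems with the standard `resource` module. This is a minimal sketch: the function name `raise_open_file_limit` and the target value 2048 are assumptions for illustration, so check the linked note for the value that is actually recommended.

```python
# Sketch: inspect and raise the soft limit on open file descriptors (Unix only).
# TARGET is an assumed example value, not taken from this thread.
import resource

TARGET = 2048


def raise_open_file_limit(target):
    """Raise the soft RLIMIT_NOFILE towards `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited or soft >= target:
        return soft  # already high enough, nothing to do
    new_soft = target if hard == unlimited else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


new_limit = raise_open_file_limit(TARGET)
print("soft open-file limit:", new_limit)
```

The limit applies to the current process and is inherited by child processes, so a command started afterwards from the same script (for example via `subprocess.run`) would see the raised value. In an interactive shell, the `ulimit` builtin serves the same purpose.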

sebschmi commented 1 year ago

Thank you for the quick reply! Changing the ulimit fixed the problem for this case, but this case was only a reduced version of a bigger example. The bigger example was run with the correct ulimit and does not produce an error, but its output is missing some k-mers.

I have now looked into this in more detail, and it seems that the issue is that cuttlefish applies the "cutoff" to (k+1)-mers instead of k-mers, whereas other tools such as bcalm, bifrost and ggcat apply the cutoff to k-mers.

Is there a way to build a de Bruijn graph of order k with a cutoff applied to the k-mers with cuttlefish?

jamshed commented 1 year ago

Hi @sebschmi: Sorry for the delayed reply! Due to the design of the algorithm, Cuttlefish 2, as is, cannot construct the de Bruijn graph with thresholds on the k-mer frequencies; the thresholds apply to the (k + 1)-mer frequencies.
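
The difference between the two cutoff conventions discussed in this thread can be illustrated with a toy example. The sketch below is a simplified model, not the implementation of Cuttlefish or of any other tool. It assumes that a cutoff keeps items with frequency at least the cutoff, that counts are taken over canonical forms, and that under the (k+1)-mer convention the vertices are the k-mers contained in the surviving (k+1)-mers. All function names and the toy reads are hypothetical.

```python
# Toy model of k-mer cutoff versus (k+1)-mer cutoff. Illustrative only.
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(s):
    """Lexicographically smaller of s and its reverse complement."""
    rc = s.translate(_COMPLEMENT)[::-1]
    return min(s, rc)


def count_canonical(seqs, length):
    """Count canonical substrings of the given length over all sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - length + 1):
            counts[canonical(seq[i:i + length])] += 1
    return counts


def vertices_with_kmer_cutoff(seqs, k, cutoff):
    """Vertices when the cutoff is applied to k-mer frequencies."""
    return {kmer for kmer, c in count_canonical(seqs, k).items() if c >= cutoff}


def vertices_with_edge_cutoff(seqs, k, cutoff):
    """Vertices when the cutoff is applied to (k+1)-mer frequencies."""
    vertices = set()
    for edge, c in count_canonical(seqs, k + 1).items():
        if c >= cutoff:
            # Each surviving (k+1)-mer contributes its two k-mers.
            vertices.add(canonical(edge[:k]))
            vertices.add(canonical(edge[1:]))
    return vertices


reads = ["AACA", "AACC"]  # made-up toy reads
k, cutoff = 3, 2

print(sorted(vertices_with_kmer_cutoff(reads, k, cutoff)))  # ['AAC']
print(sorted(vertices_with_edge_cutoff(reads, k, cutoff)))  # []
```

Here the 3-mer AAC occurs twice and passes a k-mer cutoff of 2, but each of the 4-mers AACA and AACC occurs only once, so no edge survives and AAC is absent under the (k+1)-mer convention. This is the kind of "missing k-mer" described above.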