COMBINE-lab / cuttlefish

Building the compacted de Bruijn graph efficiently from references or reads.
BSD 3-Clause "New" or "Revised" License
83 stars 9 forks source link

Unitigs are missing kmers #36

Closed sebschmi closed 1 year ago

sebschmi commented 1 year ago

I executed cuttlefish 2.2.0 as follows:

cuttlefish build -s "$DIR/references.fa" -k 31 -t 28  -o "$DIR/unitigs" -w "$DIR/workdir" --ref -c 1 2>&1

The dataset is available at: https://helsinkifi-my.sharepoint.com/:u:/g/personal/sebschmi_ad_helsinki_fi/EbVAH2jJ_z5Dn43CATYWsbsBdjeSamQJcsMBX30dw07bNg?e=UCOC6W

Cuttlefish log ``` Constructing the compacted reference de Bruijn graph for k = 31. Enumerating the edges of the de Bruijn graph. ********************************************************************************************************************************************* Stage 1: 100% Warning: using counter_max == 1 will cause not storying counters in KMC output file, all counters will be assumed to be 1. This is experimental and is not currently supported in kmc_tools. Will be implemented soon. Stage 2: 100% 32-mer enumeration statistics: Number of sequences: 745409. Total sequence length: 18912222395. Total number of 32-mers: 18889114716. Number of unique 32-mers: 173831316. Number of solid 32-mers: 173831316. Enumerated the edge set of the graph. Time taken = 128.306 seconds. Enumerating the vertices of the de Bruijn graph. Stage 1: 100% Warning: using counter_max == 1 will cause not storying counters in KMC output file, all counters will be assumed to be 1. This is experimental and is not currently supported in kmc_tools. Will be implemented soon. Stage 2: 100% [Building BooPHF] 47.5 % elapsed: 0 min 1 sec remaining: 0 min 1 secEnumerated the vertex set of the graph. Time taken = 12.8121 seconds. Number of edges: 173831316. Number of vertices: 170648589. Constructing the minimal perfect hash function (MPHF) over the vertex set. Total number of k-mers in the set (KMC database): 170648589. Building the MPHF from the k-mer database /grp-abga/hdd/sebschmi/matchtigs/cuttlefish_build-2023-10-27T11-13-16-7oGlUG23gZ/workdir/unitigs.cf_V. Using gamma = 10. Completed a pass over the k-mer database. [Building BooPHF] 94.9 % elapsed: 0 min 2 sec remaining: 0 min 0 sec Completed a pass over the k-mer database. [Building BooPHF] 100 % elapsed: 0 min 3 sec remaining: 0 min 0 sec Built the MPHF in memory. Total MPHF size: 252 MB. Bits per k-mer: 12.4332. Allocated hash table buckets for the k-mers. Total size: 122 MB. Total memory usage by the hash table: 374 MB. Bits per k-mer: 18.4332. Constructed the minimal perfect hash function for the vertices. Time taken = 2.54308 seconds. Computing the DFA states. Total number of distinct edges: 173831316. [Computing DFA states] 99% Number of processed edges: 173831316 Completed a pass over the k-mer database. Computed the states of the automata. Time taken = 4.65436 seconds. Extracting the maximal unitigs. Number of distinct vertices: 170648589. [Extracting maximal unitigs] 100% Number of scanned vertices: 170648589. Number of maximal unitigs: 8896894. Number of k-mers in the maximal unitigs: 170648589. Length of the longest maximal unitig (in bases): 72441. Length of the shortest maximal unitig (in bases): 31. Sum length of the maximal unitigs (in bases): 437555409. There are Detached Chordless Cycles (DCC) present in the graph: DCC count: 8. Number of vertices in the DCCs: 38004. Sum length of the DCCs (in bases): 38244. Extracted the paths. Time taken = 5.91115 seconds. Completed a pass over the k-mer database. Extracted the paths. Time taken = 6.011 seconds. Maximum temporary disk-usage: 17.9225GB. Structural information for the de Bruijn graph is written to /grp-abga/hdd/sebschmi/matchtigs/cuttlefish_build-2023-10-27T11-13-16-7oGlUG23gZ/unitigs.json. Constructed the reference compacted de Bruijn graph at /grp-abga/hdd/sebschmi/matchtigs/cuttlefish_build-2023-10-27T11-13-16-7oGlUG23gZ/unitigs. ```
The JSON file ``` { "parameters info": { "input": "/grp-abga/hdd/sebschmi/matchtigs/cuttlefish_build-2023-10-27T11-13-16-7oGlUG23gZ/references.fa", "k": 31, "output prefix": "/grp-abga/hdd/sebschmi/matchtigs/cuttlefish_build-2023-10-27T11-13-16-7oGlUG23gZ/unitigs" }, "basic info": { "vertex count": 170648589, "edge count": 173831316 }, "contigs info": { "maximal unitig count": 8896894, "vertex count in the maximal unitigs": 170648589, "shortest maximal unitig length": 31, "longest maximal unitig length": 72441, "sum maximal unitig length": 437555409, "avg. maximal unitig length": 49, "_comment": "lengths are in bases" }, "detached chordless cycles (DCC) info": { "DCC count": 8, "vertex count in the DCCs": 38004, "sum DCC length (in bases)": 38244 } } ```

The kmer AACAATGCTGGTATGAAAGAAACATGGCAGG as well as its reverse complement CCTGCCATGTTTCTTTCATACCAGCATTGTT are missing from the cuttlefish unitigs, but present in the input.

Did I use some wrong parameters for cuttlefish?

rob-p commented 1 year ago

Hi @sebschmi,

Thanks for the detailed report! So digging around a bit, I think this may be related to and edge/vertex distinction in how Cuttlefish 2 operates (and in how it is different from Cuttlefish 1).

I confirmed that, as with your cuttlefish command, the resulting unitigs do not have this k-mer (or its rc). Likewise, I confirmed that if I run KMC to count the 31-mer (with no threshold), and dump this database, the k-mer does appear in the output (with a count of 1).

However, if you run KMC with a k-mer length of 32, you'll notice that no 32-mer containing this k-mer appears in the input. That is, if you build the 32-mer database (the length the cuttlefish 2 will use for edges when it constructs the graph with vertices of length 31), and then dump it, there will be no occurrence of the k-mer you point out or its reverse complement; both AACAATGCTGGTATGAAAGAAACATGGCAGG and CCTGCCATGTTTCTTTCATACCAGCATTGTT are missing from any 32-mer.

In order for a vertex (31-mer) to be present in the output of cuttlefish 2, it must be present as an endpoint of some valid edge (32-mer). But this k-mer is not. I believe this explains why it is missing from the unitig set. This is the expected behavior of the algorithm, under the definition provided in the paper, if I am recalling correctly, but I will ping @jamshed for his thoughts on this.

On the other hand, I'm fairly certain that if we build the unitigs on this dataset using the cuttlefish 1 algorithm (since it is, after all, a reference collection and not a read set), that we will observe this k-mer in the resulting unitigs. I will confirm this and report back here when I have the answer.

Thanks! Rob

rob-p commented 1 year ago

@sebschmi & @jamshed,

Indeed, I can confirm that, when building using the vertex-centric approach in cuttlefish 1, the given 31-mer unitig is present:

✦10 ❯ rg AACAATGCTGGTATGAAAGAAACATGGCAGG /mnt/scratch1/sebastian_ecoli_data/unitigs_cf1.gfa1
4827650301:S    120194931       AACAATGCTGGTATGAAAGAAACATGGCAGG LN:i:31

--Rob

sebschmi commented 1 year ago

Great, thanks for the answer. Then I will use cuttlefish 1 instead of 2.

jamshed commented 1 year ago

Thanks @sebschmi for the detailed report, and thanks @rob-p for digging into this! Yes, k-mers would be missing from the output if they are completely isolated in the de Bruijn graph, due to the reasons @rob-p discussed—Cuttlefish 2 deduces the vertex set of the graph from the edge set (i.e. (k + 1)-mers), and hence k-mers that do not participate in any (k + 1)-mer would be missing in the vertex set.