COMBINE-lab / cuttlefish

Building the compacted de Bruijn graph efficiently from references or reads.
BSD 3-Clause "New" or "Revised" License
83 stars 9 forks source link

Empty GFA (and temporary file delete error) #3

Closed rickbeeloo closed 5 months ago

rickbeeloo commented 4 years ago

First of all, cool tool!

We're trying it to run for a large collection of genomes and first built the k-mer database using: kmc -k31 -m450 -fm -t30 -ci0 @paths.txt test_graph . And then ran cuttlefish: cuttlefish build -k 31 -s test_graph -o cuttle_graph -f 2 -t 30

This is the corresponding output:

Constructing the minimal perfect hash function. Total number of k-mers in the set (KMC database): 9630544516. Building the MPH function from the k-mer database test_graph for info, total work write each : 2.649 total work inram from level 4 : 5.479 total work raw : 25.000 [Building BooPHF] 36.5 % elapsed: 16 min 2 sec remaining: 27 min 54 sec Completed a pass over the k-mer database. [Building BooPHF] 75.5 % elapsed: 22 min 17 sec remaining: 7 min 15 sec Completed a pass over the k-mer database. [Building BooPHF] 100 % elapsed: 34 min 33 sec remaining: 0 min 0 sec Built the MPH function in memory. Total data copy time to BBHash buffers 1596.47

Total MPH size (in MB): 4258; in bits / k-mer: 3.70962 Allocated hash table buckets for the k-mers. Total memory taken (in MB): 5740 Total memory usage by the hash table: 9999 MB. Bits per k-mer: 8.70962 Done allocating the hash table. Time taken = 2083.64 seconds.

Classifying the vertices. Processed 0 sequences. Total reference length is 0 bases. Maximum buffer size used (in MB): 0 Done classifying the vertices. Time taken = 0.00212339 seconds.

Outputting the maximal unitigs. Temporary path file name prefixes: ./cuttlefish-path-output-SoO15XQt3S Processed 0 sequences. Total reference length is 0 bases.cuttlefish build -k 31 -s test_graph -o cuttle_graph_text -f 0 -t 30 Maximum seqeunce buffer size used (in MB): 0 Error deleting temporary files. Aborting

The resulting GFA file only has the header (H VN:Z:2.0) any idea what's going wrong here? When using the default output (0) there is no temporary file error however the output is also empty

jamshed commented 4 years ago

Hello @rickbeeloo,

Glad that you found Cuttlefish useful! About the empty output issue, I see that you did not provide any reference files to the build command; although, the reference file(s) are required by Cuttlefish. Theoretically, Cuttlefish compacts an edge-centric de Bruijn graph, and needs the (k + 1)-mers from the input to work with; and that is why the references are required. Cuttlefish accepts the input reference collection in three format (-r, -l, -d). And since it is possible to input the collection mixing these parameters, we kept the parameters optional. Although, we should add an empty input collection checker so that the build process warns the user of the issue.

In brief, I see that you built the KMC database over a list paths.txt of references. So, you should add -l paths.txt to the parameters list in the cuttlefish build command, and it should work. Please let us know if this fixes the problem, and if yes, then please close the issue.

On a different note, this issue emphasizes that we need to have some toy example of the full build pipeline to help users interpret the parameters correctly. Thanks for the issue. Regards!

EA2106-Universite-Francois-Rabelais commented 4 years ago

Hi, I have a related question: how to you see "colors" in the final gfa1/gfa2 output? I ran cuttelfish with -l paths.txt, successfully producing a large gfa file. How can we know that a given segment is specific or shared among references? Thanks!

jamshed commented 4 years ago

Hi!

I'm paraphrasing @rob-p on this:

We did struggle a bit regarding the best terminology here (for "colors"). Since cuttlefish isn’t an index, nothing is indexed per-se (specifically, we don't associate the color-information with each segment). When we build the compacted graph over multiple references, we consider it colored in the sense that the color information is encoded in the P (GFA1) or O (GFA2) entries.

To answer your query on how to know if some segment with id x is shared among multiple references or not — each reference has a corresponding path-tiling in the GFA format (with the tag P or O based on the version), and that specific segment is present in those references whose paths contain the id x.

Regards!

rob-p commented 4 years ago

Thanks for the question @EA2106-Universite-Francois-Rabelais!

Relatedly, it would be possible in a post-processing step to build an index that would let you query with a contig (segment) and return the reference list in which that segment occurred. Would this type of index be useful to have in your use case? If so what would an ideal interface look like for this functionality?

rickbeeloo commented 4 years ago

@jamshed-k @rob-p I would love to extend a little bit on this. It depends on what @EA2106-Universite-Francois-Rabelais means with "segment". If this is the S tag in the GFA this would be relatively straightforward as @jamshed-k points out, you can use a simple set intersection algorithm to find shared and unique segments based on P/O. However this gets more complicated when we are interested in finding large shared sequences (i.e. multiple segments between genomes), such as:

G1: S1, S2, S3, S4
G2: S0, S2, S3, S10

Then it would be cool to get a list saying S2, S3 found in G1 and G2 (with ideally the positions too). This can be defined as the continuous frequent itemset problem in pan genome graphs (see this paper more detail). To find such shared regions we have tools such as Sibelia, SibeliaZ and FindFRs. For our dataset, consisting of diverse bacterial genomes, we find large blocks using Sibelia but it's terribly slow, SibeliaZ is fast but produces much shorter blocks. FindFR on the other hand is super fast and allows us to tolerate insertions but unfortunately cannot be used with the output of recent tools (see my discussion here)

EA2106-Universite-Francois-Rabelais commented 4 years ago

Thanks @rob-p , @rickbeeloo & @jamshed-k for the interesting discussion. I am new to the gfa file formats and looking for ways to extract relevant biological information. I agree with @rickbeeloo regarding the access to a list of shared segments. I suppose that an index as suggested by @rob-p may be useful to quickly create these lists. I also wondered if this kind of index could be used to further read mapping from resequencing data? E.g. if you want to compare variation from a new individual to the graph constructed from the reference individuals? Thanks for the support!