Closed rickbeeloo closed 5 months ago
Hello @rickbeeloo,
Glad that you found Cuttlefish useful! About the empty output issue, I see that you did not provide any reference files to the build
command; although, the reference file(s) are required by Cuttlefish. Theoretically, Cuttlefish compacts an edge-centric de Bruijn graph, and needs the (k + 1)-mers from the input to work with; and that is why the references are required. Cuttlefish accepts the input reference collection in three format (-r
, -l
, -d
). And since it is possible to input the collection mixing these parameters, we kept the parameters optional. Although, we should add an empty input collection checker so that the build process warns the user of the issue.
In brief, I see that you built the KMC database over a list paths.txt
of references. So, you should add -l paths.txt
to the parameters list in the cuttlefish build
command, and it should work. Please let us know if this fixes the problem, and if yes, then please close the issue.
On a different note, this issue emphasizes that we need to have some toy example of the full build pipeline to help users interpret the parameters correctly. Thanks for the issue. Regards!
Hi,
I have a related question: how to you see "colors" in the final gfa1/gfa2 output?
I ran cuttelfish with -l paths.txt
, successfully producing a large gfa file. How can we know that a given segment is specific or shared among references?
Thanks!
Hi!
I'm paraphrasing @rob-p on this:
We did struggle a bit regarding the best terminology here (for "colors"). Since cuttlefish isn’t an index, nothing is indexed per-se (specifically, we don't associate the color-information with each segment). When we build the compacted graph over multiple references, we consider it colored in the sense that the color information is encoded in the
P
(GFA1) orO
(GFA2) entries.
To answer your query on how to know if some segment with id x
is shared among multiple references or not — each reference has a corresponding path-tiling in the GFA format (with the tag P
or O
based on the version), and that specific segment is present in those references whose paths contain the id x
.
Regards!
Thanks for the question @EA2106-Universite-Francois-Rabelais!
Relatedly, it would be possible in a post-processing step to build an index that would let you query with a contig (segment) and return the reference list in which that segment occurred. Would this type of index be useful to have in your use case? If so what would an ideal interface look like for this functionality?
@jamshed-k @rob-p I would love to extend a little bit on this. It depends on what @EA2106-Universite-Francois-Rabelais means with "segment". If this is the S
tag in the GFA this would be relatively straightforward as @jamshed-k points out, you can use a simple set intersection algorithm to find shared and unique segments based on P/O
. However this gets more complicated when we are interested in finding large shared sequences (i.e. multiple segments between genomes), such as:
G1: S1, S2, S3, S4
G2: S0, S2, S3, S10
Then it would be cool to get a list saying S2, S3
found in G1 and G2 (with ideally the positions too). This can be defined as the continuous frequent itemset problem in pan genome graphs (see this paper more detail). To find such shared regions we have tools such as Sibelia, SibeliaZ and FindFRs. For our dataset, consisting of diverse bacterial genomes, we find large blocks using Sibelia but it's terribly slow, SibeliaZ is fast but produces much shorter blocks. FindFR on the other hand is super fast and allows us to tolerate insertions but unfortunately cannot be used with the output of recent tools (see my discussion here)
Thanks @rob-p , @rickbeeloo & @jamshed-k for the interesting discussion. I am new to the gfa file formats and looking for ways to extract relevant biological information. I agree with @rickbeeloo regarding the access to a list of shared segments. I suppose that an index as suggested by @rob-p may be useful to quickly create these lists. I also wondered if this kind of index could be used to further read mapping from resequencing data? E.g. if you want to compare variation from a new individual to the graph constructed from the reference individuals? Thanks for the support!
First of all, cool tool!
We're trying it to run for a large collection of genomes and first built the k-mer database using:
kmc -k31 -m450 -fm -t30 -ci0 @paths.txt test_graph .
And then ran cuttlefish:cuttlefish build -k 31 -s test_graph -o cuttle_graph -f 2 -t 30
This is the corresponding output:
The resulting GFA file only has the header (
H VN:Z:2.0
) any idea what's going wrong here? When using the default output (0
) there is no temporary file error however the output is also empty