COMBINE-lab / cuttlefish

Building the compacted de Bruijn graph efficiently from references or reads.
BSD 3-Clause "New" or "Revised" License

Cuttlefish 1 - Color by file instead of record #34

Closed (aryakaul closed 1 year ago)

aryakaul commented 1 year ago

Hello!

I'm interested in using cuttlefish 1.0 because your approach to coloring yields monochromatic unitigs. I was wondering if you had any suggestions for how I could color by input file instead of by individual record in the FASTA files.

I had considered merging the FASTA records of each file into a single record; however, that would introduce artificial k-mers at the junctions where contigs are appended together.

I'm aware this is not the recommended usage of cuttlefish but for our research question this approach is necessary. Do you have any suggestions for ways to easily do this?

Thank you!

Best, Arya

jamshed commented 1 year ago

Hi Arya,

Thanks for using Cuttlefish! Just making sure I understand how you're using it: currently we do not output color information explicitly. Are you inferring the color-set of the maximal unitigs from the GFA tilings through an inverted-indexing-like approach?

Currently, in the path-tiling output of the GFA / GFA-reduced format, each tiling has a header of the form Reference:x_Sequence:y, where x is the ID of the reference in the input (i.e. it is the x'th FASTA file) and y is the ID of the corresponding record within that file, as discussed here. So the "Reference:x" part should give you the color corresponding to the x'th file.
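For illustration, the inverted-index idea above could be sketched as follows. This is a minimal sketch, not part of Cuttlefish: the function name colors_by_file is hypothetical, and it assumes the standard GFA1 path-line layout (tab-separated, with the path name in column 2 and a comma-separated list of oriented segment IDs in column 3) together with the Reference:x_Sequence:y header described above.

```python
import re
from collections import defaultdict

# Matches the file ID x in a path name of the form "Reference:x_Sequence:y".
HEADER_RE = re.compile(r"Reference:(\d+)")


def colors_by_file(gfa_lines):
    """Map each segment (unitig) ID to the set of input-file IDs whose paths visit it.

    Assumes GFA1 P-lines: P <tab> path name <tab> oriented segments <tab> overlaps.
    """
    colors = defaultdict(set)
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or fields[0] != "P":
            continue  # not a path line
        match = HEADER_RE.match(fields[1])
        if match is None:
            continue  # path name does not follow the expected header format
        ref_id = int(match.group(1))
        for step in fields[2].split(","):
            if step:
                # Drop the trailing orientation sign ('+' or '-').
                colors[step.rstrip("+-")].add(ref_id)
    return dict(colors)


if __name__ == "__main__":
    # Made-up path lines, for demonstration only.
    demo = [
        "P\tReference:1_Sequence:1\t1+,2-\t*",
        "P\tReference:1_Sequence:2\t2+\t*",
        "P\tReference:2_Sequence:1\t2+,3+\t*",
    ]
    for unitig, refs in sorted(colors_by_file(demo).items()):
        print(unitig, sorted(refs))
```

On a real output one would pass an open file handle, e.g. colors_by_file(open("test_2.gfa1")); the sketch ignores the record ID y, so all records of a file collapse into one color.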

Let me know if I got your query right!

aryakaul commented 1 year ago

Thanks for the clarification Jamshed! I think I figured out the problem. I erroneously expected these two commands to be identical:

(cuttlefish) ➜ cuttlefish build -s ../data/split0_part1.fasta -s ../data/split0_part2.fasta -t 14 -k 31 -m 28 -f 1 -w /n/scratch3/users/a/ak586/tmp_cuttlefish -o ./test_1 
(cuttlefish) ➜ cat fof.txt
../data/split0_part1.txt
../data/split0_part2.txt
(cuttlefish) ➜ cuttlefish build -l ./fof.txt -t 14 -k 31 -m 28 -f 1 -w /n/scratch3/users/a/ak586/tmp_cuttlefish2 -o ./test_2 

But inspecting their outputs shows this is not the case: with the first command, only the last file, ../data/split0_part2.fasta, is read.

(cuttlefish) ➜ grep '^P' ./test_1.gfa1 | cut -f2 | cut -f1 -d'_' | sort | uniq -c
  34839 Reference:1
(cuttlefish) ➜ grep '^P' ./test_2.gfa1 | cut -f2 | cut -f1 -d'_' | sort | uniq -c                               
  35252 Reference:1
  34839 Reference:2

I probably just misunderstood the documentation, but in case this is not intended, I wanted to bring it to your attention. Thanks again!

jamshed commented 1 year ago

Hi Arya,

You're right: the parsing of the arguments does not seem to match what we have in the documentation here. It looks like a problem with how we interface with the cxxopts library; they do mention here that multiple arguments can be passed as -s ... -s .... Maybe we're missing arguments because we wrap the vector in an optional instead of using the vector directly, as in their example.
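As a loose analogy only, the symptom (a repeated -s keeping just the last file) can be reproduced with Python's argparse. This is not cxxopts code and does not confirm the cause suggested above; the helper name parse_inputs is hypothetical. It only shows how a single-valued option and a collecting option differ when the flag is repeated.

```python
import argparse


def parse_inputs(argv, collect_all):
    """Parse repeated -s flags, either overwriting or accumulating the values."""
    parser = argparse.ArgumentParser()
    # "store" keeps only the last occurrence; "append" collects every occurrence.
    action = "append" if collect_all else "store"
    parser.add_argument("-s", dest="seqs", action=action)
    seqs = parser.parse_args(argv).seqs
    if seqs is None:
        return []
    return seqs if isinstance(seqs, list) else [seqs]


if __name__ == "__main__":
    argv = ["-s", "part1.fasta", "-s", "part2.fasta"]
    print(parse_inputs(argv, collect_all=False))  # ['part2.fasta']
    print(parse_inputs(argv, collect_all=True))   # ['part1.fasta', 'part2.fasta']
```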

Thanks for bringing this to our attention!