COMBINE-lab / cuttlefish

Building the compacted de Bruijn graph efficiently from references or reads.
BSD 3-Clause "New" or "Revised" License
83 stars 9 forks source link

How to output maximal unitigs? #42

Closed dwpeng closed 3 months ago

dwpeng commented 3 months ago

Hi, Jamshed @jamshed

Thanks for cuttlefish giving me a new way to build cDBG in a very fast speed. I have one question after reading your paper. I noticed that you find the maximal unitigs by traversaling vertex mentioned in paper (called "two walks"). Every walk would stop until finding a "Fuzzy-Side" and concat two path if possible. So I check the code in github repo and try to find how to implement the corresponding logic. But there maybe some different with paper detailed in the repo.

I read output_maximal_unitigs_plain in src/CdBG_Plain_Writer.cpp:, I find there is not vertex traversaling, but visited every kmer in seq[left_end:right_end]. Firstly, find start_kmer and then try to find end_kmer. If paired end kmers were found, the unitig will be ouputed if not ouputed.

This is my understanding of how to output unitg. Please point it out for me if I'm wrong.


template <uint16_t k>
size_t CdBG<k>::output_maximal_unitigs_plain(const uint16_t thread_id, const char* const seq, const size_t seq_len, const size_t right_end, const size_t start_idx)
{
    size_t kmer_idx = start_idx;

    // assert(kmer_idx <= seq_len - k);

    Annotated_Kmer<k> curr_kmer(Kmer<k>(seq, kmer_idx), kmer_idx, *hash_table);

    // The subsequence contains only an isolated k-mer, i.e. there's no valid left or right
    // neighboring k-mer to this k-mer. So it's a maximal unitig by itself.
    if((kmer_idx == 0 || Kmer<k>::is_placeholder(seq[kmer_idx - 1])) &&
        (kmer_idx + k == seq_len || Kmer<k>::is_placeholder(seq[kmer_idx + k])))
        output_plain_unitig(thread_id, seq, curr_kmer, curr_kmer);
    else    // At least one valid neighbor exists, either to the left or to the right, or on both sides.
    {
        // No valid right neighbor exists for the k-mer.
        if(kmer_idx + k == seq_len || Kmer<k>::is_placeholder(seq[kmer_idx + k]))
        {
            // A valid left neighbor exists as it's not an isolated k-mer.
            Annotated_Kmer<k> prev_kmer(Kmer<k>(seq, kmer_idx - 1), kmer_idx, *hash_table);

            if(is_unipath_start(curr_kmer.state_class(), curr_kmer.dir(), prev_kmer.state_class(), prev_kmer.dir()))
                // A maximal unitig ends at the ending of a maximal valid subsequence.
                output_plain_unitig(thread_id, seq, curr_kmer, curr_kmer);

            // The contiguous sequence ends at this k-mer.
            return kmer_idx + k;
        }

        // A valid right neighbor exists for the k-mer.
        Annotated_Kmer<k> next_kmer = curr_kmer;
        next_kmer.roll_to_next_kmer(seq[kmer_idx + k], *hash_table);

        bool on_unipath = false;
        Annotated_Kmer<k> unipath_start_kmer;
        Annotated_Kmer<k> prev_kmer;

        // No valid left neighbor exists for the k-mer.
        if(kmer_idx == 0 || Kmer<k>::is_placeholder(seq[kmer_idx - 1]))
        {
            // A maximal unitig starts at the beginning of a maximal valid subsequence.
            on_unipath = true;
            unipath_start_kmer = curr_kmer;
        }
        // Both left and right valid neighbors exist for this k-mer.
        else
        {
            prev_kmer = Annotated_Kmer<k>(Kmer<k>(seq, kmer_idx - 1), kmer_idx, *hash_table);
            if(is_unipath_start(curr_kmer.state_class(), curr_kmer.dir(), prev_kmer.state_class(), prev_kmer.dir()))
            {
                on_unipath = true;
                unipath_start_kmer = curr_kmer;
            }
        }

        if(on_unipath && is_unipath_end(curr_kmer.state_class(), curr_kmer.dir(), next_kmer.state_class(), next_kmer.dir()))
        {
            output_plain_unitig(thread_id, seq, unipath_start_kmer, curr_kmer);
            on_unipath = false;
        }

        // Process the rest of the k-mers of this contiguous subsequence.
        for(kmer_idx++; on_unipath || kmer_idx <= right_end; ++kmer_idx)
        {
            prev_kmer = curr_kmer;
            curr_kmer = next_kmer;

            if(is_unipath_start(curr_kmer.state_class(), curr_kmer.dir(), prev_kmer.state_class(), prev_kmer.dir()))
            {
                on_unipath = true;
                unipath_start_kmer = curr_kmer;
            }

            // No valid right neighbor exists for the k-mer.
            if(kmer_idx + k == seq_len || Kmer<k>::is_placeholder(seq[kmer_idx + k]))
            {
                // A maximal unitig ends at the ending of a maximal valid subsequence.
                if(on_unipath)
                {
                    output_plain_unitig(thread_id, seq, unipath_start_kmer, curr_kmer);
                    on_unipath = false;
                }

                // The contiguous sequence ends at this k-mer.
                return kmer_idx + k;
            }
            else    // A valid right neighbor exists.
            {
                next_kmer.roll_to_next_kmer(seq[kmer_idx + k], *hash_table);

                if(on_unipath && is_unipath_end(curr_kmer.state_class(), curr_kmer.dir(), next_kmer.state_class(), next_kmer.dir()))
                {
                    output_plain_unitig(thread_id, seq, unipath_start_kmer, curr_kmer);
                    on_unipath = false;
                }
            }
        }
    }

    // Return the non-inclusive ending index of the processed contiguous subsequence.
    return kmer_idx + k;
}

dwpeng

jamshed commented 3 months ago

Hi @dwpeng: good to know your interest in the algorithm!

Based on your description, you read the Cuttlefish 2 paper; but the implementation you're looking at is for the original Cuttlefish paper. You'll find the relevant implementation of the specific methods you're looking for at these files (and their .cpp counterparts): Read_CdBG.hpp, Read_CdBG_Constructor.hpp, and Read_CdBG_Extractor.hpp.

Regards.

dwpeng commented 3 months ago

I am so happy to receive your reply. Thank you. I will reread your code.

Regards.