PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

fc_consensus.py parameter: max_n_read #284

Closed zyndagj closed 8 years ago

zyndagj commented 8 years ago

Hello,

I'm trying to choose an ideal max_n_read to make sure repetitive regions are not over-corrected, and I was looking for a little clarification on the use of this parameter in the falcon consensus step. Does the consensus happen on a read-by-read basis, or in windows along a read? If it does take place read-by-read, does this number scale for longer reads? I would expect the number of aligning reads to be dependent on the primary (reference) read length.

Thanks, Greg Zynda

pb-jchin commented 8 years ago

Hi, @zyndagj, the logic there is to pick the longest overlapping reads for correction.

zyndagj commented 8 years ago

Does this mean that if I set --max_n_read to 30, only the 30 longest reads are corrected, or have I gone off course?

pb-jchin commented 8 years ago

The 30 longest overlapping reads will be used. See https://github.com/PacificBiosciences/FALCON/blob/master/src/py/mains/consensus.py lines 146-148 and https://github.com/PacificBiosciences/DALIGNER/blob/master/LA4Falcon.c lines 767-781.
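To make sure I follow, here is a minimal sketch (not FALCON's actual code; the function name and signature are illustrative) of the selection you describe: for each seed read, keep only the `max_n_read` longest overlapping reads for consensus.

```python
# Hypothetical sketch of the selection logic described above.
# Assumption: reads are plain sequence strings; FALCON's real data
# structures carry IDs and alignment coordinates as well.

def pick_longest(seed, overlapping_reads, max_n_read=30):
    """Return the seed read plus the max_n_read longest overlapping reads."""
    longest = sorted(overlapping_reads, key=len, reverse=True)[:max_n_read]
    return [seed] + longest

reads = ["A" * n for n in (100, 500, 300, 900, 200)]
picked = pick_longest("G" * 1000, reads, max_n_read=3)
# The seed comes first, followed by the 900, 500, and 300 bp hits.
```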

zyndagj commented 8 years ago

Jason,

Thanks for the links. I'm still a little fuzzy on what is happening, so let me walk through what I figured out from the sources you mentioned. In the LA4Falcon code, it looks like for every primary read longer than -H, the longest 400 (MAX_HIT_COUNT) alignments are passed to fc_consensus.py. The function script_run_consensus in bash.py doesn't pass the -n flag, so MAX_HIT_COUNT doesn't change.

The function get_seq_data in consensus.py then reads in each primary (A-read) and the 400 longest matches associated with it. For each primary read, the alignment sequences are sorted again and the longest max_n_read alignments are kept for consensus.
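If I have that right, the pipeline is effectively a two-stage filter, which I'd sketch like this (illustrative code only; MAX_HIT_COUNT is the LA4Falcon constant, the function name is mine):

```python
# Sketch of my understanding of the two-stage filtering: LA4Falcon
# first caps the hits per A-read at MAX_HIT_COUNT, then the consensus
# step keeps only the max_n_read longest of those.

MAX_HIT_COUNT = 400  # LA4Falcon's default cap on hits per A-read

def two_stage_filter(hits, max_n_read):
    # Stage 1: LA4Falcon passes at most MAX_HIT_COUNT longest alignments.
    stage1 = sorted(hits, key=len, reverse=True)[:MAX_HIT_COUNT]
    # Stage 2: get_seq_data keeps the max_n_read longest of those.
    return stage1[:max_n_read]

hits = ["A" * n for n in range(1, 501)]   # 500 hits, lengths 1..500
kept = two_stage_filter(hits, 30)
# Only the 30 longest (lengths 500 down to 471) survive both stages.
```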

Going back to the original example of --max_n_read 30, my current understanding is that for every primary (A-read) sequence, the 30 longest alignments will be used for consensus. It's good to know you are taking the longest alignments, but the number of matches depends on the primary read length. Have you encountered sparse overlap coverage from short reads in the past? Some of the data I have been working with has many short (< 500 bp) reads, so I have been using -l400 for daligner and -x400 for DBsplit. Let me know if you have any recommendations.

Thanks, Greg

bredeson commented 8 years ago

Hey Jason,

What's the rationale behind using a max number of reads over a max average coverage of reads?

Jessen

pb-jchin commented 8 years ago

@bredeson, ideally, finding a set of reads that evenly covers the seed would be best. In reality, that computation may be tricky to get right. Unique regions are usually not a problem; the problem is that when a seed read contains a high-copy repeat, it recruits too many hits. The max number of reads is designed to cap the computational cost.

bredeson commented 8 years ago

Hey Jason,

Thank you for your response. By "max average coverage" I was thinking of something as simple to compute as follows (in consensus.py):

Replace max_n_read with max_cov line 121 (and throughout) and the condition at line 146 with:

if len(seqs) >= min_cov_aln:
    seqs = seqs[:1] + sorted(seqs[1:], key=lambda x: -len(x))
    max_n_read = 0
    sum_bases = 0
    seed_len = float(len(seqs[0]))
    for seq in seqs[1:]:
        sum_bases += len(seq)
        max_n_read += 1
        if sum_bases/seed_len > max_cov:
            break
    # +1 so the slice keeps the seed read plus max_n_read hits
    yield (seqs[:max_n_read + 1], seed_id, config)

This assumes that hit reads are randomly distributed along the seed read and that the longest reads are not biased toward erroneously mapped reads/repeats, assumptions the current max_n_read implementation also makes. It would also have the advantage of scaling uniformly with longer reads.
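To make the idea concrete, here is a self-contained sketch of the coverage cap (function name and data are illustrative, not FALCON code): hits are taken longest-first and accumulated until their combined length exceeds max_cov times the seed length, so the cutoff scales with the seed read.

```python
# Hedged sketch of the proposed max_cov selection: keep the longest
# hits until estimated coverage of the seed exceeds max_cov.

def cap_by_coverage(seed, hits, max_cov):
    """Return the longest hits whose total length stays near max_cov * len(seed)."""
    hits = sorted(hits, key=len, reverse=True)
    kept, total = [], 0
    for h in hits:
        kept.append(h)
        total += len(h)
        if total / float(len(seed)) > max_cov:
            break
    return kept

seed = "G" * 1000
hits = ["A" * 800] * 10              # ten 800 bp hits over a 1 kb seed
kept = cap_by_coverage(seed, hits, max_cov=2.0)
# 800 * 3 = 2400 > 2000, so three hits are kept before the cap trips.
```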

Have you tried something like this while developing Falcon?

-Jessen

pb-jchin commented 8 years ago

Hi, @bredeson, I agree that using max_cov is in general better than max_n_read. Can you submit a PR?