Hi, @zyndagj, the logic there is to pick the longest overlapping reads for correction.
Does this mean that if I set `--max_n_read` to 30, only the 30 longest reads are corrected, or have I gone off course?
The 30 longest overlapping reads will be used.
See https://github.com/PacificBiosciences/FALCON/blob/master/src/py/mains/consensus.py, lines 146-148, and https://github.com/PacificBiosciences/DALIGNER/blob/master/LA4Falcon.c, lines 767-781.
Jason,
Thanks for the links. I'm still a little fuzzy on what is happening, so let me walk through what I figured out from the sources you mentioned. In the LA4Falcon code, it looks like for every primary read longer than the `-H` cutoff, the longest 400 (`MAX_HIT_COUNT`) alignments are passed to `fc_consensus.py`. The function `script_run_consensus` in `bash.py` doesn't use the `-n` flag, so `MAX_HIT_COUNT` doesn't change.
The function `get_seq_data` in `consensus.py` then reads in each primary (A-read) and the 400 longest matches associated with it. For each primary read, the alignment sequences are sorted again and the longest `max_n_read` alignments are kept for consensus.
Going back to the original example of `--max_n_read 30`, my current understanding is that for every primary (A-read) sequence, the 30 longest alignments will be used for consensus. It's good to know you are taking the longest alignments, but the number of matches depends on the primary read length. Have you encountered sparse overlap coverage from short reads in the past? Some of the data I have been working with has many short (< 500 bp) reads, so I have been using `-l400` for daligner and `-x400` for DBsplit. Let me know if you have any recommendations.
Thanks, Greg
Hey Jason,
What's the rationale behind using a max number of reads over a max average coverage of reads?
Jessen
@bredeson, ideally, finding a set of reads that evenly covers the seeds would be best. In reality, that computation may be tricky to get right. Unique regions are usually not a problem; the problem is that when a seed read contains high-copy repeats, it recruits too many hits. The maximum number of reads is designed to cap the computational cost.
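As a rough back-of-the-envelope illustration (all of the numbers below are invented, just to show the scaling):

```python
# Back-of-the-envelope only: the numbers are invented to show why a fixed cap on
# the number of hits bounds the per-seed work on repeat-rich seeds.

genome_cov = 30        # sequencing coverage
seed_len = 10000       # a 10 kb seed read
mean_read_len = 10000

# A seed from a unique region recruits roughly coverage-many overlapping reads.
unique_hits = genome_cov * (seed_len + mean_read_len) // mean_read_len    # ~60

# A seed that is mostly a high-copy repeat recruits hits from every repeat copy.
repeat_copies = 200
repeat_hits = unique_hits * repeat_copies                                  # ~12000

cap = 400  # a fixed cap on the number of hits per seed
print(min(unique_hits, cap))   # 60: unique seeds are unaffected
print(min(repeat_hits, cap))   # 400: repeat-rich seeds no longer blow up the consensus step
```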
Hey Jason,
Thank you for your response. By "max average coverage" I was thinking of something as simple to compute as the following (in `consensus.py`): replace `max_n_read` with `max_cov` at line 121 (and throughout), and replace the condition at line 146 with:
```python
if len(seqs) >= min_cov_aln:
    seqs = seqs[:1] + sorted(seqs[1:], key=lambda x: -len(x))
    max_n_read = 0
    sum_bases = 0
    seed_len = float(len(seqs[0]))
    for seq in seqs[1:]:
        sum_bases += len(seq)
        max_n_read += 1
        if sum_bases / seed_len > max_cov:
            break
    yield (seqs[:max_n_read], seed_id, config)
```
This would assume that hit reads are randomly distributed along the seed read and that the longest reads are not biased toward erroneously mapped reads/repeats, as the current implementation with `max_n_read` does. This would also have the advantage of scaling uniformly with longer reads.
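For concreteness, here is a self-contained toy comparison of the two strategies; the read lengths and helper functions are made up, and only the coverage-capped loop mirrors the snippet above:

```python
# Toy comparison only: read lengths are invented, and only the coverage-capped
# loop mirrors the proposed snippet above.

def pick_by_count(seqs, max_n_read):
    # Current behaviour: keep the max_n_read longest hits, regardless of seed length.
    return sorted(seqs[1:], key=len, reverse=True)[:max_n_read]

def pick_by_coverage(seqs, max_cov):
    # Proposed behaviour: keep the longest hits until they sum to ~max_cov times the seed length.
    hits = sorted(seqs[1:], key=len, reverse=True)
    seed_len = float(len(seqs[0]))
    picked, sum_bases = [], 0
    for seq in hits:
        picked.append(seq)
        sum_bases += len(seq)
        if sum_bases / seed_len > max_cov:
            break
    return picked

short_seed = ["A" * 2000] + ["C" * 1500] * 100    # 2 kb seed, 100 hits of 1.5 kb each
long_seed = ["A" * 40000] + ["C" * 1500] * 100    # 40 kb seed, same hits

for seqs in (short_seed, long_seed):
    print(len(pick_by_count(seqs, 30)), len(pick_by_coverage(seqs, 30)))
# -> 30 41   (short seed: both caps kick in)
# -> 30 100  (long seed: the coverage cap keeps all hits, scaling with seed length)
```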
Have you tried something like this while developing Falcon?
-Jessen
Hi, @bredeson, I agree that using `max_cov` is in general better than `max_n_read`. Can you submit a PR?
Hello,
I'm trying to choose an ideal `max_n_read` to make sure repetitive regions are not over-corrected, and I was looking for a little clarification on the use of this parameter in the FALCON consensus step. Does the consensus happen on a read-by-read basis, or in windows along a read? If it does take place read-by-read, does this number scale for longer reads? I would expect the number of aligning reads to depend on the primary (reference) read length.
Thanks, Greg Zynda