Suffix of consensus read

CSB5 / INC-Seq

INC-Seq: Accurate single molecule reads using nanopore sequencing

Other

13 stars 4 forks source link

Suffix of consensus read #6

Closed lch14forever closed 5 years ago

lch14forever commented 7 years ago

Reported by @Szymonome,

Found that BLASTN switch generates some information at the end of the read name e.g. Basecall_2D_7/0_730 while POA does not. Why is it like that? Moreover, what exactly does the number "7" stands for? Initially I thought it may be number or molecules used for sequence correction however, fasta file from GRAPHMAP switch have reads with numbers 2, 3, 4 or 5 that shouldn't happen as minimum amount should be 6, f I'm right.

lch14forever commented 7 years ago

Hi,

The number after 2D is the index of the template in the original read. For PBDAGCON, it needs the reads aligned to a template to call consensus, so we choose one segment in the tandem repeats as the template.

The template is chosen according to the following criteria:

BLASTN: choose the template with highest number of alignments (break ties with average %identity)
GraphMap: choose the template with highest average %identity (since the default behavior of graphamp is to try to get full alignment, usually we found almost all reads could be aligned to any chosen template)
- POA: the software does MSA itself, so no template choosing step in this software

Szymonome commented 7 years ago

Hi,

Thank you for your quick response. In this case how can I find quantity of segments used for each corrected read? pipeline.log file does not contain this information while logs/myeasylog.log file has got description of single read only.

lch14forever commented 7 years ago

Unfortunately, that info was not kept. I believe you can modify the following block in buildConsensus.py (from line 265) to print that information:

# blast alignment
for subRead in subReads:
    q_handle = open(tmpQ, 'w')
    SeqIO.write(subRead, q_handle, "fasta")
    q_handle.close()
    stdout = blastn(tmpQ, tmpRef, None, blastOutFMT, seg_num, seg_cov, True)
    num = stdout.count('\n') - 1
    errors = get_errors(stdout)
    if num > alignments["num"]:
        # prefer more alignments
        alignments["num"] = num
        alignments["error"] = errors
        alignments["alignments"] = stdout
    elif num == alignments["num"]:
        # for the same number of alignments, prefer the one with lower error rates
        if errors < alignments["errors"]:
                alignments["num"] = num
                alignments["error"] = errors
                alignments["alignments"] = stdout

lch14forever commented 7 years ago

Sorry, I think I understood your question wrongly. If you are looking for the number of segments used to construct consensus, you have to look for lines in the messages like follows:

Consensus called 2d46a7a6-0bb5-4994-b1bb-bc8494695cbb_Basecall_2D_2d Number of segments 7

and you can grep for the pattern 'Consensus called' with some commands as below:

paste <(grep 'Consensus called' YourLogFile) <(grep '^>' YourConsensus.fa )