Validate MiCall on SARS-COV-2 Illumina data

ArtPoon commented 4 years ago

There are presently 34 SARS-COV-2 samples on NCBI generated on an Illumina platform: see https://trace.ncbi.nlm.nih.gov/
There are a variety of approaches being taken to assemble or map these reads to generate a sample consensus sequence
Does MiCall generate the same consensus sequence from the FASTQ data as other groups? (concordance)
Are there links between the FASTQ data on NCBI SRA and consensus sequences in NCBI Genbank or GISAID?

ArtPoon commented 4 years ago

Processed SRR11140746 from NCBI SRA. This is annotated as SARS-CoV-2/2019-nCoV/USA-WI-1/2020, passaged once on Vero E6 cells. I compared the MiCall consensus sequence to hCoV-19/USA/WI1/2020|EPI_ISL_408670 from GISAID. There are a total of 8 differences:

position	MiCall	GISAID
20302	A	-
20303	A	-
20304	G	-
29872	t	A
29878	c	A
29880	-	A
29881	-	A

Here are screen caps of the two regions: problem1

problem2

I am not too concerned about the last four because they appear in a low-coverage region (By default, MiCall-Lite reports bases in lower case when coverage is below 100 reads.)

However, the insertion of AAG is worrisome. Note that it is in a tandem repeat. Is this in the raw data?

ArtPoon commented 4 years ago

It is not in the raw data:

Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGAAGGCTA
Elzar:data artpoon$ gunzip -c SRR11140746_1.fastq.gz | grep AAGAAGGCTA
GGCGGCTATGCTCTCCTCTACAGTGTAACCATTTAAACCCTGACCCGGGTAAGTGGTTATATAATTGTCTGTTGGCACTTTTCTCAAAGCTTTCGCTAGCATTTCAGTAGTGCCACCAGCCTTTTTAGTAGGTATAACCACAGCAGTTAAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTTCGAACATATCGT
CATTAATTTGCGTGTTTCTTCTGCATGTGCAAGCATTTCTCGCAAATTCCAAGAAACAGTTCCAAGAATTTCTTGCTTCTCATTAGAGATAATAGATGGTAGAATGTAAAAGGCACTTTTACACTTTTTAAGCACTGTCTTTGCCCCCTCTACAGTGTAACCAAATTTAAACCCAGGAGTCAAATGGAAATTGATTTCTTAGAATTAGCTATGGATGAATTCATTGAACGGTATAAAGAAGGCTATGCCTT

Meaning that MiCall is making it up. I'm going to have to trace back through the intermediate outputs to figure out where this is coming from.

ArtPoon commented 4 years ago

It does appear once in the second read file:

Elzar:data artpoon$ gunzip -c SRR11140746_2.fastq.gz | grep AAGAAGAAGGCTA
TTACAAACATTGGCCGCAAATTGCACAATTTGCCCCCAGCGCTTCAGCGTTCTTCGGAATGTCGCGCATTGGCATGGAAGTCACACCTTCGGGAACGTGGTTGACCTACACAGGTGCCATCAAATTGGATGACAAAGATCCAAATTTCAAAGATCAAGTCATTTTGCTGAATAAGCATATTGACGCATACAAAACATTCCCACCAACAGAGCCTAAAAAGGACAAAAAGAAGAAGGCTAATGAAACTCAA

Is MiCall stitching this one insertion into the data?

ArtPoon commented 4 years ago

The plot thickens - I ran the FASTQ data through bowtie2 once and then generated a consensus sequence with samtools, and it also introduced the extra AAG insertion as well as some weird substitutions upstream: problem3

Also samtools doesn't botch the 3' end: problem4

We are not ready for prime time. 😞

ArtPoon commented 4 years ago

Options:

bring MiCall-Lite fork up to date with current MiCall - is there a problem that I introduced or something that got fixed by the CfE dev team?
start debugging MiCall-Lite based on this first test case
write a bespoke pipeline for SARS-COV-2

ArtPoon commented 4 years ago

I got 100% map rate with a single run of bowtie2, local alignment, so I don't think MiCall's iterative approach to mapping is necessary. Will adapt functions from sam2aln.py and aln2counts from cfe-lab/MiCall to write a more efficient consensus sequence-generating script.

ArtPoon commented 4 years ago

Looked at the SAM file with Tablet, where we can see the deletion relative to the NC_045512 reference genome: problem5

Here's a look at the end of the alignment: problem6

ArtPoon commented 4 years ago

@donkirkby can you please help me reproduce this issue with cfe-lab/MiCall? The NCBI SRA accession number is in this issue.

ArtPoon commented 4 years ago

New pipeline is picking the deletion up:

20295,0,0,0,1247,156,1,{}
20296,1244,1,0,0,155,1,{}
20297,1242,0,0,0,155,4,{}
20298,263,0,0,0,155,981,{}
20299,0,0,0,2,156,1018,{}
20300,0,0,1,0,156,1019,{'G': 1}
20301,1183,0,1,0,157,1,{}
20302,0,0,1186,0,156,1,{}
20303,1185,0,0,0,156,2,{}

Closing issue and tracking any further issues in new repo.

donkirkby commented 4 years ago

I'm getting our references set up today, @ArtPoon, and I'll let you know what I see when I run this sample.

ArtPoon commented 4 years ago

Thanks @donkirkby !

donkirkby commented 4 years ago

OK, I got some results out of MiCall, and I'm starting to look through them. First item: your mystery insertion. I see it, but as a minority variant. It's also in a slightly different position from yours. The MAX consensus shows a deletion at that position. Here's a section of the consensus sequences at different cutoffs from 20290 to 20310:

        v20290...   ...20310v
MAX     CGGTAT---AAAGAAGGCTAT
0.01    CGGTATaaarAAGAAGGCTAT
0.02    CGGTATaaaRAAGAAGGCTAT
0.05    CGGTATaaaRAAGAAGGCTAT
0.10    CGGTATaaaRAAGAAGGCTAT
0.20    CGGTATaaaRAAGAAGGCTAT
0.25    CGGTAT---AAAGAAGGCTAT

The lower case codes represent a mixture of nucleotides and deletions. My guess is that MiCall-lite still reports any nucleotide reads over deletions, even if the deletions are over 75% of the reads as in this case. What's probably happening is that this sample has a deletion relative to the reference sequence. Then the reads with ends near the deletion get dragged across the gap and fill it in with bad matches. However, I think you said that MiCall was actually inserting that section during the remapping, so I'll check more closely tomorrow. Hope this is a useful start. Where's the other repo?

ArtPoon commented 4 years ago

Thanks for checking this out @donkirkby - the new repo is set private for now. I'll publish it soon, I just need time to fix the merge_pairs function that I migrated from sam2aln. I know that my version is a lot slower and the refactored version in cfe MiCall is probably a lot faster, but I also really don't like that code. (It introduces NumPy as another dependency, and the author didn't bother to document any of their code. barf!)

ArtPoon commented 4 years ago

I apologize @donkirkby, I see that we're the only contributors to that code - but why no inline comments? Throw me a bone! :smile:

donkirkby commented 4 years ago

The programmer that has annoyed me the most throughout my career has always been myself from six months earlier. Would you like to go through the code together tomorrow morning over Google hangouts? I'm free at 10am PDT.

ArtPoon commented 4 years ago

Fair enough - I've written some real pigs as a perennial amateur. I'm teaching a bioinformatics lab (ironically) tomorrow so I can't meet, unfortunately. I have some basic ideas of how to make my original code a lot faster though. May pester you for some sage wisdom.

donkirkby commented 4 years ago

You might not need such drastic changes, @ArtPoon. I think the reason you're not getting a clean deletion is that you're not counting deletions when building the consensus:

    intermed = self.counts.most_common()

    # Remove gaps and low quality reads if there is anything else.
    for i in reversed(range(len(intermed))):
        nuc, _count = intermed[i]
        if nuc in ('N', '-') and len(intermed) > 1:
            intermed.pop(i)

    total_count = sum(self.counts.values())
    mixture = []
    min_count = (intermed[0][1]
                 if mixture_cutoff == MAX_CUTOFF
                 else total_count * mixture_cutoff)
    # filter for nucleotides that pass frequency cutoff
    for nuc, count in intermed:
        if count >= min_count:
            mixture.append(nuc)

Here's the equivalent code in our version:

    mixture = []
    for nuc, count in self.counts.most_common():
        if count < min_count:
            break
        if nuc in self.COUNTED_NUCS:
            mixture.append(nuc)
            if mixture_cutoff == MAX_CUTOFF:
                # Catch any ties before breaking out.
                min_count = count

donkirkby commented 4 years ago

Where did you get the GISAID consensus sequence? I'd like to compare my results.

ArtPoon commented 4 years ago

Thanks @donkirkby - I figured I had some legacy code in my branch that had been changed since the fork, but I simply didn't have the bandwidth to search for it this week. I'm pretty much finished the other pipeline anyhow, and I really don't think iterative remapping is necessary, so I'm probably going to ahead with validating my pared down version of MiCall.

I registered for GISAID and then searched for a sequence with a matching descriptor.

donkirkby commented 4 years ago

I registered for GISAID and found Accession EPI_ISL_408670, but it took me a while to figure out that the descriptions in the SRA abstract for SRR11140746 (SARS-CoV-2/2019-nCoV/USA-WI-1/2020) loosely match the virus name in GISAID for EPI_ISL_408670 (hCoV-19/USA/WI1/2020). Thanks for the clues!

After all that, I can say that MiCall generates a matching consensus to EPI_ISL_408670, for positions from 114 to 29835. The 113 bases before that and the 44 bases after had fewer than 100 reads, and weren't reported.

ArtPoon commented 4 years ago

Sorry @donkirkby - I didn't mean to obscure the source. I should have provided the EPI identifier.

ArtPoon commented 4 years ago

Since the primary objective is to obtain a consensus sequence rather than to sample at depth (i.e., to detect minor variants), shouldn't MiCall use a lower reporting threshold?

donkirkby commented 4 years ago

You did provide the EPI identifier, @ArtPoon. That's the clue I was thanking you for! I was just struggling to figure out how you knew they were the same sample.

I'll discuss the reporting threshold with the team. Thanks for the suggestion!

ArtPoon commented 4 years ago

Published new repo at http://github.com/PoonLab/sam2conseq

PoonLab / MiCall-Lite

Validate MiCall on SARS-COV-2 Illumina data #42