cfe-lab / MiCall

Pipeline for processing FASTQ data from an Illumina MiSeq to genotype human RNA viruses like HIV and hepatitis C
https://cfe-lab.github.io/MiCall
GNU Affero General Public License v3.0

Test more SARS-CoV-2 samples #555

Open donkirkby opened 4 years ago

donkirkby commented 4 years ago

After finishing the SARS-CoV-2 support in #549, do more extensive testing with published sample data. List of samples to download and the toolkit to download with.

Find more samples from SRA by searching for "Severe acute respiratory syndrome-related coronavirus"[orgn:__txid694009]. You can filter by platform, and there are currently 466 Illumina records.
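The same search could probably be scripted rather than run in the browser. The snippet below is a minimal sketch that only builds a request URL for NCBI's E-utilities `esearch` endpoint; it assumes the organism term quoted above is accepted verbatim by `esearch`, which hasn't been checked here. The name `build_sra_search_url` is made up for illustration, and the platform filter is left out because the exact field name isn't given in this thread.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Organism term quoted from the comment above (assumed to work unchanged in esearch).
SARS_COV_2_TERM = (
    '"Severe acute respiratory syndrome-related coronavirus"'
    "[orgn:__txid694009]"
)


def build_sra_search_url(term=SARS_COV_2_TERM, retmax=20):
    """Return an esearch URL that looks up `term` in the SRA database."""
    params = {"db": "sra", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Prints the URL only; fetching it is left to the reader.
    print(build_sra_search_url())
```

Building the URL separately from fetching it makes it easy to paste into a browser first and confirm that the record count matches the web search.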

It can be tricky to find the published consensus sequences for a sample. I registered for GISAID and found Accession EPI_ISL_408670, but it took me a while to figure out that the descriptions in the SRA abstract for SRR11140746 (SARS-CoV-2/2019-nCoV/USA-WI-1/2020) loosely match the virus name in GISAID for EPI_ISL_408670 (hCoV-19/USA/WI1/2020).

Art's advice:

I queried the SRR number in the NCBI SRA database to get the sample description and then searched for a similar description in the GISAID annotations. Not perfect, I know.
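The loose matching described above can be partly automated. This is a rough sketch, not something MiCall provides: it strips the virus labels and punctuation from each name and ranks candidate GISAID names by string similarity using `difflib` from the standard library. The label set and the function names `normalize_name` and `rank_candidates` are assumptions based only on the two names quoted in this comment.

```python
import re
from difflib import SequenceMatcher

# Labels for the virus itself, which carry no sample information.
# Assumption: this set only covers the two names quoted in the comment.
VIRUS_LABELS = {"SARS-COV-2", "2019-NCOV", "HCOV-19"}


def normalize_name(name):
    """Drop virus labels, then keep only upper-case letters and digits."""
    parts = [p for p in name.upper().split("/") if p not in VIRUS_LABELS]
    return re.sub(r"[^A-Z0-9]", "", "".join(parts))


def rank_candidates(sra_name, gisaid_names):
    """Sort GISAID names from most to least similar to the SRA name."""
    target = normalize_name(sra_name)

    def score(candidate):
        return SequenceMatcher(None, target, normalize_name(candidate)).ratio()

    return sorted(gisaid_names, key=score, reverse=True)


if __name__ == "__main__":
    sra_name = "SARS-CoV-2/2019-nCoV/USA-WI-1/2020"
    # The first candidate is a made-up name, included only as a decoy.
    candidates = ["hCoV-19/Example/AB12/2020", "hCoV-19/USA/WI1/2020"]
    print(rank_candidates(sra_name, candidates)[0])
```

A ranking like this only narrows the search; the top hit still needs a manual check against the sample description.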

donkirkby commented 4 years ago

@dmacmillan, the code to test is currently on the MultiuseDocker branch.

dmacmillan commented 4 years ago

I've run the following samples through MiCall via Docker on Windows 10 Home successfully!

| Sample | Time (m) |
| --- | --- |
| SRR11593354 | 192 |

donkirkby commented 4 years ago

That's great, @dmacmillan! Have you found a consensus sequence to compare it to?

cbrumme commented 4 years ago

The sample ID is "NRW-011", so try GISAID accession "EPI_ISL_414507".

dmacmillan commented 4 years ago

I am waiting on a confirmation email so that I can search GISAID.

dmacmillan commented 4 years ago

I found another sample/consensus sequence pair. I'll keep track of the ones that I have found in this comment:

| Sample | Consensus | Time (m) |
| --- | --- | --- |
| SRR11593354 | EPI_ISL_414507 | 192 |
| SRR11578347 | EPI_ISL_427026 | Not run |
| SRR11578346 | EPI_ISL_426898 | Not run |
| SRR10903401 | EPI_ISL_414507 | Not run |

Pre-existing Table

| Run | Compared to | Differences |
| --- | --- | --- |
| SRR11593354_1.fastq | EPI_ISL_414507 | 0 mismatches, 0 missing, and 648 added out of 29225. |
| SRR11593355_1.fastq | EPI_ISL_414574 | 0 mismatches, 0 missing, and 435 added out of 29438. |
| SRR11593356_1.fastq | EPI_ISL_414509 | 1 mismatches, 0 missing, and 91 added out of 29782. |
| SRR11593357_1.fastq | EPI_ISL_414508 | 0 mismatches, 0 missing, and 395 added out of 29490. |
| SRR11593358_1.fastq | EPI_ISL_414506 | 0 mismatches, 0 missing, and 887 added out of 28933. |
| SRR11593359_1.fastq | EPI_ISL_414505 | 0 mismatches, 0 missing, and 92 added out of 29782. |
| SRR11593360_1.fastq | EPI_ISL_414504 | 0 mismatches, 0 missing, and 447 added out of 29426. |
| SRR11593361_1.fastq | EPI_ISL_414499 | 2 mismatches, 0 missing, and 144 added out of 29782. |
| SRR11593362_1.fastq | EPI_ISL_414498 | 0 mismatches, 0 missing, and 384 added out of 29490. |
| SRR11593364_1.fastq | EPI_ISL_414497 | 0 mismatches, 0 missing, and 65 added out of 29779. |
| SRR11593365_1.fastq | EPI_ISL_413488 | 10 mismatches, 0 missing, and 145 added out of 29746. |
| SRR11578341 | EPI_ISL_426901 | 2 mismatches, 1 missing, and 617 added out of 29249. |
| SRR11578342 | EPI_ISL_426900 | 1 mismatches, 0 missing, and 398 added out of 29286. |
| SRR11578343 | EPI_ISL_426899 | 0 mismatches, 0 missing, and 429 added out of 29462. |
| SRR11578344 | EPI_ISL_426899 | 15 mismatches, 2 missing, and 414 added out of 29462. |
| SRR11578345 | EPI_ISL_426656 | 8 mismatches, 17 missing, and 398 added out of 29498. |
| SRR11578346 | EPI_ISL_426898 | 0 mismatches, 0 missing, and 488 added out of 29315. |
| SRR11578347 | EPI_ISL_427026 | 0 mismatches, 0 missing, and 148 added out of 29676. |
| SRR11578348 | EPI_ISL_427025 | 1 mismatches, 1 missing, and 452 added out of 29411. |
| SRR11578349 | EPI_ISL_427024 | 1 mismatches, 0 missing, and 564 added out of 29301. |
| SRR10903401-SARS_S1 | MN988669.1 | Very good: 12 mismatches in the first 24 bases under low coverage, and 21 extra A's at the end out of 29881. |
| SRR10903402-SARS_S2 | MN988668.1 | Almost perfect: 21 extra A's at the end out of 29881. |
| SRR11092056-SARS_S3 | MN996530 | Bad: 899 mismatches, 17761 missing, and 217 added out of 29854. |
| SRR11092057-SARS_S4 | MN996528.1 | Very good: 4 mismatches, 33 missing, and 12 added out of 29891. Missing 14 at the start, a gap of 15 with no coverage at 5397, plus 4 single gaps of no coverage within 20 bases. The mismatches are all in low coverage, 3 are mixtures when coverage is 2. 12 extra A's at the end. |
| SRR11092058-SARS_S5 | MN996527.1 | Bad: lots of sections with no coverage. 38 mismatches, 7606 missing, and 26 added out of 29825. |
| SRR11092064-SARS_S6 | MN996531.1 | Bad: lots of sections with no coverage. 24 mismatches, 4667 missing, and 33 added out of 29857. |
| SRR11140744-SARS_S7 | EPI_ISL_408670 | Almost perfect: 28 missing from the start, and poly-A tail replaced with ACAGATATATACGCC out of 29879. |
| SRR11140746-SARS_S8 | EPI_ISL_408670 | Almost perfect: poly-A tail replaced with AATAWMAACAAACAGAGCCTAAAAAGGACAAAA4 out of 29879. |
| SRR11140748-SARS_S9 | EPI_ISL_408670 | Almost perfect: 6 missing from poly-A tail out of 29879. |
| SRR11140750-SARS_S10 | EPI_ISL_408670 | Almost perfect: 9 missing from the start, and poly-A tail replaced with ACAATTGCAACAATC out of 29879. |
| SRR11177792-SARS_S11 | MT072688 | Almost perfect: 57 added out of 29811. A few added to start, most added at end: AGTGCTGAG + poly-A tail. |
| SRR11314339-SARS_S12 | MT192765 | Almost perfect: 38 added out of 29829. A few added to start, most added at end: CCATGTGATTTTAATAG + poly-A tail. |

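
Most of the difference summaries in the table follow the wording "N mismatches, N missing, and N added out of N", so they can be tallied with a short script instead of by eye. This is a minimal sketch; the pattern is inferred from the rows above and is not taken from MiCall's code, and `parse_summary` is a hypothetical helper name.

```python
import re

# Wording inferred from the table rows, e.g.
# "0 mismatches, 0 missing, and 648 added out of 29225."
SUMMARY_PATTERN = re.compile(
    r"(\d+) mismatch(?:es)?, (\d+) missing, and (\d+) added out of (\d+)"
)


def parse_summary(text):
    """Return the four counts from a difference summary, or None if absent."""
    match = SUMMARY_PATTERN.search(text)
    if match is None:
        return None
    mismatches, missing, added, total = map(int, match.groups())
    return {
        "mismatches": mismatches,
        "missing": missing,
        "added": added,
        "total": total,
    }


if __name__ == "__main__":
    print(parse_summary("0 mismatches, 0 missing, and 648 added out of 29225."))
```

Rows with free-text descriptions, such as the "Almost perfect" ones, return `None` and still need to be read by hand.
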
dmacmillan commented 4 years ago

@cbrumme @donkirkby I couldn't find a reference for sample SRR11578344. Any ideas? If not, I can find another.