denglab / SeqSero2

SeqSero2
Other
33 stars 18 forks source link

Outcome of O2/O9 comparison can be random #46

Open VanOverbeeke opened 2 years ago

VanOverbeeke commented 2 years ago

In some of our WGS datasets, SeqSero2 (v1.1.1 using the default microassembly approach) returns O2 or O9 seemingly at random:

Sample Predicted antigenic profile Predicted serotype
replicate_a_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_b_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana
replicate_c_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_d_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana
replicate_e_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana
replicate_f_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_g_1.trimmed.fastq.gz 2:l,z28:1,5 I 2:l,z28:1,5
replicate_h_1.trimmed.fastq.gz 9:l,z28:1,5 Javiana

We have traced the issue back to this decision block: https://github.com/denglab/SeqSero2/blob/master/bin/SeqSero2_package.py#L740.

The contigs stored in the special_genes dictionary are 3 relatively short contigs for O2, and 3 for O9. The contigs and the scores are identical in each run. However, when the scores for O2 and O9 hits are being compared on line 745 (see https://github.com/denglab/SeqSero2/blob/master/bin/SeqSero2_package.py#L745), only the last value of O2 is compared to the last value of O9. However, due to the random nature of iterating over dictionaries in Python3, the 'last' value in the special_genes dictionary is not always the highest. special_genes dictionary is random (as is known for some/most versions of Python3).

In the case of multiple short contigs for the O gene, that match both O2 and O9, this means that a higher scoring contig can be passed first in the iteration, followed by a lower scoring contig. The second score then replaces the first, even though the first score was higher. In a different SeqSero2 run, the lower score might be passed first in the iteration, followed by the higher score, leading to a different pair of scores to be compared on line 745 in the script.

I propose the following solution:

LSTUGA commented 2 years ago

Sorry for the late response. Could you please share the accession numbers of your questionable genomes. We will look into this issue.

VanOverbeeke commented 2 years ago

Hi, no problem. The issue occurs with FASTQ input of two of our own samples, in a custom Docker environment. I can share the anonymized samples and the Docker image over a filesharing platform. Do you have a preference for which platform?

LSTUGA commented 2 years ago

Hi, a Docker image would be fine. Just let me know how to access your samples. Thanks!

VanOverbeeke commented 2 years ago

Hi,

The image is here: https://git.wur.nl/overb015/public/container_registry/872 The data is here: https://filesender.surf.nl/?s=download&token=01f77bbd-497d-451a-84a2-227b76cc1d37 The command we use is: """bash

    PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/SPAdes-3.9.0-Linux/bin:/SalmID-0.11:/sratoolkit.2.8.0-ubuntu64/bin

    SeqSero2_package.py \            -p 1 \            -t 2 \
  -d "${samplename}" \            -i "${forward}" "${reverse}"

"""

Please let me know if you need any help.

Kind regards, Lennert

On Fri, Jul 22, 2022 at 3:17 PM LSTUGA @.***> wrote:

Hi, a Docker image would be fine. Just let me know how to access your samples. Thanks!

— Reply to this email directly, view it on GitHub https://github.com/denglab/SeqSero2/issues/46#issuecomment-1192564378, or unsubscribe https://github.com/notifications/unsubscribe-auth/AHUA2TPFCKO6AEDKEIRV2ILVVKNOTANCNFSM5SSPZ6NA . You are receiving this because you authored the thread.Message ID: @.***>

VanOverbeeke commented 2 years ago

One more question, does SeqSero2 really take absolutely raw and completely untrimmed reads as input? We have seen some improvement using completely raw input files.