denglab / SeqSero2


Subspecies determination #37

Open VT-20 opened 3 years ago

VT-20 commented 3 years ago

I am looking for clarification on the following: 1) How are subspecies determined in the judge_subspecies functions (in both allele and k-mer mode)? It looks like only the forward reads of the sample are used to predict the subspecies: they are matched against the SalmID database in “a” mode, while Special_dict is used in k-mer mode. Is that correct?

2) What does special_dict (in the k-mer workflow) contain? The target k-mer sequences matched against the invA gene sequences of the database, or something else?

3) It seems that SalmID has two genes (invA and rpoB) for Salmonella species/subspecies, but the k-mer database only contains an invA dictionary along with H, O, and specific genes. Why is the rpoB dictionary not included in the k-mer database?

4) I observed that SRR3418287 is mispredicted as II 18:z4,z23:- instead of IIIa 18:z4,z23:-. Is there a specific reason for this?

Thanks for your time.

hcdenbakker commented 3 years ago

Hi VT-20, I am the developer of the original SalmID, so I can answer some of your questions, but not necessarily those pertaining to how SalmID is integrated into SeqSero2.

1) SalmID does not use both forward and reverse reads, because the information contained in the forward reads is usually enough to accurately predict the subspecies.

2 & 3) The original SalmID uses target k-mer sequence matching for both invA and rpoB. rpoB is not variable enough within S. enterica to be included in the subspecies identification; I included it to serve as a marker for preliminary species ID.

4) I analyzed SRR3418287 with SalmID, and it looks like this accession contains very low coverage subsp. II (1x) and high coverage subsp. IIIa (40x). Interestingly, a higher percentage of subsp. II-specific invA k-mers is matched (100%) than of subsp. IIIa-specific k-mers (87%, high but lower). I assume SeqSero2 does not use coverage to flag possible contamination like this.

Hope this answers some of your questions.
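For readers who want to see the two quantities mentioned in point 4 side by side, here is a minimal sketch of how "percentage of marker k-mers matched" and "k-mer depth" could be computed from forward reads. It is not SalmID's or SeqSero2's actual code: the function names `kmers` and `marker_stats`, the forward-strand-only matching, and the use of the median count as a depth estimate are all assumptions made for illustration.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def marker_stats(reads, marker_kmers, k):
    """Return (percent of marker k-mers seen, median count of the seen ones).

    Hypothetical helper: marker_kmers is a set of subspecies-specific
    k-mers, reads is an iterable of read sequences.
    """
    counts = Counter()
    for read in reads:
        for km in kmers(read, k):
            if km in marker_kmers:
                counts[km] += 1
    if not marker_kmers:
        return 0.0, 0
    pct = 100.0 * len(counts) / len(marker_kmers)
    # Median count of the matched k-mers as a rough depth estimate.
    depth = median(counts.values()) if counts else 0
    return pct, depth
```

A marker can be fully matched at very low depth, which is why the two numbers can point to different subspecies in a mixed sample.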

LSTUGA commented 3 years ago

@hcdenbakker Many thanks for answering the questions! @VT-20 SeqSero2 does not flag contamination like this. When markers for more than one subspecies are detected, SeqSero2 only reports the one with higher coverage, which is subsp. II in your case.
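To make the SRR3418287 case concrete, the sketch below plugs in the figures reported earlier in this thread (subsp. II: 100% of invA k-mers matched at about 1x; subsp. IIIa: 87% at about 40x) and shows that the call depends on which metric is ranked. This is an illustration only, not SeqSero2's implementation: `SubspHit`, `pick_by_pct`, `pick_by_depth`, `possible_mixture`, and the 80% threshold are hypothetical, and the thread does not confirm which metric SeqSero2 ranks on internally.

```python
from typing import NamedTuple


class SubspHit(NamedTuple):
    name: str
    pct_kmers: float  # percent of subspecies-specific invA k-mers matched
    depth: float      # approximate read depth over those k-mers


def pick_by_pct(hits):
    """Report the subspecies with the highest fraction of k-mers matched."""
    return max(hits, key=lambda h: h.pct_kmers).name


def pick_by_depth(hits):
    """Report the subspecies whose marker k-mers are seen at highest depth."""
    return max(hits, key=lambda h: h.depth).name


def possible_mixture(hits, min_pct=80.0):
    """Flag a sample when more than one subspecies marker is well matched.

    min_pct is an arbitrary illustrative threshold, not a validated cutoff.
    """
    return sum(1 for h in hits if h.pct_kmers >= min_pct) > 1


# Values taken from the SalmID analysis reported in this thread.
hits = [SubspHit("II", 100.0, 1.0), SubspHit("IIIa", 87.0, 40.0)]

print(pick_by_pct(hits))       # II
print(pick_by_depth(hits))     # IIIa
print(possible_mixture(hits))  # True
```

Ranking by percentage of k-mers matched reproduces the subsp. II call, ranking by depth gives IIIa, and a simple check on the number of well-matched markers would flag the sample as possibly mixed.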