cov-lineages / pangolin

Software package for assigning SARS-CoV-2 genome sequences to global lineages.
GNU General Public License v3.0

classification differences pangolin 3/4 w/ same pangoLEARN expected? #416

Closed rgerhards closed 2 years ago

rgerhards commented 2 years ago

I see large differences in assignments when using pangoLEARN "version" 2022-03-22 under pangolin 3 vs 4. Is this expected behavior? A small sample with variant counts for the same FASTA set:

[image: tmp]

A google sheet with more select differences is available here: https://docs.google.com/spreadsheets/d/16vpQyPn5Hpbczboib_BjSlpiW74ee_urDVCFQuX49Qg/edit#gid=0

I compare two pangolin runs:

  1. the official run done at Germany's Robert Koch Institute (RKI) and published at https://github.com/robert-koch-institut/SARS-CoV-2-Sequenzdaten_aus_Deutschland - they use pangolin 3.12.0
  2. a run I did myself using pangolin 4.0.2, results published at https://github.com/rgerhards/DESH-recomputed-RG

Both runs use pangoLEARN mode and (as far as I can tell from the RKI csv file) the same pangoLEARN model. As a wild guess, could the different results be related to the new random forest model, and therefore be expected?
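For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the two runs can be diffed. It assumes both results are available as CSV with one row per sequence and the column names `taxon` and `lineage` used by pangolin's `lineage_report.csv`; the RKI file may name its columns differently, so `id_col` and `lineage_col` are parameters. The helper names and the sample rows are made up for illustration and are not real assignments.

```python
import csv
import io
from collections import Counter


def load_assignments(csv_text, id_col="taxon", lineage_col="lineage"):
    """Map sequence ID -> assigned lineage from pangolin-style CSV text.

    The column names are assumptions; adjust them to the file at hand.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[id_col]: row[lineage_col] for row in reader}


def compare_runs(run_a, run_b):
    """Count (lineage_a, lineage_b) pairs that differ among shared sequences."""
    shared = run_a.keys() & run_b.keys()
    changes = Counter(
        (run_a[seq], run_b[seq]) for seq in shared if run_a[seq] != run_b[seq]
    )
    return len(shared), changes


# Tiny made-up sample, for illustration only (not real assignments).
V3_CSV = "taxon,lineage\nseq1,BA.1\nseq2,BA.1.1\nseq3,BA.2\n"
V4_CSV = "taxon,lineage\nseq1,BA.1\nseq2,BA.1\nseq3,BA.2\nseq4,BA.2\n"

run_v3 = load_assignments(V3_CSV)
run_v4 = load_assignments(V4_CSV)
n_shared, changes = compare_runs(run_v3, run_v4)

print(f"{n_shared} shared sequences, {sum(changes.values())} differ")
for (old, new), count in changes.most_common():
    print(f"{old} -> {new}: {count}")
```

Only sequences present in both runs are compared, so a sequence that appears in just one result set does not count as a difference.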

Disclaimer: I am far from being a bioinformatics pro (just some basic courses); I am a close-to-system-level software developer whose field of expertise is computer systems logging and parallelization. I am investigating the evolution of CoV as a citizen "science" project. I know how to check the validity of my pipeline, and I reached out to others to make sure this issue seems valid. However, I may still be missing some obvious point.

If such differences between running the same pangoLEARN on v3 vs. v4 exist, it would be a good idea to mention them in the changelog or a similar place.

aineniamh commented 2 years ago

Yes, they're entirely different models: pangoLEARN in pangolin 3.0 is a decision tree model and pangoLEARN in pangolin 4.0 is a random forest model (greater memory requirements, but more robust), so differences are expected (see more info here: https://cov-lineages.org/resources/pangolin/pangolearn.html).

rgerhards commented 2 years ago

thx for the clarification. Was just surprised by the magnitude.