rgerhards closed this issue 2 years ago
Yes, they're entirely different models: pangoLEARN in pangolin 3.0 is a decision tree model and pangoLEARN in pangolin 4.0 is a random forest model (greater memory requirements, but more robust), so differences are expected (see more info here: https://cov-lineages.org/resources/pangolin/pangolearn.html).
thx for the clarification. Was just surprised by the magnitude.
I see large differences in assignments when using pangoLEARN "version" 2022-03-22 under pangolin 3 vs 4. Is this expected behavior? A small sample with variant counts for the same FASTA set:
A google sheet with more select differences is available here: https://docs.google.com/spreadsheets/d/16vpQyPn5Hpbczboib_BjSlpiW74ee_urDVCFQuX49Qg/edit#gid=0
I compare two pangolin runs:
Both runs use pangoLEARN mode and (as far as I can tell from the RKI csv file) the same pangoLEARN model. As a purely wild guess, could the different results be related to the new random forest model, and as such be expected?
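For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how two pangolin reports for the same FASTA set could be diffed. It assumes both reports are CSV files with `taxon` and `lineage` columns (which, as far as I know, is what pangolin's lineage report contains); the function names are made up for illustration and are not part of pangolin.

```python
import csv
from collections import Counter


def read_lineages(lines):
    """Map each sequence name to its assigned lineage.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    Assumption: the report has 'taxon' and 'lineage' columns.
    """
    return {row["taxon"]: row["lineage"] for row in csv.DictReader(lines)}


def compare_assignments(old, new):
    """Count lineage changes for sequences present in both runs.

    Returns the number of shared sequences and a Counter keyed by
    (lineage_in_old_run, lineage_in_new_run) for sequences whose
    assignment differs between the two runs.
    """
    shared = old.keys() & new.keys()
    changes = Counter(
        (old[taxon], new[taxon]) for taxon in shared if old[taxon] != new[taxon]
    )
    return len(shared), changes
```

To use it, open the two report files (with `newline=""`), pass each handle to `read_lineages`, and hand the resulting dictionaries to `compare_assignments`; `changes.most_common(10)` then lists the most frequent reassignments between the two runs.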
Disclaimer: I am very far from being a bioinformatics pro (just some basic courses); I am actually a close-to-system-level software developer whose field of expertise is computer systems logging and parallelization. I am investigating the evolution of CoV as a citizen "science" project. I know how to check the validity of my pipeline and reached out to others to confirm that this issue seems to be valid. However, I may still be missing some obvious point.
If such differences between running the same pangoLEARN on v3 vs. v4 exist, it would be a good idea to mention this in the changelog or a similar place.