cov-lineages / pangoLEARN

Store of the trained model for pangolin to access.
GNU General Public License v3.0
55 stars 13 forks source link

decision_tree_rules.txt rules seem wrong #13

Open danielmoney opened 3 years ago

danielmoney commented 3 years ago

I've been investigating why pangolin was calling some B.1.617 sequences incorrectly as other lineages and as part of this have been looking at the decision_tree_rules.txt file. I ended up comparing the actual decision tree (loaded using joblib) with what was in decision_tree_rules.txt and there seems to be a mismatch.

I'm pretty certain that there is an off-by-one error somewhere. decision_tree_rules has the following as the first few decisions for the first B.1.617.2 rule: B.1.617.2 24505=='G',18423=='-',26800!='C',28882!='A',16888=='-',25504!='C',24075!='A' where as I believe it should be: B.1.617.2 24505=='T',18423=='A',26800!='G',28882!='C',16888=='A',25504!='G',24075!='C'

Given that the bases are ordered -,A,C,G,T in the one hot encoding it seems that every base should be one "higher". Presumably this also means the position is wrong when the actual test is on - as this would, I believe, show in the decision tree as a T at the previous position. It does appear that this affects all the rules not just that B.1.617.2 rule.

Given this is an off by one error I'm wondering if this is related to "lineage" being the first entry in the header array?

I realise that pangolin doesn't actually use decision_tree_rules.txt so this doesn't affect the calls made by pangolin but it will affect anyone trying to work out why pangolin made the call it did.

(As an aside the incorrect calling was due to the presence of an unknown base at a key location which pangolin assigned to be the reference base and which in turn led the assigner into the wrong part of the decision tree).

jaquol commented 3 years ago

We observed the same while trying to determine which mutations define that lineage.

Regardless of the issue, is decision_tree_rules.txt the best place to systematically determine which mutations define each lineage?

danielmoney commented 3 years ago

My view on this is that no, it's not (but note I'm not a member of the pangolin team) for two reasons: a) I believe you're likely to miss multiple mutations that occur on the same branch as, at least in theory, the decision tree would only have to use one mutation to make it's decision. (Although also see b!) b) There's some odd artifacts in the tree which I suspect come about from the replacement of unknown bases with the reference base. What this means is that if, for enough of the samples that is used to train the tree, one of the defining mutations is missing and is replaced with the reference bases then the decision tree could have rules where one of the defining mutations for that lineage is not present (and it calls the lineage based on one of the other defining mutations). I've definitely seen an example where it looks like that is what is happening but I haven't gone and seen what is going on with the learning dataset.

aineniamh commented 3 years ago

I would definitely not use the decision tree for finding which mutations define each lineage, the decision tree is fragile and as @danielmoney said includes imputed data. Its purpose is for pangolin to use for assignments- outbreak.info is a really great resource that lists the defining snps for a given lineage so would recommend that!

jaquol commented 3 years ago

Thank you both @danielmoney @aineniamh for your answers!

niemasd commented 3 years ago

I ran into the same thing when attempting to manually assign a lineage based on the decision tree for a sample that Pangolin assigns to B.1.1.222. Taking the decision tree rules in decision_tree_rules.zip at face value incorrectly assigns the sample to B.1.596, but then applying the proposed fix in the first post in this Issue of interpreting A as C, C as G, etc. now correctly results in B.1.1.222. I concur that it seems as though the symbols in decision_tree_rules.zip are off-by-one

emilyscher commented 3 years ago

I think the issue here might be that you're expecting the genome positions to be 0-index when they're in fact 1-indexed (which is normal for genomes, but not normal for if statements, which I know makes things confusing!). Could you let me know if interpreting the positions as 1-indexed fixes your issue?

danielmoney commented 3 years ago

Unless there's been an update to how this file is produced since my original report then, no, 0/1-indexing of the genome is not the problem. When I made the initial report, and as I say in it, the problem was the fact that the bases are effectively out by one, e.g. it has A instead of C, C instead of G etc. I think that does lead to the genomic position being off by one for decisions that are currently "-" (and only in that case) but that is not the underlying cause. Given the time that has past it would take me a little while to be absolutely certain that this is still the case but from @niemasd comment it seems like it is.

emilyscher commented 3 years ago

Unfortunately I'm still not able to recreate this. Could you possibly post your code?

danielmoney commented 3 years ago

Right lets first get the first feature from decision_tree_rules.txt: tail -n +4 tree_rules.txt | cut -f2 | cut -f1 -d, | uniq (strips the three header lines, then gets the decision rules, then gets the first rule and then returns the uniq values of the first rule). For pangoLEARN commit 34f420f44cc03ae059518516c508185bf2b67d91 this gives us:

16175!='A'
16175=='A'

which clearly suggests the rule is on whether or not 16175 is equal to A.

Now lets do it in code using the joblib files and code I've cobbled together from the pangolin source code:

import joblib

# Load the header and model
header_file = "decisionTreeHeaders_v1.joblib"
model_file = "decisionTree_v1.joblib"

model_headers = joblib.load(header_file)
loaded_model = joblib.load(model_file)

# Load the tree and get the feature id of the first node
tree = loaded_model.tree_
f = tree.feature[0]

# Taken from pangolin/pangolearn/pangolearn.py lines 135, 139-140
# Unless I'm much mistaken this takes a feature id and converts it into
# position_base (seems to be used as column headings for the dataframe that
# is passed to the model prediction)
indiciesToKeep = model_headers[1:]
categories = ['-','A', 'C', 'G', 'T']
columns = [f"{i}_{c}" for i in indiciesToKeep for c in categories]

# Prints the "column" id of the first feature
# This is 16175_C for pangloLEARN commit
# 34f420f44cc03ae059518516c508185bf2b67d91
print(columns[f])

This clearly suggests the feature is actually whether or not 16175 is equal to C (not A).

Looking at a few of these you see the pattern emerge that I mention in my first post. I've not yet been able to track down the actual bug in the code as I don't know the code base well enough.

niemasd commented 3 years ago

@emilyscher Are you sure the incides in tree_rules.txt within decision_tree_rules.zip are 1-based? The script I wrote can be found here:

https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py

All throughout, I'm using 0-based indices for genome positions, but I have a mapping where, if tree_rules.txt says A, I interpret that as C; if tree_rules.txt says C, I interpret that as G; etc.:

https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py#L25-L31

Here's where I actually use this dictionary:

https://github.com/niemasd/CovidQuant/blob/main/CovidQuant.py#L117

Here's the consensus sequence of the sample I'm playing with (my script is attempting to do some things purely from trimmed BAMs, but the issue can be reproduced with the consensus sequence):

B.1.1.222.txt (Pangolin assigns it to B.1.1.222)

Note that the BAM I'm using in the examples below are mapped to the reference genome that Pangolin uses:

https://github.com/cov-lineages/pangolin/blob/master/pangolin/data/reference.fasta

In the examples below, each line BEST CHOICE: (A, (B, C, D)) can be interpreted as follows:

If I assume 0-based indexing and take the tree_rules.txt symbols at face value (incorrectly yields B.1.596)

Here's the modified script: CovidQuant_1.py

BEST CHOICE: (6170, (16175, '!=', 'A'))
BEST CHOICE: (1123, (18423, '!=', '-'))
BEST CHOICE: (3409, (23591, '==', 'A'))
[2021-08-25 07:12:30] Best lineage: B.1.596

If I assume 1-based indexing and take the tree_rules.txt symbols at face value (incorrectly yields B.1.596)

Here's the modified script: CovidQuant_2.py

BEST CHOICE: (6172, (16175, '!=', 'A'))
BEST CHOICE: (1123, (18423, '!=', '-'))
BEST CHOICE: (3411, (23591, '!=', 'A'))
BEST CHOICE: (1833, (26304, '!=', 'C'))
BEST CHOICE: (490, (25243, '!=', 'G'))
BEST CHOICE: (1244, (11112, '!=', 'G'))
BEST CHOICE: (4159, (25562, '!=', 'C'))
BEST CHOICE: (854, (14023, '==', 'G'))
[2021-08-25 07:18:54] Best lineage: B.1.596

If I assume 0-based indexing and rotate the symbols in tree_rules.txt (correctly yields B.1.1.222)

By "rotate the symbols", I mean the following:

Here's the modified script: CovidQuant_3.py (which is equivalent to CovidQuant.py, but with additional print statements to show the decisions)

BEST CHOICE: (6173, (16175, '!=', 'C'))
BEST CHOICE: (1123, (18423, '==', 'A'))
BEST CHOICE: (346, (21986, '!=', 'A'))
BEST CHOICE: (1908, (26800, '!=', 'G'))
BEST CHOICE: (5816, (28882, '==', 'C'))
BEST CHOICE: (530, (23400, '!=', 'A'))
BEST CHOICE: (4838, (28168, '!=', 'G'))
BEST CHOICE: (1059, (5621, '!=', 'T'))
BEST CHOICE: (109, (24087, '==', 'T'))
BEST CHOICE: (1744, (23730, '!=', 'T'))
BEST CHOICE: (571, (5699, '!=', 'A'))
BEST CHOICE: (1979, (9071, '!=', 'T'))
BEST CHOICE: (416, (2341, '!=', 'G'))
BEST CHOICE: (5279, (15971, '==', 'A'))
BEST CHOICE: (30, (27298, '!=', 'C'))
BEST CHOICE: (343, (23755, '==', 'G'))
BEST CHOICE: (1173, (11116, '!=', 'G'))
BEST CHOICE: (5826, (28876, '==', 'A'))
BEST CHOICE: (9765, (27993, '==', 'A'))
BEST CHOICE: (346, (24196, '!=', 'T'))
BEST CHOICE: (1144, (25349, '!=', '-'))
BEST CHOICE: (2212, (14768, '!=', 'C'))
BEST CHOICE: (144, (4257, '==', 'T'))
BEST CHOICE: (5962, (27505, '==', 'G'))
BEST CHOICE: (3759, (7524, '!=', 'G'))
BEST CHOICE: (108, (25905, '!=', 'T'))
BEST CHOICE: (23, (21573, '!=', 'C'))
[2021-08-25 07:26:26] Best lineage: B.1.1.222

TL;DR

I'm fairly certain that tree_rules.txt within decision_tree_rules.zip is indeed using 0-based indices for genome positions, but that the symbols in the rules are incorrectly rotated by 1

danielmoney commented 3 years ago

Based on @niemasd comments it would seem I may well be wrong in my assumption that "-" would also have the wrong genome position (although I did say I'd never tested it).

niemasd commented 3 years ago

@danielmoney In this specific sample I'm playing with, there happen to not be any gaps at any of the positions in the decision tree conditions that are reached, nor +/- 1 position over (to account for off-by-one in indices), so I think the - corner case is simply not being hit in this specific dataset I'm using. So I'm not sure if your assumption is right or wrong based solely on my dataset (if I'm understanding what you mean)