MDU-PHL / pango-watch

https://mdu-phl.github.io/pango-watch/

inconsistencies between the readme and the changes in the tree data #2

Closed ghost closed 1 year ago

ghost commented 1 year ago

1) the readme has recombinants, the tree (data.json) doesn't

I first noticed this when I saw that the readme lists XBG, but I can't find it when searching through data.json.

More obviously: In commit 36191b4 there's

2022-11-17

+ XBH Recombinant lineage

Yet the tree remains unchanged; it was last changed the day before, in commit 0524792.

ghost commented 1 year ago

It seems (1) is "by design":

https://github.com/MDU-PHL/pango-watch/blob/main/app.py

    if lineage.startswith('X'): # remove recombinants 
        continue

So probably the solution is documenting it?

ghost commented 1 year ago

2) In 3edf96d readme lists

2022-12-03
     + [BQ.1.1.23](https://cov-lineages.org/lineage.html?lineage=BQ.1.1.23)

but AFAIK BQ.1.1.23 was introduced in early November, followed since then by BQ.1.1.24, BQ.1.1.25 and BQ.1.1.26, so the only new one should be BQ.1.1.27?

https://github.com/cov-lineages/pango-designation/commits/master/lineage_notes.txt

ghost commented 1 year ago

At least partially solved: recombinants have been included in the tree since https://github.com/MDU-PHL/pango-watch/commit/a20648c7821fdb2bb55392243e8581040e6ff304

ghost commented 1 year ago

But trying as an example XBB: Per https://github.com/cov-lineages/pango-designation/blob/master/lineage_notes.txt it is "XBB Recombinant lineage of BJ.1 and BA.2.75 with breakpoint in S1, found in USA and Singapore, from issue #1058"

But:

[image]

Its child list starts with [ B.1.1.529.2.10.1 (BA.2.10.1),

Its child list: [ B.1.1.529.2.10.1.1 (BJ.1), XBB <BA.2.10.1+BM.1.1.1>,

I.e. BJ.1 and XBB both appear to be children of BA.2.10.1. XBB then has its own child list

[ XBB.1, 

etc.

Current output of

 pretty-tree.pl | grep "+" | sort
XA <B.1.1+B.1.177>,  
XAA <B.1.1.529+BA.2>,  
XAB <B.1.1.529+BA.2>,  
XAC <B.1.1.529+BA.1>,  
XAD <B.1.1.529+BA.1>,  
XAE <B.1.1.529+BA.1>,  
XAF <B.1.1.529+BA.2>,  
XAG <B.1.1.529+BA.2>,  
XAH <B.1.1.529+BA.1>,  
XAJ <BA.2.12+BA.4>,  
XAK <B.1.1.529+BA.1>,  
XAL <B.1.1.529+BA.2>,  
XAM <BA.1+BA.2.9>,  
XAN <B.1.1.529+BA.5.1>,  
XAP <B.1.1.529+BA.1>,  
XAQ <B.1.1.529+BA.2>,  
XAR <B.1.1.529+BA.2>,  
XAS <B.1.1.529+BA.2>,  
XAT <BA.2.3+BA.1>,  
XAU <BA.1+BA.2.9>,  
XAV <B.1.1.529+BA.5>,  
XAW <B.1.1.529+AY.122>,  
XAY <B.1.617.2+BA.4>,  
XAZ <BA.2+BA.5>,  
XB <B.1+B.1.631>,  
XBA <B.1.617.2+BA.4>,  
XBB <BA.2.10.1+BM.1.1.1>,  
XBC <B.1.1.529+B.1.617.2>,  
XBD <BA.2.75+BF.5>,  
XBE <BA.5+BE.4.1>,  
XBF <BA.5.2+CJ.1>,  
XBG <BA.2+BA.5.2>,  
XBH <BA.2.3+BA.2.75.2>,  
XBJ <BA.2.3+BA.5.2>,  
XBK <BA.5+CJ.1>,  
XBL <XBB+BA.2.75>,  
XBM <BA.2+BF.3>,  
XBN <BA.2+XBB.3>,  
XBP <BA.2+BQ.1>,  
XC <B.1.617.2+B.1.1.7>,  
XD <B.1.617+BA.1>,  
XE <B.1.1.529+BA.2>,  
XF <B.1.617+BA.1>,  
XG <B.1.1.529+BA.2>,  
XH <B.1.1.529+BA.2>,  
XJ <B.1.1.529+BA.2>,  
XK <B.1.1.529+BA.2>,  
XL <B.1.1.529+BA.2>,  
XM <BA.1+BA.2>,  
XN <B.1.1.529+BA.2>,  
XP <BA.1+BA.2>,  
XQ <BA.1+BA.2>,  
XR <BA.1+BA.2>,  
XS <B.1.617+BA.1.1>,  
XT <B.1.1.529+BA.1>,  
XU <B.1.1.529+BA.2>,  
XV <B.1.1.529+BA.2>,  
XW <B.1.1.529+BA.2>,  
XY <B.1.1.529+BA.2>,
XZ <B.1.1.529+BA.1>,

Whereas: "XA Recombinant lineage with parental lineages B.1.1.7 and B.1.177" etc.

So it seems that one of the listed ancestors of each recombinant is currently always wrong in data.json (e.g. for XBB: BA.2.10.1 instead of the BJ.1 listed in lineage_notes.txt).

Wytamma commented 1 year ago

Hey @janko-js! Sorry I missed this issue (GitHub’s notification system is terrible :/)! Thanks so much for pointing this out. I’ll be away for a few weeks but will fix this once I’m back. Will be happy to merge a PR if you have one :)

Wytamma commented 1 year ago

Hi @janko-js, I think it's fixed now! Thanks for spotting. Please reopen if I missed something.

ghost commented 1 year ago

Sorry, it still appears wrong.

[image: pango-watch-bug-2]

Three parents of XBL? XBB, XBB.1 and BA.2.75

But only 2 (XBB.1 and BA.2.75) mentioned in:

https://github.com/cov-lineages/pango-designation/blob/master/lineage_notes.txt

"XBL Recombinant lineage of XBB.1 with S:F486P and BA.2.75, Malaysia, from issue #1532"

https://github.com/cov-lineages/pango-designation/issues/1532

https://github.com/ktmeaton/ncov-recombinant/issues/219

I noticed it when comparing the output of my script

XBL <XBB+BA.2.75>,  

with the line:

"XBL Recombinant lineage of XBB.1 with S:F486P and BA.2.75, Malaysia, from issue #1532"

My script always prints just the two parents, and in this case it extracted the XBB from your json.

Wytamma commented 1 year ago

Ah! Thanks for following up. I think that’s related to multiple XBB.1 in the key. I think a .unique() will fix it! Will try that now
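
A minimal sketch of the de-duplication idea, in Python. The `.unique()` mentioned above suggests a pandas call; the snippet below uses a plain-Python, order-preserving equivalent instead, and the helper name `unique_parents` is hypothetical rather than taken from app.py.

```python
def unique_parents(parents):
    """Drop repeated parents while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(parents))


# A parent list in which XBB.1 appears twice
print(unique_parents(["XBB.1", "BA.2.75", "XBB.1"]))  # ['XBB.1', 'BA.2.75']
```

Keeping the first-seen order (rather than converting to a `set`) avoids introducing a random parent order as a side effect.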

Wytamma commented 1 year ago

Okay I think NOW it is fixed 😅 abce529

ghost commented 1 year ago

Sorry, it's still wrong. You can use my script to generate the text tree from the local data.json and easily compare the info about the recombinants:

$ pretty-tree.pl | grep XBL
XBL <XBB+BA.2.75>,

vs.

"XBL Recombinant lineage of XBB.1 with S:F486P and BA.2.75, Malaysia, from issue #1532"

Looking directly at data.json also shows that XBL is a sibling of XBB.8 and not a child of XBB.1:

[image]

Here is the equivalent information to the pictured part of data.json, after processing data.json with my script. XBB.5 is a child of XBB; so is XBB.6, which has a child XBB.6.1; XBB.7 and XBB.8 are childless; and XBL is (falsely) a child of XBB in that data.json (it should be a child of XBB.1 per https://github.com/ktmeaton/ncov-recombinant/issues/219 and the lineage notes):

XBB <BM.1.1.1+BJ.1>,  

...

XBB.5, 
XBB.6, 
[ XBB.6.1, 
]

XBB.7, XBB.8, 
XBL <XBB+BA.2.75>,  
]

Interestingly

https://github.com/cov-lineages/pango-designation/blob/master/pango_designation/alias_key.json

unexpectedly (to me) has multiple entries for the same parent, but the parent is still XBB.1 and not XBB

"XBL": ["XBB.1","BA.2.75","XBB.1"],

(Tangentially: You can also compare the "before" and "after" of the text tree:

https://github.com/janko-js/variants_text_tree/commit/f022e23158d2e680d16cf64bd4dc0f4648e26888

Note that previously your first parent of XBB was BJ.1, and now you have connected it to BM.1.1.1 first, so the whole "subtree" appears in a different place in that representation. A similar swap happened with XBF (CJ.1+BA.5.2.3 now, BA.5.2.3+CJ.1 before), XBP, etc. I know that both parents are equally important; it's just that leaving the order to chance makes automatic comparisons unnecessarily hard, because in the stored representation the first parent and the other parent look different: one is implicit from the "tree", the other is an attribute.)

Wytamma commented 1 year ago

Ah yes there are a few issues here...

The tree/data.json file is a hack; it is used to generate a D3js hierarchy. The D3js hierarchy has no concept of nodes with multiple parents (i.e. it is a tree, not a graph), so I had to hack the layout to add recombinants. The data.json file therefore won't make sense unless you process the otherParents key correctly. I now generate graph/data.json, which is an actual graph structure and so makes sense for recombinants.
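
To illustrate what processing the otherParents key might look like, here is a hedged Python sketch that flattens a D3-style nested tree into parent-child edges. The node layout (`name`, `children`, and `otherParents` as a list of lineage names) is an assumption made for illustration; check the actual tree/data.json before relying on it. The function name `tree_to_edges` is hypothetical.

```python
def tree_to_edges(node, parent=None, edges=None):
    """Collect (parent, child) pairs, including the extra recombinant parents."""
    edges = [] if edges is None else edges
    if parent is not None:
        # The parent that is implicit from the node's position in the tree
        edges.append((parent, node["name"]))
    for extra in node.get("otherParents", []):
        # Additional parents stored as an attribute (assumed key layout)
        edges.append((extra, node["name"]))
    for child in node.get("children", []):
        tree_to_edges(child, node["name"], edges)
    return edges


# Toy input, not real data: XBB hangs under BM.1.1.1 and lists BJ.1 as other parent
toy = {
    "name": "BM.1.1.1",
    "children": [{"name": "XBB", "otherParents": ["BJ.1"], "children": []}],
}
print(tree_to_edges(toy))  # [('BM.1.1.1', 'XBB'), ('BJ.1', 'XBB')]
```

Treating both kinds of parent as plain edges is what turns the tree back into a graph, so a recombinant no longer looks like it has a single "real" parent.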

The multiple parents in the alias_key are the break points, i.e. the middle of XBL is BA.2.75.

I can fix the ordering by sorting the parents list so that the recombinant is always first. This may change what it's like now but will at least be consistent from now on.
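
One possible reading of that ordering rule, sketched in Python: parents that are themselves recombinants (names starting with "X") sort first, and the remaining ties are broken alphabetically so the result is deterministic. The alphabetical tie-break and the helper name `order_parents` are assumptions, not something taken from app.py.

```python
def order_parents(parents):
    """Recombinant parents first, then alphabetical, for a stable order."""
    # False sorts before True, so names starting with "X" come first
    return sorted(parents, key=lambda name: (not name.startswith("X"), name))


print(order_parents(["BA.2.75", "XBB.1"]))  # ['XBB.1', 'BA.2.75']
print(order_parents(["BM.1.1.1", "BJ.1"]))  # ['BJ.1', 'BM.1.1.1']
```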

Thanks for persisting @janko-js

Wytamma commented 1 year ago

I think I've fixed it now... but will leave it up to your keen eyes @janko-js (3ab84c8). I've checked and XBL is XBB.1 and BA.2.75. I would use the graph/data.json as the lineages make more sense as a graph when recombinants are included.

ghost commented 1 year ago

I have the impression it's still not resolved, sorry:

Running my script on your json I get:

 XBB <BM.1.1.1+BJ.1>,  

BM.1.1.1?

but lineage_notes.txt:

 XBB    Recombinant lineage of BJ.1 and BA.2.75 with breakpoint in S1, found in USA and Singapore, from issue #1058

i.e. I'd expect BA.2.75 to be there, not BM.1.1.1?

ghost commented 1 year ago

I'd also like it if you could always insert the recombinants into the tree via the longer path (i.e. under the parent that has the most 'dots' in its full name, tracking the "number of dots" ("the longest path") internally even for recombinants of recombinants). That would guarantee consistency and would also give some clarity about the minimal "naming distance" of every recombinant. I haven't checked whether you're already doing that, as the first step is to have parents that match lineage_notes.txt. I'm sorry for asking, but I believe it actually gives more meaning to the "tree". I understand you like the graph, but to me a consistent tree can say more about the history of the recognition of the pango subvariants; maybe you'll like the idea too.
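
A rough Python sketch of that "longest path" rule, under stated assumptions: the two small tables are an illustrative subset only (the real mappings live in alias_key.json), the depth of a recombinant is taken as one more than its deepest parent, and the function names `depth` and `primary_parent` are hypothetical.

```python
# Illustrative subset only; not a complete or authoritative alias table.
ALIASES = {
    "BA": "B.1.1.529",
    "BJ": "B.1.1.529.2.10.1",
    "BM": "B.1.1.529.2.75.3",
}
RECOMBINANT_PARENTS = {
    "XBB": ["BJ.1", "BM.1.1.1"],
    "XBL": ["XBB.1", "BA.2.75"],
}


def depth(lineage):
    """Number of dots in the fully expanded name, following recombinants."""
    prefix, _, suffix = lineage.partition(".")
    own = len(suffix.split(".")) if suffix else 0
    if prefix in RECOMBINANT_PARENTS:
        # A recombinant sits one level below its deepest parent
        return 1 + own + max(depth(p) for p in RECOMBINANT_PARENTS[prefix])
    if prefix in ALIASES:
        return own + depth(ALIASES[prefix])
    return own  # unaliased root such as "B"


def primary_parent(recombinant):
    """Parent under which the recombinant would be placed in the tree."""
    return max(RECOMBINANT_PARENTS[recombinant], key=depth)


print(primary_parent("XBB"))  # BM.1.1.1
print(primary_parent("XBL"))  # XBB.1
```

With this rule XBL lands under XBB.1 rather than BA.2.75, because XBB.1 inherits the depth of XBB's own deepest parent.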

Wytamma commented 1 year ago

Hmm 🤔 looks like you might have found a bug in pango-designation, as the alias_key lists "XBB": ["BJ.1","BM.1.1.1"], https://github.com/cov-lineages/pango-designation/blob/7f20135411fec880f89a5571f2a4656bb29d5f12/pango_designation/alias_key.json#L174. Maybe worth opening an issue with them?

I will try to sort out the ordering as you state above. Thanks again!

ghost commented 1 year ago

How to compare the lineage notes and the tree data.json (recomb-compared.txt is produced):

https://gist.github.com/janko-js/3eb2ea9a7e504a27d24e219d3dafa993

Wytamma commented 1 year ago

Awesome 🙏

ghost commented 1 year ago

Thanks! I also think that their listing the same parent twice is an issue, and your program should keep only the "unique" parents.

ghost commented 1 year ago

Judging by their issue https://github.com/cov-lineages/pango-designation/issues/1058, it is indeed BM.1.1.1, so in that specific case it's lineage_notes.txt that has not been updated.

And I think they also should not mention the same parent twice; I haven't investigated why that's there.

Wytamma commented 1 year ago

I think the doubled-up parents mark the break points of the recombinant.
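
If that reading is right, and the alias_key list is ordered along the genome (an assumption this thread does not confirm), the break points fall wherever two adjacent entries differ. A small Python sketch, with the hypothetical helper name `breakpoints`:

```python
def breakpoints(segments):
    """Pairs of adjacent, differing parents in an ordered segment list."""
    return [(a, b) for a, b in zip(segments, segments[1:]) if a != b]


# The XBL entry: XBB.1 at both ends, BA.2.75 in the middle
print(breakpoints(["XBB.1", "BA.2.75", "XBB.1"]))
# [('XBB.1', 'BA.2.75'), ('BA.2.75', 'XBB.1')]
```

Under this interpretation the repeated parent carries information (two break points instead of one), so it makes sense to de-duplicate only when building the parent list for the tree.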

ghost commented 1 year ago

So it seems the parents match now! Congratulations.

Regarding the ordering I suggested: I believe it would result in a "tree" whose branches grow "as far as possible", in a way that is consistent for the recombinants: the "later" contributing parents would always be from the same level or earlier, and the first would consistently place the variant at least as far from the "naming root" as the parent with the "most dots" in the longest path allows. I never thought about that solution until I played with the tree you produce, but I think it makes sense. Thanks for all your work!

ghost commented 1 year ago

The comparison using the alias_key.json

https://gist.github.com/janko-js/6ac001b4a2862d3d4cb8e420a8d5c7cb

Running both alias_key.json and the pretty-tree.pl output of tree/data.json through it produces the same result in the end (each line is split into the relevant "words", which are then made unique, sorted in reverse order within each line, and printed in the same format):

XA B.1.177 B.1.1.7 
XAA BA.2 BA.1 
XAB BA.2 BA.1 
XAC BA.2 BA.1 
XAD BA.2 BA.1 
XAE BA.2 BA.1 
XAF BA.2 BA.1 
XAG BA.2 BA.1 
XAH BA.2 BA.1 
XAJ BA.4 BA.2.12.1 
XAK BA.2 BA.1 
XAL BA.2 BA.1 
XAM BA.2.9 BA.1.1 
XAN BA.5.1 BA.2 
XAP BA.2 BA.1 
XAQ BA.2 BA.1 
XAR BA.2 BA.1 
XAS BA.5 BA.2 
XAT BA.2.3.13 BA.1 
XAU BA.2.9 BA.1.1 
XAV BA.5 BA.2 
XAW BA.2 AY.122 
XAY BA.2 AY.45 
XAZ BA.5 BA.2.5 
XB B.1.634 B.1.631 
XBA BA.2 AY.45 
XBB BM.1.1.1 BJ.1 
XBC BA.2 B.1.617.2 
XBD BF.5 BA.2.75.2 
XBE BE.4.1 BA.5.2 
XBF CJ.1 BA.5.2.3 
XBG BA.5.2 BA.2.76 
XBH BA.2.75.2 BA.2.3.17 
XBJ BA.5.2 BA.2.3.20 
XBK CJ.1 BA.5.2 
XBL XBB.1 BA.2.75 
XBM BF.3 BA.2.76 
XBN XBB.3 BA.2.75 
XBP BQ.1 BA.2.75 
XBQ CJ.1 BA.5.2 
XBR BQ.1 BA.2.75 
XBS BQ.1 BA.2.75 
XBT BA.5.2.34 BA.2.75 
XC B.1.1.7 AY.29 
XD BA.1 B.1.617.2 
XE BA.2 BA.1 
XF BA.1 B.1.617.2 
XG BA.2 BA.1 
XH BA.2 BA.1 
XJ BA.2 BA.1 
XK BA.2 BA.1 
XL BA.2 BA.1 
XM BA.2 BA.1.1 
XN BA.2 BA.1 
XP BA.2 BA.1.1 
XQ BA.2 BA.1.1 
XR BA.2 BA.1.1 
XS BA.1.1 B.1.617.2 
XT BA.2 BA.1 
XU BA.2 BA.1 
XV BA.2 BA.1 
XW BA.2 BA.1 
XY BA.2 BA.1 
XZ BA.2 BA.1 

So I'm quite sure the information in both matches now.
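
The normalization described above (split each line into lineage "words", make the parents unique, sort them in reverse) can be sketched in Python roughly as follows. This is not the script from the gist; the regular expression and the function name `normalize` are assumptions chosen to handle both input styles shown in this thread.

```python
import re

# A lineage "word": capital letters followed by optional .number parts
LINEAGE = re.compile(r"[A-Z]+(?:\.\d+)*")


def normalize(line):
    """Return 'NAME parent parent' with unique parents in reverse sort order."""
    # Assumes the line contains at least one lineage name
    name, *parents = LINEAGE.findall(line)
    return " ".join([name] + sorted(set(parents), reverse=True))


print(normalize("XBL <XBB.1+BA.2.75>,"))                  # XBL XBB.1 BA.2.75
print(normalize('"XBL": ["XBB.1","BA.2.75","XBB.1"],'))   # XBL XBB.1 BA.2.75
```

Because both sources reduce to the same canonical line, a plain text diff of the two outputs is enough to spot any recombinant whose parents disagree.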

ghost commented 1 year ago

I think this issue can be closed. If some new inconsistency occurs, a new issue can be opened; at the moment the tree appears consistent.

Wytamma commented 1 year ago

Excellent! Cheers @janko-js