evolbioinfo / gotree

Gotree is a set of command line tools and an API to manipulate phylogenetic trees. It is implemented in Go language.
GNU General Public License v2.0

Possible bug with gotree prune #16

Closed rmcolq closed 2 years ago

rmcolq commented 2 years ago

I have been running gotree prune on a 2-million-tip SARS-CoV-2 tree to remove what is now quite a long list of tips. For some tips it crashes, but in each of these cases I can edit the Newick file by hand so that the prune then succeeds. I noticed that the strings I delete manually all look very similar (examples further down).

Command run: gotree prune -i cog_global.2021-11-18.clean.tree -f cog_global_master.filtered.clean.tips.txt -r -o "cog_global.2021-11-18.clean.pruned.newick"

Error message:

Command error:
  2021/11/25 19:46:52 Newick : Support values attached to root node are ignored
  2021/11/25 19:46:52 Newick : Branch lengths attached to root node are ignored
  [Error] in cmd/prune.go (line 137), message: After tip removal, this node should not have degre 1 without being the root
  Error: After tip removal, this node should not have degre 1 without being the root

Examples of the Newick substrings that I deleted manually so that the prune succeeded:

,(South_Africa/NICD-N5963/2021:6)14061:8
(Italy/CAM_IZSM_RD20702688967/2021:0)426532:12,
,(Italy/CAM_IZSM_RD20702444133/2021:0)276441:1
,(USA/NY-PA-PBRI-CT-1137/2021:0)164683:1
(USA/OR-TRACE-POLK-011821-680/2021:1)474168:2,
,(Italy/PIE-15858188/2021:2)418163:1
,(Italy/CAM_IZSM_RD20702444262_CAM_IZSM_COLLI_TIGEM/2021:8)198146:2
,(USA/NJ-CDC-IBX578468886599/2021:9)485193:3
,(England/PHEC-3K03FKB2/2021:12)320221:1
,(USA/OR-OHSU-10264/2021:1)460724:1
,(Italy/CAM_IZSM_RD20702644357/2021:7)173146:5
,(USA/WA-UW-63337/2021:0)488838:3
,(((Italy/CAM_IZSM_RD20702444259_CAM_IZSM_COLLI_TIGEM/2021:17)198151:5)198150:2)198147:1
,(USA/FL-CDC-ASC210122855/2021:0)456085:2
(((Italy/CAM_IZSM_RD00363110/2021:18)429187:7)429182:2)429181:14,
(Mexico/SON_InDRE_FB45320_S8846/2021:8)144466:5,
,(England/PHEC-3K040K63/2021:6)320222:6

i.e. it looks like the problem is caused by the tip being the only descendant of an internal node. I don't know enough about the Newick format to tell whether this meets the spec, so I assume it is valid (in any case, it would be a real headache to regenerate the tree, given that it now has 2 million sequences in it and is updated with UShER every week, after which some tips are pruned out with gotree).
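One way to check a tree for this pattern before pruning is a small script like the one below. It is a minimal Python sketch and not part of gotree: `parse_newick` and `sole_child_tips` are hypothetical helper names, the labels `n1` to `n5` are made up, and the parser assumes unquoted labels, no comments, and a tree small enough for Python's recursion limit.

```python
import re

def parse_newick(text):
    """Parse a simple Newick string into nested dicts (illustrative only)."""
    # Drop whitespace, then split into punctuation and "label:length" tokens.
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", text))
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label, length = "", None
        if pos < len(tokens) and tokens[pos] not in {"(", ")", ",", ";"}:
            label, _, raw = tokens[pos].partition(":")
            length = float(raw) if raw else None
            pos += 1
        return {"label": label, "length": length, "children": children}

    return node()

def sole_child_tips(node):
    """List the tips whose parent has no other child."""
    found = []
    for child in node["children"]:
        if not child["children"] and len(node["children"]) == 1:
            found.append(child["label"])
        found.extend(sole_child_tips(child))
    return found

# Made-up tree: A and D each sit alone under one or more internal nodes.
tree = parse_newick("((A:9)n1:3,(B:1,C:2)n2:1,(((D:17)n3:5)n4:2)n5:1);")
suspects = sole_child_tips(tree)
print(suspects)
```

Any tip reported here matches the shape of the substrings listed above.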

  1. Could this be caused by gotree's pruning method being run hundreds of times over to remove tips?
  2. If so, could it be handled by gotree pruning?

As a fallback, I plan to parse the string and remove the internal nodes that have a single descendant, but I don't know whether the resulting tree will be handled by UShER or other tree tools.

fredericlemoine commented 2 years ago

Thanks for your report. I can investigate this issue, but would it be possible to share the files that generate this error message (tree + tip list)? To remove internal nodes with a single descendant, there is the command gotree collapse single. But gotree prune applied to a huge number of tips should not generate this kind of pattern.

Update: I just tried to prune 200 000 random tips from the cog tree (this one) without any problem.

Update 2: I just had a look at your Newick string. There are indeed internal nodes with only a single child. Example:

((USA/NJ-CDC-IBX578468886599/2021:9)485193:3
,(England/PHEC-3K03FKB2/2021:12)320221:1
,(USA/OR-OHSU-10264/2021:1)460724:1);

Represents:

|--3---o---9---USA/NJ-CDC-IBX578468886599/2021
|
o--1---o---12----England/PHEC-3K03FKB2/2021
|
|--1---o---1-----USA/OR-OHSU-10264/2021

Calling: gotree collapse single on that tree will give:

|--12---USA/NJ-CDC-IBX578468886599/2021
|
o--13----England/PHEC-3K03FKB2/2021
|
|--2-----USA/OR-OHSU-10264/2021
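
The arithmetic behind that result (3+9, 1+12, 1+1) can be reproduced with a short Python sketch. This is only an illustration of collapsing single-child nodes by summing branch lengths, not gotree's code: `parse_newick`, `collapse_single` and `to_newick` are hypothetical helpers, the parser handles only unquoted labels, and the root is left untouched.

```python
import re

def parse_newick(text):
    """Parse a simple Newick string into nested dicts (illustrative only)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", re.sub(r"\s+", "", text))
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label, length = "", None
        if pos < len(tokens) and tokens[pos] not in {"(", ")", ",", ";"}:
            label, _, raw = tokens[pos].partition(":")
            length = float(raw) if raw else None
            pos += 1
        return {"label": label, "length": length, "children": children}

    return node()

def collapse_single(node, is_root=True):
    """Replace each non-root single-child node by its child, summing lengths."""
    node["children"] = [collapse_single(c, False) for c in node["children"]]
    if len(node["children"]) == 1 and not is_root:
        child = node["children"][0]
        child["length"] = (child["length"] or 0.0) + (node["length"] or 0.0)
        return child
    return node

def fmt(x):
    # Write whole numbers without a trailing ".0".
    return str(int(x)) if x.is_integer() else repr(x)

def to_newick(node):
    """Serialise the nested dicts back to Newick (without the final ';')."""
    s = ""
    if node["children"]:
        s = "(" + ",".join(to_newick(c) for c in node["children"]) + ")"
    s += node["label"]
    if node["length"] is not None:
        s += ":" + fmt(node["length"])
    return s

example = """((USA/NJ-CDC-IBX578468886599/2021:9)485193:3
,(England/PHEC-3K03FKB2/2021:12)320221:1
,(USA/OR-OHSU-10264/2021:1)460724:1);"""
result = to_newick(collapse_single(parse_newick(example))) + ";"
print(result)
```

In this sketch the label of the removed internal node is discarded and the tip keeps its own label.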

rmcolq commented 2 years ago

It's the COG tree pipeline which I'm running - I've only had one tip before which was problematic for removal. But there have been >50 in the latest tree and I'm trying to work out what's going on - the pipeline only completed once in the last 2 weeks so there were a lot of new additions to the tree last round. I'll investigate my tree further before following up with more details if there is still a problem.

rmcolq commented 2 years ago

It also sounds like gotree collapse single might resolve problems with this tree instance anyway - thank you!

rmcolq commented 2 years ago

It looks like something about the way UShER works resulted in the internal nodes with only one single child. Apparently this does cause a problem with pruning in gotree prune. Running gotree collapse single resulted in a tree that could be pruned.

I was surprised by the difference between the newick file I was inputting and the one you linked to that was publicly available. The only difference is that I run a step to rescale the branch lengths (which must have internally collapsed the problematic nodes) before publishing to the website.

Thanks for the help! My problem is now resolved! It now looks like this is a potential optimisation for gotree prune to handle trees like this if you want to, or feel free to close the issue. If you want to take it forward, I have a tree with 4 known problematic tips in it that I can send you.

fredericlemoine commented 2 years ago

Ok, glad it worked.

gotree prune could handle that indeed, but I am not sure what you would expect as an output. Would it be a tree without the single nodes? Otherwise, keeping them while removing tips would generate tons of single node paths in the tree.

As a first step, I will extend the error message so that it advises running gotree collapse single.

rmcolq commented 2 years ago

Your suggestion of updating the error message sounds like a good solution. If possible, it would also be helpful to print the tip label at which the problem occurred. I hadn't realised this was the problem until I manually checked lots of tips, which involved running gotree prune on slices of the tip list to isolate the problematic ones. If you do update prune, I think the output should be as if you had collapsed that tip with the internal nodes of which it is the sole child, then removed it.
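
The slicing approach described above can be automated by bisection. The following Python sketch is hypothetical: `isolate_failures` is an illustrative name, and `fails` stands in for whatever wrapper runs the prune on a batch of tips and reports whether it errored. It assumes that a batch fails exactly when it contains at least one problematic tip, which may not hold for every tree.

```python
def isolate_failures(tips, fails):
    """Return the tips that make a batch fail, testing large batches first.

    `fails(batch)` is a caller-supplied predicate (an assumption here),
    e.g. a wrapper that prunes `batch` and returns True on error.
    """
    if not tips or not fails(tips):
        return []                 # the whole batch is fine, no need to split
    if len(tips) == 1:
        return list(tips)         # a failing batch of one is a culprit
    mid = len(tips) // 2
    return (isolate_failures(tips[:mid], fails)
            + isolate_failures(tips[mid:], fails))

# Stand-in predicate for demonstration: two made-up "bad" tips.
bad = {"t3", "t11"}
tips = [f"t{i}" for i in range(16)]
culprits = isolate_failures(tips, lambda batch: any(t in bad for t in batch))
print(culprits)
```

With few problematic tips among many, this needs far fewer runs than testing each tip on its own.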

fredericlemoine commented 2 years ago

I see. I updated gotree prune to remove the internal nodes that become tips after pruning. There are still cases that will give errors (with a better message, I hope), but in the general case it should work. Let me know if it works; you can test it in the 0.4.3a pre-release.
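
The idea of removing internal nodes that become tips can be illustrated with a short Python sketch. It is not the gotree implementation: `make_node`, `prune` and `labels` are hypothetical helpers working on plain dicts, the tree is made up, and nodes left with exactly one remaining child are not cleaned up here.

```python
def make_node(label, *children):
    """Build a tree node as a plain dict (illustrative structure)."""
    return {"label": label, "children": list(children)}

def prune(node, to_remove):
    """Drop the tips named in `to_remove`.

    Returns the pruned node, or None when nothing is left below it, so that
    an internal node whose tips are all removed disappears as well.
    """
    if not node["children"]:
        return None if node["label"] in to_remove else node
    kept = []
    for child in node["children"]:
        pruned = prune(child, to_remove)
        if pruned is not None:
            kept.append(pruned)
    if not kept:
        return None               # internal node became a tip: remove it too
    node["children"] = kept
    return node

def labels(node):
    """Pre-order list of all node labels, for inspection."""
    out = [node["label"]]
    for child in node["children"]:
        out.extend(labels(child))
    return out

# D is the sole descendant of the chain n5 -> n4 -> n3.
tree = make_node("root",
                 make_node("n1", make_node("A")),
                 make_node("n2", make_node("B"), make_node("C")),
                 make_node("n5", make_node("n4", make_node("n3", make_node("D")))),
                 make_node("E"))
remaining = labels(prune(tree, {"D"}))
print(remaining)
```

Removing D takes n3, n4 and n5 with it, which matches the behaviour requested earlier in the thread.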

fredericlemoine commented 2 years ago

I am closing the issue; feel free to reopen it if needed.