Closed rmcolq closed 2 years ago
Thanks for your report.
I can investigate this issue, but would it be possible to share the files that generate this error message (tree+tip list)?
To remove internal nodes with a single descendant, there is the command: gotree collapse single.
But gotree prune applied to a huge number of tips should not generate this kind of pattern.
Update: I just tried to prune 200 000 random tips from the cog tree (this one) without any problem.
Update 2: I just had a look at your Newick string. There are indeed internal nodes with only a single child. Example:
((USA/NJ-CDC-IBX578468886599/2021:9)485193:3
,(England/PHEC-3K03FKB2/2021:12)320221:1
,(USA/OR-OHSU-10264/2021:1)460724:1);
Represents:
|--3---o---9---USA/NJ-CDC-IBX578468886599/2021
|
o--1---o---12----England/PHEC-3K03FKB2/2021
|
|--1---o---1-----USA/OR-OHSU-10264/2021
Calling gotree collapse single on that tree will give:
|--12---USA/NJ-CDC-IBX578468886599/2021
|
o--13----England/PHEC-3K03FKB2/2021
|
|--2-----USA/OR-OHSU-10264/2021
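To make the effect concrete, here is a minimal Python sketch of that collapse on the example above. It is not gotree's implementation: the helper names (parse, collapse_single, write) are made up for illustration, it assumes unquoted labels with no special characters, and it assumes that the surviving child keeps its own label while the branch lengths are summed.

```python
import re

SPECIAL = {"(", ")", ",", ";"}

def parse(newick):
    """Parse a simple Newick string into nested dicts (label, length, children)."""
    tokens = re.findall(r"[(),;]|[^(),;]+", newick.strip())
    pos = 0

    def node():
        nonlocal pos
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        label, length = "", None
        if pos < len(tokens) and tokens[pos] not in SPECIAL:
            label, _, raw = tokens[pos].partition(":")
            length = float(raw) if raw.strip() else None
            pos += 1
        return {"label": label.strip(), "length": length, "children": children}

    return node()

def collapse_single(n):
    """Replace each single-child internal node by its child, summing branch lengths."""
    n["children"] = [collapse_single(c) for c in n["children"]]
    if len(n["children"]) == 1:
        child = n["children"][0]
        if n["length"] is not None or child["length"] is not None:
            child["length"] = (n["length"] or 0) + (child["length"] or 0)
        return child
    return n

def write(n):
    """Serialise the nested dicts back to Newick (without the trailing ';')."""
    s = ""
    if n["children"]:
        s += "(" + ",".join(write(c) for c in n["children"]) + ")"
    s += n["label"]
    if n["length"] is not None:
        s += ":" + format(n["length"], "g")
    return s

tree = ("((USA/NJ-CDC-IBX578468886599/2021:9)485193:3,"
        "(England/PHEC-3K03FKB2/2021:12)320221:1,"
        "(USA/OR-OHSU-10264/2021:1)460724:1);")
collapsed = write(collapse_single(parse(tree))) + ";"
print(collapsed)
```

The sketch is recursive, so it would hit Python's recursion limit on a very deep tree; for a tree with millions of tips the gotree command is the practical option.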
It's the COG tree pipeline which I'm running - I've only had one tip before which was problematic for removal. But there have been >50 in the latest tree and I'm trying to work out what's going on - the pipeline only completed once in the last 2 weeks so there were a lot of new additions to the tree last round. I'll investigate my tree further before following up with more details if there is still a problem.
It also sounds like gotree collapse single might resolve problems with this tree instance anyway - thank you!
It looks like something about the way UShER works resulted in the internal nodes with only a single child. Apparently this does cause a problem with pruning in gotree prune. Running gotree collapse single resulted in a tree that could be pruned.
I was surprised by the difference between the newick file I was inputting and the one you linked to that was publicly available. The only difference is that I run a step to rescale the branch lengths (which must have internally collapsed the problematic nodes) before publishing to the website.
Thanks for the help! My problem is now resolved! It now looks like this is a potential optimisation for gotree prune to handle trees like this if you want to, or feel free to close the issue. If you want to take it forward, I have a tree with 4 known problematic tips in it that I can send you.
Ok, glad it worked.
gotree prune could handle that indeed, but I am not sure what you would expect as an output. Would it be a tree without the single nodes? Otherwise, keeping them while removing tips would generate tons of single node paths in the tree.
I will start by adding a note to the error message advising to run gotree collapse single.
Your suggestion of updating the error message sounds like a good solution. If possible, it would also be helpful to print the tip label at which the problem occurred - I hadn't realised this was the problem until I manually checked lots of tips, which involved running gotree prune on slices of the tip list to isolate the problem ones. If you do update prune, I think the output should be as if you had collapsed that tip with the internal nodes of which it is the sole child, then removed it.
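For anyone else who has to isolate problem tips before the error message improves, the slicing can be automated as a bisection. This is only a sketch: find_failing_tips and the prune_fails callback are hypothetical names, the callback stands in for "run gotree prune on this subset and report whether it errored", and it assumes a subset fails exactly when it contains at least one problematic tip.

```python
def find_failing_tips(tips, prune_fails):
    """Return the tips that make pruning fail, by recursive halving.

    prune_fails(subset) -> bool is a hypothetical callback supplied by the
    caller; it should return True when pruning that subset raises an error.
    """
    if not tips or not prune_fails(tips):
        return []                 # nothing problematic in this slice
    if len(tips) == 1:
        return list(tips)         # a failing slice of size one is a culprit
    mid = len(tips) // 2
    return (find_failing_tips(tips[:mid], prune_fails)
            + find_failing_tips(tips[mid:], prune_fails))

# Stand-in for real prune runs: pretend two tips are problematic.
bad = {"t3", "t7"}
tips = [f"t{i}" for i in range(10)]
culprits = find_failing_tips(tips, lambda subset: any(t in bad for t in subset))
print(culprits)
```

With a handful of bad tips in a long list this needs far fewer prune runs than testing tips one by one, since every slice that prunes cleanly is discarded in a single call.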
I see. I updated gotree prune to remove the internal nodes that become tips after pruning. There are still cases that will give errors (with a better message, I hope), but in the general case it should work. Let me know if it works; you can test it in the 0.4.3a pre-release.
I am closing the issue; feel free to reopen it if needed.
I have been running gotree prune on a 2 million tip SARS-CoV-2 tree to remove what is now quite a long list of tips. For some tips it crashes, but for all of these I can manually update the newick file to successfully prune the tree. I noticed that the strings I am deleting manually all look very similar.
Command run:
gotree prune -i cog_global.2021-11-18.clean.tree -f cog_global_master.filtered.clean.tips.txt -r -o "cog_global.2021-11-18.clean.pruned.newick"
Error message:
Examples of the newick substrings which result in successful removal:
i.e. it looks like the problem is being caused by the tip being the only descendant of an internal node. I don't know enough about newick format to really understand if this meets the spec or not, so I assume it does make sense (in any case, it would be a real headache to regenerate the tree given that it now has 2 million sequences in it and is currently being updated with UShER every week, then some tips pruned out with gotree).
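A quick way to list such tips without loading the tree is a regular expression over the Newick text. The sketch below is an illustration only: the pattern and the sole_child_tips name are my own, it assumes unquoted labels that contain no parentheses, commas, colons or semicolons, and the test string is a small example of the pattern rather than an excerpt from the real tree.

```python
import re

# A "(" followed by exactly one label (optionally ":length") and then ")"
# is a tip that is the only child of its parent node.
SINGLE_TIP = re.compile(r"\(([^(),:;]+)(?::[^(),;]*)?\)")

def sole_child_tips(newick):
    """Return labels of tips that are the sole child of an internal node."""
    return [m.group(1).strip() for m in SINGLE_TIP.finditer(newick)]

example = ("((USA/NJ-CDC-IBX578468886599/2021:9)485193:3,"
           "(England/PHEC-3K03FKB2/2021:12)320221:1,"
           "(USA/OR-OHSU-10264/2021:1)460724:1);")
found = sole_child_tips(example)
print(found)
```

Intersecting this list with the tips to be pruned would show in advance which ones are likely to trigger the error.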
As a fallback, I plan to try parsing the newick string to remove the internal nodes with a single descendant, but I don't know if the resulting tree will be handled by UShER/other tree tools.