etetoolkit / ete

Python package for building, comparing, annotating, manipulating and visualising trees. It provides a comprehensive API and a collection of command line tools, including utilities to work with the NCBI taxonomy tree.
http://etetoolkit.org
GNU General Public License v3.0

How to loop multiple trees in a single newick file #652

Open singing-scientist opened 1 year ago

singing-scientist commented 1 year ago

Apologies if this has been asked and answered: if I have a newick file with multiple trees (one per line), is there a straightforward way to import and loop through them one at a time? In my experimenting with ete3, if I import such a newick as my_tree = Tree(".newick"), it seems to store it as a single tree (and I'm not sure how that's possible).

Many thanks for any help with this (likely basic) question! Chase

jhcepas commented 1 year ago

ete3 must be parsing only the first tree in your list. To load all of them you should do something like:

from ete3 import Tree

trees = []
# Parse each non-empty line as its own newick tree
with open('mytrees.newick') as handle:
    for line in handle:
        line = line.strip()
        if line:
            trees.append(Tree(line))

singing-scientist commented 1 year ago

Thanks very much, @jhcepas ! If I'm understanding correctly, it's actually considering it one big tree (more likely a tree sequence?) because it's reporting the total number of leaves across all trees (20 trees x 6264 = 125280):

>>> from ete3 import Tree
>>> my_tree = Tree("raxml.mlTrees")
>>> my_tree.describe()
Number of leaf nodes:   125280
Total number of nodes:  250501
Rooted: No
Most distant node:  D3|IRC200189
Max. distance:  0.260229

Thanks so much for the suggestion; I will not open the file directly with Tree (since I don't understand what it's doing), but instead loop the lines and read each line separately using Tree.

With gratitude, Chase

jhcepas commented 1 year ago

Wow, that's interesting. In principle, ETE considers a newick tree to end at the ';' symbol. What's the format of your raxml.mlTrees file? One newick per line?

singing-scientist commented 1 year ago

Indeed, that's exactly what I expected too! Yes, the format is one newick tree per line:

(((C1|IRC202059:0.000653,(C1|PAP2664:0.0 ... 0.000612):0.000980):0.002181):0.001470);
((A1|PAP230848:0.001037,A1|PAP154594:0.0 ... 1|PAP2230:0.000001):0.000001):0.000001);
((A1|PAP102078:0.001252,((((A1|PAP1113:0 ... 002680):0.000001,A1|PAP231566:0.000001);
((A3|PAP194931:0.000001,A3|IRC203769:0.0 ... PAP244882:0.000613):0.000001):0.000001);
((A1|PAP256745:0.000599,(A1|IRC201679:0. ... 0.003449,A1|SCD1362:0.000614):0.001205);
(((A1|SCD2643:0.000001,A1|IRC201643:0.00 ... 0.010375):0.003761):0.000600):0.000001);
(((((A1|IRC201629:0.004009,(((((A1|IRC20 ... 000973):0.001007,A1|PAP156465:0.000001);
((A4|PAP176102:0.001198,(((A4|IRC201637: ... 261|212CG:0.001945):0.000963):0.000001);
((((D4|IRC201739:0.002448,(((C2|IRC20054 ... 010354):0.007279,D4|IRC200643:0.000001);
((((A1|SCD2048:0.003091,A1|PAP111730:0.0 ... 0.001251):0.001229):0.000001):0.000001);
((D2|PAP0372:0.000001,((((((D2|PAP2336:0 ... 000001,D2|PAP221420:0.000001):0.000001);
(((A1|PAP3220:0.001927,(A1|IRC202133:0.0 ... 0.000001):0.000001):0.000001):0.000001);
((A1|IRC200092:0.002336,A1|SCD5803:0.001 ... PAP157935:0.001802):0.000582):0.000001);
(((A1|PAP242284:0.001742,(((A1|PAP119879 ... PAP167532:0.000001):0.000001):0.000001);
(((A1|IRC201915:0.000569,A1|IRC200620:0. ... IRC200513:0.002852):0.000001):0.000001);
((((((NA|IRC201639:0.008498,A1|IRC202151 ... IRC201078:0.001376):0.002153):0.000001);
(((A1|PAP277589:0.001174,(((((A1|Qv35943 ... 0.000001):0.000001,A1|PAP2436:0.000572);
((((A1|SCD2969:0.000001,((A1|SCD2720:0.0 ... 000602):0.000571,A1|PAP289437:0.000001);
(((((A1|IRC201668:0.003988,A1|SCD2323:0. ... 0.006509,A1|SCD1445:0.000657):0.000611);
((A1|IRC201000:0.000001,(A1|PAP254496:0. ... C200816:0.000001,A1|IRC200997:0.000001);

I am so new to ete3 I do not trust myself to understand how the structures work, but I had expected inputting this to result in some sort of list of trees.
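
For anyone landing here with the same question, here is a minimal sketch of one way to get a list of trees from such a file. It follows the remark above that a newick tree ends at ';', so it splits the file contents on ';' instead of relying on line breaks. The helper names split_newick and load_trees are made up for this example and are not part of ete3, and the sketch assumes ';' never appears inside a quoted node name.

```python
from typing import Iterator


def split_newick(text: str) -> Iterator[str]:
    """Yield one newick string per ';'-terminated tree found in `text`.

    Assumption: ';' is only used as a tree terminator, never inside a
    quoted node name.
    """
    for chunk in text.split(";"):
        chunk = chunk.strip()
        if chunk:
            # Put back the terminator that str.split() removed.
            yield chunk + ";"


def load_trees(path: str) -> list:
    """Parse every tree in the file at `path` into its own ete3 Tree."""
    # Imported here so split_newick() can be used without ete3 installed.
    from ete3 import Tree

    with open(path) as handle:
        return [Tree(nw) for nw in split_newick(handle.read())]
```

With a file like the one shown above, load_trees("raxml.mlTrees") should return a plain Python list that can be looped over, and calling len() on it is a quick way to confirm that the number of trees matches what you expect.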