BenoitMorel / ParGenes

A massively parallel tool for model selection and tree inference on thousands of genes
GNU General Public License v3.0

Not an issue just a question/request #67

Open herber4 opened 3 years ago

herber4 commented 3 years ago

Dear Pargenes,

Is there a feature in Pargenes that allows you to sort trees by the number of taxa in each tree?

For example, most clustering programs output single-copy core gene (SCCG) alignments containing genes from all taxa, but some SCCGs include fragmented genes (resulting in alignment files with more than n taxa) because of erroneous assembly or poor genome quality.

If I have all of the bootstrap trees and only want to concatenate those with a taxon count of, say, 23, what is the best way to do so?

Best,

Austin Herbert Clemson Plant and Environmental Sciences

BenoitMorel commented 3 years ago

Dear Austin,

Regarding the sorting: there is no such feature in ParGenes. How would you expect the sorted trees to be output? In one file with all Newick trees sorted by size? Or do you want the ones that exceed a given size to be filtered out?

Here is a one-line command for printing all trees with exactly 23 taxa. For a rooted, fully bifurcating tree, the number of taxa is the number of left parentheses in the Newick string plus one. It should work on Linux and macOS (but test it first, I only tried it on a few examples).

awk '{N = 23; if ((split($0,a,"(")-1) == (N - 1)) print $0}' file_with_all_trees.newick

In general, I would recommend investing the time to learn a scripting language (I think Python and R are quite popular among biologists) together with a phylogenetics library (for instance dendropy or ete3 in Python). This would allow you to easily perform any "simple" preprocessing or postprocessing and save you a bunch of time :-)
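As a rough illustration of the scripting approach, here is a minimal Python sketch that uses only the standard library. It assumes one Newick tree per line and counts taxa as the number of commas plus one, which holds for rooted, unrooted, and multifurcating trees alike. The function names `count_taxa` and `filter_trees` are made up for this example and are not part of ParGenes, dendropy, or ete3.

```python
import re


def count_taxa(newick: str) -> int:
    """Count leaves in one Newick string as (number of commas + 1)."""
    # Drop [comments] and 'quoted labels' so commas inside them are ignored.
    stripped = re.sub(r"\[[^\]]*\]|'[^']*'", "", newick)
    return stripped.count(",") + 1


def filter_trees(lines, n_taxa):
    """Yield the non-empty tree lines that contain exactly n_taxa leaves."""
    for line in lines:
        line = line.strip()
        if line and count_taxa(line) == n_taxa:
            yield line


if __name__ == "__main__":
    # Toy input: a rooted 3-taxon tree, a 4-taxon tree, an unrooted 3-taxon tree.
    trees = ["((A,B),C);", "((A,B),(C,D));", "(A,B,C);"]
    for tree in filter_trees(trees, 3):
        print(tree)
```

To run it on a real file, the list `trees` could be replaced by an open file handle, since `filter_trees` accepts any iterable of lines. A proper Newick parser from a phylogenetics library would be the safer choice for unusual input.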

Best, Benoit

herber4 commented 3 years ago

Hello Benoit,

Thanks for the input. Unfortunately, the awk command only outputs 23 trees in total rather than all trees with 23 taxa. In my case, the bootstrap values can't be mapped to a reference tree because many of the gene trees have a different number of taxa, which is why I am still looking for a way to efficiently filter out the files with more or fewer than the correct number of taxa.

Thanks,

Austin

nylander commented 1 year ago

Hi, the awk command from Benoit will work (tested with GNU Awk 5.1.0). Be aware, however, that the parenthesis count differs depending on whether the trees are rooted or unrooted:

awk '{N = 23; if ((split($0,a,"(")-1) == (N - 1)) print $1}' rooted.trees
awk '{N = 23; if ((split($0,a,"(")) == (N - 1)) print $1}' unrooted.trees
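The difference between the two commands can be checked on a toy example. The Python sketch below assumes fully resolved trees, where "unrooted" means the usual Newick convention of a trifurcation at the top level; the helper name `open_parens` is invented for this illustration. A rooted binary tree with N taxa then has N - 1 left parentheses, and the unrooted form has N - 2.

```python
def open_parens(newick: str) -> int:
    """Return the number of left parentheses in a Newick string."""
    return newick.count("(")


# Two representations of a four-taxon tree (N = 4).
rooted = "((A,B),(C,D));"   # bifurcating root
unrooted = "((A,B),C,D);"   # trifurcating top level

n = 4
print(open_parens(rooted) == n - 1)    # rooted: N - 1 left parentheses
print(open_parens(unrooted) == n - 2)  # unrooted: N - 2 left parentheses
```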

/Johan