Understanding branch numbering

lkelly3 commented 6 years ago

Hello

I would like to be able to run grand-conv on a large number of genes (c. 3000) for the same set of taxa and then pull out the residuals and PPs for divergent/convergent sites for specific taxon pairs. I have run a few tests with individual loci and can generate all the values I need, but for the taxon pairs I am interested in the branch numbering apparently differs between loci so that I could not use the same command to get the values I want for all loci, e.g. ./bin/gc-discover --dir=output --nthreads=2 --branch-pairs="(8,9)" will pull out values for different taxon pairs for different genes.

Can you tell me how the branch numbering is decided? I am using the same input tree file for all loci and each phy file contains sequences from the same set of taxa. Is there a way to set the branch numbering so that the taxon pairs I am interested in will be numbered consistently across all loci?

Also, although the expected numbers of convergent and divergent substitutions are included in the "branch-totals.out" file, I can't see the residuals reported in any of the plain text results files. Is it possible to output these as well? I.e. the full contents of the "Branch Pairs Table" that can be viewed via the "index.html" file.

I would be very grateful for any help you can give on this! Laura

jasondk commented 5 years ago

Hi Laura, I'm very sorry we missed this comment from a long time ago. In case others have similar issues:

1) Node numbering If you're running multiple data sets on the same tree, it is important that the sequence order is the same for each alignment. This is because node numbering is performed following PAML's convention, which is based on ordering in the alignment. For reference, tip nodes are numbered from 0..n-1 for n input sequences, in input order. Then doing a post-order traversal starting from the first sequence in the alignment, nodes are labelled tip-to-root sequentially (in order visited) starting at n+1.

Usually when we run gc, we do a dummy analysis first to get some sample output with our input data. You can then use the visual explorer to find branches of interest and write down their node numbers. Then you can re-run analyses to your heart’s content, using those numbers. As long as the alignment order is the same, you can apply those node id’s across datasets even.

2) Obtaining a list of residuals This data is not output in separate file, but you can calculate them yourself from data in output/UI/User/indexData.js. Regression coefficients are given there as regressionSlope and regressionIntercept. The expected number of convergent and divergent substitutions are given in yPoints and xPoints. So you can calculate the residuals simply as: yPoints - (regressionSlope * xPoints + regressionIntercept) with a custom script. In an upcoming release, we provide the residuals data directly for user convenience.

lkelly3 commented 5 years ago

Thanks for your reply. Your solution to my question about node numbering is what I ended up doing in the end; many thanks for the solution for the residuals - this will be very useful for the future!

dekoning-lab / grand-conv

Understanding branch numbering #2