How can I find out number of patterns and number of partitions from my input file without running Beast2?

CompEvol / beast2

Bayesian Evolutionary Analysis by Sampling Trees

www.beast2.org

GNU Lesser General Public License v2.1


How can I find out number of patterns and number of partitions from my input file without running Beast2? #1134

Open mzhuangsdsc opened 1 year ago

mzhuangsdsc commented 1 year ago

Hello, I have a Beast2 input file. I know that if I run the Beast2 jar, I can find the number of patterns and the number of partitions in the log file. Is there a script somewhere that can give me that information without running the Beast2 jar? Or can somebody tell me how to write a simple script to get the number of patterns and the number of partitions from my input file?

Thank you very much!

rbouckaert commented 1 year ago

If you run beast -validate beast.xml from the command line, BEAST will parse the XML but will not start the MCMC. It prints out the pattern and site counts for each of the alignments in the XML, just like when starting any other BEAST run. Would that be sufficient for what you need?

achourasia commented 1 year ago

@rbouckaert we would like to fetch two specific elements, the pattern and site counts, from the input file in a PHP application. If you could provide any pointers or rules that could be used to parse this information from the XML file, that would be great, as we don't know much about the structure of the input file and its biological interpretation.

rbouckaert commented 1 year ago

Not familiar with php, but I suppose it can launch an application and parse its output. If so, you could install the Babel package for BEAST 2 and run something like applauncher Nexus2Fasta -in alignment.nex -out /dev/null | grep patterns which converts the alignment to fasta but, as a side effect, prints out the number of patterns (and taxa and sites). You could also write your own package, starting from the code for Nexus2Fasta and removing the parts for exporting fasta. The code is here: https://github.com/rbouckaert/Babel/blob/master/src/babel/tools/Nexus2Fasta.java

achourasia commented 1 year ago

Thanks for the additional information. We are unable to install other tools and run them on our server, so we need to parse the input XML file, which is easy to do in PHP. However, we don't know what to look for in the XML file. If you or someone else could give us hints on which structures to pull out of the XML file and how to combine them to identify the partition and pattern counts, that will do the trick for us.

rbouckaert commented 1 year ago

I see: what you are looking for are the alignments, which typically have the attribute spec="Alignment". Each alignment contains sequences in sequence elements. The sequence data can be found in the value attribute of the sequence elements. If there are partitions (like splits on codon positions, or for different genes), there may be elements with attribute spec="FilteredAlignment" and a reference to the main alignment. Further, there is a filter attribute that specifies which sites to select from the main alignment as follows:

            First site is 1.
            Filter specs are comma separated, either a singleton, a range [from]-[to] or iteration [from]:[to]:[step]; 
            1-100 defines a range, 
            1-100\3 or 1:100:3 defines every third in range 1-100, 
            1::3,2::3 removes every third site. 
            Default for range [1]-[last site], default for iterator [1]:[last site]:[1]
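
The filter rules above can be sketched in Python as follows. This is only an illustration of the rules as listed, not BEAST's own parser: the function name parse_filter and the n_sites parameter (the number of sites in the main alignment) are made up for this example, and error handling is left out.

```python
def parse_filter(spec, n_sites):
    """Expand a filter string into 0-based site indices, in the order listed.

    Hypothetical helper: follows the rules quoted above, not BEAST's code.
    """
    sites = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            # Iteration [from]:[to]:[step]; missing fields take the defaults.
            fields = (part.split(":") + ["", ""])[:3]
            start = int(fields[0]) if fields[0] else 1
            end = int(fields[1]) if fields[1] else n_sites
            step = int(fields[2]) if fields[2] else 1
        elif "-" in part:
            # Range [from]-[to], optionally followed by \step.
            rng, _, step_text = part.partition("\\")
            low, _, high = rng.partition("-")
            start = int(low) if low else 1
            end = int(high) if high else n_sites
            step = int(step_text) if step_text else 1
        else:
            # Singleton site.
            start = end = int(part)
            step = 1
        # Sites are 1-based and inclusive in the filter, 0-based here.
        sites.extend(range(start - 1, end, step))
    return sites
```

For a 9-site alignment, parse_filter("1::3,2::3", 9) selects sites 1, 4, 7, 2, 5 and 8 (as 0-based indices), which leaves out every third site.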

When BEAST runs, it shows information from main alignments as well as filtered alignments. Hope this helps.
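
As a rough sketch of how the structure described above could be read outside BEAST, the Python below collects the sequence elements of each main alignment and counts distinct site columns as patterns. The function name alignment_stats is hypothetical, and several things are assumptions: sequence data is taken only from the value attribute, alignments are recognised only by a spec attribute ending in "Alignment", filtered alignments are skipped, and a pattern is taken to be a distinct column of upper-cased characters. BEAST's reported number may differ, for example in how gaps and ambiguity codes are treated.

```python
import xml.etree.ElementTree as ET


def alignment_stats(xml_text):
    """Return {alignment id: (taxa, sites, patterns)} for main alignments.

    Hypothetical helper; counts distinct columns, which is an assumption
    about what a "pattern" is, not BEAST's implementation.
    """
    root = ET.fromstring(xml_text)
    stats = {}
    for elem in root.iter():
        spec = elem.get("spec", "")
        # Keep main alignments, skip FilteredAlignment and everything else.
        if not spec.endswith("Alignment") or spec.endswith("FilteredAlignment"):
            continue
        # Sequence data sits in the value attribute; drop any whitespace.
        seqs = [
            "".join(seq.get("value", "").split()).upper()
            for seq in elem.iter("sequence")
        ]
        if not seqs:
            continue
        n_sites = min(len(s) for s in seqs)
        # zip(*seqs) yields one tuple per site column (truncated to the
        # shortest sequence); the set keeps each distinct column once.
        patterns = set(zip(*seqs))
        stats[elem.get("id", "")] = (len(seqs), n_sites, len(patterns))
    return stats
```

The work is one pass over the columns, so memory use is roughly the size of the alignment itself. To handle a partition, the same column counting could be applied to only the sites selected by its filter attribute.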

achourasia commented 1 year ago

@rbouckaert thanks for sharing the additional information. This helps us easily identify the partitions, but we also need to find the number of patterns. I discussed this with one of my colleagues who has significant experience with phylogeny codes. He explained that counting patterns is computationally intensive and complex, and is further complicated by input files being constructed in a few different formats. This would essentially mean we'd need to recreate the entire parser in PHP and deal with memory and compute requirements to identify the number of patterns. So we'll need to step back and use other existing tools like IQ-TREE to calculate and provide this information. Thanks again though.