amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

Binary data input format #177

maggiesudo closed 5 months ago

maggiesudo commented 5 months ago

Hi there, I would like to use raxml-ng to perform a phylogenetic analysis on my binary data. I have presence/absence data for 73,875 genes across 151 lines, with presence and absence encoded as '1' and '0'. I am having a hard time figuring out how to prepare the binary data so that the software can read it. This is how my data currently look, after going through many example input files for raxml8 and raxml-ng.

151 73875
LOC_Os01g01010 1111111111111111111111111111111111111111111111111111111
LOC_Os01g01019 11111111111111111111111111111111111111111111111111111111

I ran the following command: raxml-ng --all --data-type BIN --msa binary_input_v2.phy --model BIN+F+G

This is the error I get when I try to run RAxML-NG: ERROR: Error loading MSA: cannot parse any format supported by RAxML-NG!

I do not have the MSA in any other format (protein alignments, etc.). Thank you in advance. I would really appreciate any input!

amkozlov commented 5 months ago

In your example, two sequences have different lengths, which is not allowed in an alignment.

If that doesn't fix the problem, please re-run with --log debug --msa-format PHYLIP and look for the error messages.
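
For readers who want to check a file before handing it to raxml-ng, here is a minimal Python sketch of the two checks discussed above: the header counts must match the records, and every sequence must have the declared length. It is not part of RAxML-NG. The function name `check_phylip` is hypothetical, and the sketch assumes a sequential (non-interleaved) PHYLIP file with one record per line, the name separated from the sequence by whitespace. The allowed character set `01-?` is also an assumption and can be changed.

```python
def check_phylip(text, allowed="01-?"):
    """Return a list of problems found in a sequential PHYLIP string.

    Assumes one record per line: <name> <whitespace> <sequence>.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]

    header = lines[0].split()
    if len(header) != 2 or not all(tok.isdigit() for tok in header):
        return ["first line must be '<n_taxa> <n_sites>'"]
    n_taxa, n_sites = int(header[0]), int(header[1])

    problems = []
    records = lines[1:]
    if len(records) != n_taxa:
        problems.append(f"header declares {n_taxa} taxa, found {len(records)}")

    for i, line in enumerate(records, start=1):
        parts = line.split(None, 1)
        name = parts[0]
        # Drop any whitespace inside the sequence before measuring it.
        seq = "".join(parts[1].split()) if len(parts) > 1 else ""
        if len(seq) != n_sites:
            problems.append(
                f"sequence {i} ({name}): length {len(seq)}, expected {n_sites}"
            )
        bad = sorted(set(seq) - set(allowed))
        if bad:
            problems.append(f"sequence {i} ({name}): unexpected characters {bad}")
    return problems
```

Calling `check_phylip(open("binary_input_v2.phy").read())` would return an empty list for a file that passes these checks, or one message per problem otherwise.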

maggiesudo commented 5 months ago

Thanks for the speedy reply! In the original file they have the same length, which is 73,875 sites for each line. I reran with the command as suggested and got the following error:

RBA partial loading: OFF |noname| |BIN+FC+G4m| || [00:00:00] Reading alignment from file: binary_input_v2.phy

ERROR: Error loading MSA: Unable to parse PHYLIP file: binary_input_v2.phy (LIBPLL-232): Sequence 1 (LOC_Os01g01010) longer than expected

amkozlov commented 5 months ago

OK, this indicates that there must be a formatting issue with your PHYLIP file. Can you open it with another program, e.g. AliView?

maggiesudo commented 5 months ago

Hi, you are right. The error stemmed from formatting issues with my PHYLIP file. It is working now with the new PHYLIP file. I converted my binary data from a .csv file to .fasta, and then used AliView to save the .fasta as .phylip. I am not sure if that's the right way to do it, but raxml-ng is running now! I did not perform an MSA step on my binary data, though. Is that step necessary to prepare the input for RAxML?
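
The conversion described above could also be done directly, without the intermediate FASTA file. Below is a rough Python sketch of that idea, not the workflow actually used in this thread. It assumes a hypothetical CSV layout: a header row, then one row per sample, with the sample name in the first column and one 0/1 value per gene in the remaining columns. The function name `csv_to_phylip` is made up for illustration.

```python
import csv
import io


def csv_to_phylip(csv_text):
    """Convert a presence/absence CSV (one row per sample) to PHYLIP text.

    Assumed layout: header row, then <sample name>,<0|1>,<0|1>,...
    """
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = rows[0]
    body = [r for r in rows[1:] if r]  # skip blank lines
    n_sites = len(header) - 1          # first column holds the sample name

    out = [f"{len(body)} {n_sites}"]
    for row in body:
        name = row[0].strip()
        states = [cell.strip() for cell in row[1:]]
        if len(states) != n_sites:
            raise ValueError(f"{name}: {len(states)} values, expected {n_sites}")
        if any(s not in ("0", "1") for s in states):
            raise ValueError(f"{name}: values must be 0 or 1")
        # One record per line: name, a space, then the concatenated states.
        out.append(f"{name} {''.join(states)}")
    return "\n".join(out) + "\n"
```

For example, the text `sample,g1,g2,g3` followed by the rows `lineA,1,0,1` and `lineB,0,0,1` becomes a PHYLIP file with the header `2 3` and the records `lineA 101` and `lineB 001`. Sample names containing spaces would need to be cleaned up first, since whitespace separates the name from the sequence.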

amkozlov commented 5 months ago

Well, if every column in your file encodes presence/absence of one specific gene, then your data is already "aligned".
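
To illustrate the point about columns: in the layout raxml-ng expects, each row is one sample and each column is one gene. If a table happens to be stored the other way around (one row per gene, one character per sample), it would need to be transposed first. The following is a small Python sketch under that assumption. The function name `transpose_matrix` and the dictionary input are hypothetical.

```python
def transpose_matrix(rows):
    """Turn {gene: '0101...'} (one character per sample) into one string per sample."""
    genes = list(rows)
    lengths = {len(rows[g]) for g in genes}
    if len(lengths) != 1:
        raise ValueError("all gene rows must cover the same number of samples")
    # zip(*...) walks the gene rows in parallel, yielding one tuple of states per sample.
    return ["".join(col) for col in zip(*(rows[g] for g in genes))]
```

With three genes scored in two samples, `transpose_matrix({"g1": "10", "g2": "11", "g3": "01"})` returns `["110", "011"]`: one string per sample, with position k holding the state of the k-th gene.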

maggiesudo commented 5 months ago

Okay thank you!