Creating input for analysis

amit4mchiba commented 1 year ago

Hi Agora-team,

I have been trying for a long time to learn how to reconstruct ancestral genome karyotypes and was never able to understand it properly. Then I saw your paper and I was like, wow. It is explained so nicely that I feel I can do it. While I understand from the manual how to run the program, I wonder what the recommended process is for generating the input files. For example, I have 20 plant genomes, and the genome assemblies are chromosome-scale with good contiguity. I took the annotated genomes, used the protein files, and ran OrthoFinder. The OrthoFinder analysis produced gene files for all the families and also the species file. Is it OK to feed the output of this program directly to AGORA? What preprocessing would be needed? I am not sure how to prepare the input files, so your suggestions and guidance will be highly appreciated.

thanks and regards Amit

alouis72 commented 1 year ago

Hi Amit, I wrote a script to reformat the orthogroups from OrthoFinder. It is on the dev branch of the AGORA GitHub repository:

https://github.com/DyogenIBENS/Agora/tree/dev/src/import/orthofinder_hogs

You should be able to run it on the data produced by OrthoFinder2, located in:

Results_XXXX/Phylogenetic_Hierarchical_Orthogroups/

For AGORA you will have to use the species tree produced by OrthoFinder, which should be:

Results_XXXX/Species_Tree/SpeciesTree_rooted_node_labels.txt

To run the script (here is an example):

First, create the directory where you want to write the orthogroups for AGORA:

    mkdir -p Agora_data/orthoGroups

Then run the script on the OrthoFinder HOGs (for me located in "$HOME/src/OrthoFinder/tmp/Results_Apr07/Phylogenetic_Hierarchical_Orthogroups"):

    python ./convert_hogs_sp.py -of_hogs $HOME/src/OrthoFinder/tmp/Results_Apr07/Phylogenetic_Hierarchical_Orthogroups -outdir Agora_data/orthoGroups
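
For anyone who prefers to drive the conversion from Python, the two steps above could be wrapped roughly as follows. This is only a sketch: the function names build_command and convert_hogs are made up for illustration, and the only things taken from the example above are the script name and its -of_hogs and -outdir options.

```python
import subprocess
import sys
from pathlib import Path


def build_command(hogs_dir, out_dir, script="./convert_hogs_sp.py"):
    """Build the argument list for the conversion script (hypothetical helper).

    Only the -of_hogs and -outdir options shown in the example above are used.
    """
    return [sys.executable, script, "-of_hogs", str(hogs_dir), "-outdir", str(out_dir)]


def convert_hogs(hogs_dir, out_dir, script="./convert_hogs_sp.py"):
    """Create the output directory, then run the conversion script."""
    hogs_dir, out_dir = Path(hogs_dir), Path(out_dir)
    if not hogs_dir.is_dir():
        # Fail early with a clear message instead of letting the script crash.
        raise FileNotFoundError(f"OrthoFinder HOG directory not found: {hogs_dir}")
    out_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    subprocess.run(build_command(hogs_dir, out_dir, script), check=True)


# Example call (paths are placeholders, adjust to your own run):
# convert_hogs("Results_XXXX/Phylogenetic_Hierarchical_Orthogroups",
#              "Agora_data/orthoGroups")
```

Here sys.executable is used instead of a bare "python" so that the script runs with the same interpreter as the wrapper; that is a choice of this sketch, not a requirement of the script.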

Please tell me if you manage to use it and succeed in running AGORA with it.

regards, Alexandra

amit4mchiba commented 1 year ago

Dear Alexandra,

I am now trying this. Please give me a day, and I will be back with good or sad news...:)

Thank you so much for such a prompt reply and your help.

with best regards Amit

amit4mchiba commented 1 year ago

Hi Alexandra,

It worked: I can see that several files have been created, and the folder "ancGenomes" contains the ancestral genomes. Could you point me to the part of the manual that describes the results? In my case, I can see ancestral genomes at different nodes. The results for the species nodes contain "CARs", which stands for contiguous ancestral regions. These represent chromosomes, right? In my case there are thousands of CARs, so this must be a fragmented genome of the ancestor, right? Do you have any advice on how to optimize this? Also, how can I assess the quality of the results?
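
One way to get a first impression of how fragmented a reconstruction is would be to count the genes in each CAR and summarize the size distribution. The sketch below assumes that an ancestral genome file is tab-separated text with one ancestral gene per line and the CAR identifier in the first column; that layout is an assumption here and should be checked against the actual files. The helper names car_sizes and n50 are illustrative and not part of AGORA.

```python
from collections import Counter


def car_sizes(lines):
    """Count genes per CAR.

    Assumes one gene per line and the CAR identifier in the first
    tab-separated column (an assumption about the file layout).
    """
    sizes = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sizes[line.split("\t")[0]] += 1
    return sizes


def n50(sizes):
    """Size of the CAR at which the largest CARs together hold half of all genes."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    running = 0
    for size in ordered:
        running += size
        if 2 * running >= total:
            return size
    return 0


# Toy input: three CARs holding 3, 1 and 1 genes.
toy = [
    "1\t0\t1\t1\tgeneA",
    "1\t1\t2\t-1\tgeneB",
    "1\t2\t3\t1\tgeneC",
    "2\t0\t1\t1\tgeneD",
    "3\t0\t1\t1\tgeneE",
]
sizes = car_sizes(toy)
print("CARs:", len(sizes), "N50 (genes):", n50(sizes.values()))
```

For a compressed file, the lines could be read with bz2.open(path, "rt") from the standard library and passed to car_sizes in the same way. A large number of single-gene CARs next to a few very large ones reads quite differently from thousands of mid-sized blocks, so looking at the whole distribution is more informative than the raw count.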

I am so sorry to ask so many questions. I am now going through your paper to see if I can find the answers myself. Also, could you direct me to a document that explains how to set the constraints? I am working on plant genomes, which are expected to have undergone whole-genome duplications (WGDs) many times, so I assume that these constraints are based on that. A similar explanation is given by IAGS (https://github.com/xjtu-omics/IAGS). If I am not wrong, the species tree and the orthogroups that one gets from the Phylogenetic_Hogs data should take care of duplications and similar events, right?

Again, apologies for asking too many questions; any guidance would be highly appreciated.

thanks and regards Amit

amit4mchiba commented 1 year ago

Dear Alexandra,

I am so sorry for asking so many questions, but could you please advise how to ascertain the number of chromosomes of an ancient genome using extant genomes? I understand that we can run AGORA on extant genomes, and that this reconstructs ancestral genomes with assigned CARs and gives detailed information on the orthologs and the conserved synteny. Is there any way the results from AGORA could be used to predict the number of chromosomes that the ancestral genome had at a certain node? I have searched a lot of articles and have not been able to find the method. I read Jerome Salse's Nature Genetics paper from 2017, in which he compares genomes to obtain the synteny and CARs, but how he arrives at the conclusion that the ancestor should have had X chromosomes is not clear to me.

Is there any way one could predict, guess, or suggest this using AGORA's output? I see that the Genomicus database has karyotypes for plants. How does one get to that point?

Lastly, is there any way one could compare an ancestral genome derived from AGORA with the Genomicus database?

I am so sorry for asking so many questions, and I will be very grateful for your advice and guidance.
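
As a very rough first look at this question, one could ask how many of the largest CARs are needed to hold most of the reconstructed genes. The sketch below is a plain descriptive statistic, not the method used by AGORA, Genomicus, or the paper mentioned above, and it does not by itself establish a chromosome number, since one ancestral chromosome may be split across several CARs. The function name cars_covering, the 0.9 default, and the toy sizes are all illustrative assumptions.

```python
def cars_covering(sizes, fraction=0.9):
    """Number of largest CARs needed to hold `fraction` of all genes.

    `sizes` is an iterable of CAR sizes in genes (for example, genes per CAR
    counted from an ancestral genome file).
    """
    ordered = sorted(sizes, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, size in enumerate(ordered, start=1):
        running += size
        if running >= target:
            return count
    return len(ordered)


# Toy example: three large CARs plus a tail of tiny fragments.
toy_sizes = [400, 350, 300, 5, 3, 2, 1, 1]
print(cars_covering(toy_sizes))        # largest CARs holding 90% of genes
print(cars_covering(toy_sizes, 0.99))  # largest CARs holding 99% of genes
```

If the answer stays small and stable as the fraction is raised, the reconstruction is dominated by a few chromosome-sized blocks; if it keeps growing, the genome is fragmented and the count says little about the karyotype.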

thank you so much,

with best regards Amit

DyogenIBENS / Agora

Creating input for analysis #26