Closed by kittyBS 11 months ago
Hi! The orthologies file is only needed if you are using the preparation_input_tool. All that matters is that you end up with an input file in the following format for coding genes:

    Ref_GeneID Ortholog1(Int_GeneID|virtual_coordinate|Chromosome|strand) Ortholog2 Ortholog3 ... OrthologN

For example:

    BL24937 ENSG00000122565|141|7|+ ENSG00000108468|776|17|- ENSG00000094916|482|12|-

Here BL24937 is one of the reference species genes (in this case one of the already sorted amphioxus genes) and the three ENSGXXXX entries are three orthologs found in human (the interrogated species), on chromosome 7 at position 141, chromosome 17 at position 776 and chromosome 12 at position 482 respectively.

This file can be generated in lots of different ways, one of them of course being the preparation_input_tool. Only if you use this tool will you need a file_orthologies file. That file can be obtained with orthofinder or any other orthology assessment software, and "Orthologic_family_code(number)" and "Orthologic_subfamily_code(number)" are simply the numbers assigned (by orthofinder in this case) to each family or subfamily.

So, I’ll try to explain these using the example provided in the preparation_README.md. Each row has four columns: family code, species, gene ID and subfamily code.

    22 Hsa ENSG00000127951 22
    22 Bla BL74412 22
    29 Hsa ENSG00000116194 29
    29 Hsa ENSG00000136859 29
    29 Hsa ENSG00000130812 3623
    29 Bla BL05848 29
    29 Bla BL12211 3623
    31 Hsa ENSG00000101280 31
    31 Hsa ENSG00000091879 31
    31 Hsa ENSG00000154188 32
    31 Bla BL13639 31
    31 Bla BL24935 31
    31 Bla BL00184 31
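To make the layout of these rows concrete, here is a minimal Python sketch (not the actual code of the preparation_input_tool) that reads them as four whitespace-separated columns and groups the gene IDs either by family or by subfamily. The function name `group_genes` and its `column` parameter are hypothetical, and grouping on the subfamily code alone assumes that subfamily numbers are not reused across families.

```python
from collections import defaultdict

# Rows copied from the example above: family, species, gene ID, subfamily.
EXAMPLE = """\
22 Hsa ENSG00000127951 22
22 Bla BL74412 22
29 Hsa ENSG00000116194 29
29 Hsa ENSG00000136859 29
29 Hsa ENSG00000130812 3623
29 Bla BL05848 29
29 Bla BL12211 3623
31 Hsa ENSG00000101280 31
31 Hsa ENSG00000091879 31
31 Hsa ENSG00000154188 32
31 Bla BL13639 31
31 Bla BL24935 31
31 Bla BL00184 31
"""


def group_genes(text, column="family"):
    """Group gene IDs by family or subfamily code, split by species.

    Hypothetical helper: assumes four whitespace-separated columns per row.
    """
    if column not in ("family", "subfamily"):
        raise ValueError("column must be 'family' or 'subfamily'")
    groups = defaultdict(lambda: defaultdict(list))
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        family, species, gene, subfamily = line.split()
        key = family if column == "family" else subfamily
        groups[key][species].append(gene)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {code: dict(by_species) for code, by_species in groups.items()}


families = group_genes(EXAMPLE, column="family")
subfamilies = group_genes(EXAMPLE, column="subfamily")

print(families["29"])       # all human and amphioxus genes of family 29
print(subfamilies["3623"])  # only the genes assigned to subfamily 3623
```

Grouping on the family column keeps the five genes of family 29 together, while grouping on the subfamily column separates ENSG00000130812 and BL12211 into their own group, 3623.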
In this case family 22 (again, a number assigned by orthofinder to this orthologous cluster) is composed of ENSG00000127951 and BL74412, and both belong to subfamily 22 (meaning there are no subfamilies). Pretty straightforward.

In family 29, however, we can see that although ENSG00000116194, ENSG00000136859, ENSG00000130812, BL05848 and BL12211 are all members of the same orthologous cluster (or family), with the assigned number 29, two of them, ENSG00000130812 and BL12211, belong to a different subfamily, cluster number 3623. What does this mean? In this context, BL12211 and BL05848 appear to be amphioxus paralogs that already existed before the two rounds of whole genome duplication; after different deletions, three genes remained in human, two of them more closely related to BL05848 and one more closely related to BL12211.

Finally, this file was created by keeping only the human and amphioxus entries from a larger multi-species file generated with orthofinder. That’s why family 31 contains a human gene with subfamily 32, a subfamily that has no amphioxus member here. I hope I have cleared up your doubts!

In any case, when using the preparation_input_tool, the subfamily value is not strictly required, but it is recommended: you can easily switch the column used to the subfamily one and run the analysis with closer orthology groups. Also, take into account that this script must be modified to work with any interrogated species other than human!
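Coming back to the input file format described at the start of this reply, a small parser can help check a file produced by some other route. The sketch below is only an illustration under stated assumptions: fields are separated by whitespace, the virtual coordinate is an integer, and the strand is either + or -. The names `Ortholog` and `parse_input_line` are hypothetical and are not part of the tool.

```python
from typing import NamedTuple


class Ortholog(NamedTuple):
    gene_id: str     # gene ID in the interrogated species
    position: int    # virtual coordinate (assumed to be an integer)
    chromosome: str  # kept as text, since names like "X" are not numeric
    strand: str      # "+" or "-"


def parse_input_line(line):
    """Split one input line into the reference gene and its orthologs.

    Hypothetical helper: assumes whitespace-separated fields of the form
    Int_GeneID|virtual_coordinate|Chromosome|strand after the reference gene.
    """
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    ref_gene, *ortholog_fields = fields
    orthologs = []
    for field in ortholog_fields:
        parts = field.split("|")
        if len(parts) != 4:
            raise ValueError(f"expected 4 '|'-separated values, got: {field!r}")
        gene_id, coordinate, chromosome, strand = parts
        if strand not in ("+", "-"):
            raise ValueError(f"unexpected strand in: {field!r}")
        orthologs.append(Ortholog(gene_id, int(coordinate), chromosome, strand))
    return ref_gene, orthologs


# The example line quoted earlier in this reply.
example = (
    "BL24937 ENSG00000122565|141|7|+ "
    "ENSG00000108468|776|17|- ENSG00000094916|482|12|-"
)
ref, orthologs = parse_input_line(example)
for ortholog in orthologs:
    print(ref, ortholog.gene_id, ortholog.chromosome, ortholog.position, ortholog.strand)
```

Running every line of a hand-made input file through a check like this catches malformed ortholog fields before the analysis starts.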
If you have any further questions, don’t hesitate to contact me again. I’ll be glad to help!
Hello, I'm new to this field, so please forgive me for bothering you again. I'm having trouble figuring out how to obtain the necessary information for creating the file_orthologies dataset, specifically regarding the Orthologic_family_code(number) and Orthologic_subfamily_code(number). Could you provide some guidance on how to address this issue? Thank you.