davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
699 stars 188 forks source link

Using OrthoFinder for pan-genomics? #327

Closed soungalo closed 4 years ago

soungalo commented 4 years ago

Hello, I am developing a pipeline for construction of pan genomes in Eukaryotes and was wondering if and how OF can be used. In general, the pipeline follows these steps:

  1. Assemble genomes of multiple varieties/strains/individuals of the same species
  2. Annotate each genome to produce predicted protein sequences
  3. Compare proteins sets in order to create a matrix of gene presence/absence per genome

As you can understand from step 3, some clustering procedure is needed in order to match proteins that are essentially "the same gene". This is a bit different from the common practice of detecting orthologs between species, although I guess the basic concept is the same. What's important is that I don't want to cluster together genes of the same family (paralogs) - just actual "orthologs". This is why I think OF can work well for this case, but not sure how to configure/tweak it, and looking for some advice. Would you suggest that I try and break orthogroups on the basis of duplications inference (e.g. in the Duplications.csv file)? Another option - should I try modifying some cutoffs that will make clustering more stringent? Can I (and should I) set some cutoff on the E-value of blast/diamond matches? Or maybe the inflation parameter can help me? Any other advice would be welcome. Thank you!

davidemms commented 4 years ago

Hi Lior

Thanks for getting in touch. The default approach with more than two species/varieties/strains is orthogroups. With species this works well. Hypothetically, if the species are really closely related and no gene duplication events have taken place then each orthogroup would only have one gene per species and all these genes would be orthologs of one another. Sticking with species for just a bit longer, it's important to understand what is and isn't mathematically possible. Namely, if gene duplication events have occurred, it is mathematically impossible to cluster genes into disjoint sets of orthologs such that all orthologs are in the same set and the sets contain no paralogs. Is your concern with not wanting paralogs related to this issue of gene duplication events creating within-orthogroup paralogs (i.e. https://github.com/davidemms/OrthoFinder#orthogroups-orthologs--paralogs)?

I haven't tried OrthoFinder myself on many strains from the same species. I assume you are trying to deal with cases with multiple genes per strain in the orthogroups? Are your problems with the very largest orthogroups (first 10 to 50 orthogroups)? Or do many of the more 'normal' orthogroups have multi-copy genes?

For the more normal orthogroups, are the gene duplication events at the root of the tree, with the two copies seen for many of the strains? Or are the near the tips of the tree and affecting a single strain? If you could send me an example tree, that might be best for discussing what the best options would be.

All the best David

soungalo commented 4 years ago

Hi David, and thanks for your reply. I'll give some more details and also some examples. My data (currently) consists of four Arabidopsis thaliana ecotypes ("strains"), which I assembled and automatically annotated. I then just fed all predicted proteins to OrthoFinder. The histogram of orthogroups size looks like this: hist so the vast majority of OGs seem totally valid and don't contain any duplications. My problem is with all OGs having 5,6 or more genes (~14% of all OGs). Let's take OGs with six genes for example. There are 269 such OGs, of which 86% have one gene in two ecotypes and two genes in the other two ecotypes. I took five such OGs as test cases. As you can see in the attached Duplications.csv excerpt, duplications occur in various places on the trees (also attached). So here is what I was thinking: maybe I could use the pairwise orthology inferences in the Orthologues directory (see attached example) to refine the OGs by breaking them in a way that will ensure no more than one gene per ecotype exists in the same OG. I am not exactly sure how yet, but it sounds like some variant of the algorithms described in this study. This assumes that orthology inferences made according to the gene trees are reliable. What do you think? Can this work? Do you have any experience with such analysis? Or perhaps you have other ideas on how I can achieve this result?

Thanks a lot! Lior

Orthogroups.txt Duplications.txt SpeciesTree_rooted_node_labels.txt gene_trees.txt An-1.maker.proteinsvCan-0.maker.proteins.txt

davidemms commented 4 years ago

Hi Lior

Thanks for the extra info, that makes things a lot clearer.

I would think the simplest, but also the most correct thing to do would be to do the presence and absence at the orthogroup level. Each orthogroup is a 'gene' and each ecotype either has or doesn't have that gene. I believe most of these orthogroups with more than 4 genes are valid. It doesn't matter if one or more of your ecotypes happens to have an extra copy, that happens all the time, a gene getting duplicated is just like any other type of mutation in a genome, and this will have occurred for some of your genes in some of your ecotypes. The branch lengths show there has been virtually no divergence so they are essentailly the same gene. To see how frequently genes duplicate look at the example results directory you can download here (or run an analysis yourself with a few speicies): https://davidemms.github.io/orthofinder_tutorials/diving-into-the-results.html, there are many duplications on the terminal branches of the species tree and many of these duplicates are frequently lost again equally quickly. For your Step 3 above, going to the trouble of removing duplicate genes will leave your presence/absence matrix the same. Remember also that it is possible for a gene to have more than one ortholog in another species, orthology relationships aren't always one-to-one.

There could be a few cases where splitting the orthogroups could be useful though. This is where OrthoFinder has incorrectly fused two orthogroups together and said it's just one orthogroup. In these cases you would split the two sets of genes since the correct action is to record the presence/absence for the two orthogroups separately that have been incorrectly fused together. These cases would be most easily identified by parsing the Duplications.csv file for duplications that occurred at the root of the species tree. You would then treat the two descendant sets as separate orthogroups. This is the one step I would recommend.

For the gene trees you provide, it does often look like one pair of genes may be more divergent and can be split off from the rest. If you have a reason for doing this (given the points above) and you want to do this as accurately as possible then you might need to use the gene trees rather than inferred orthologs. This is because within a population, it is ambiguous how a tree should be rooted since ILS makes it impossible to say that a particular species is the outgroup for a given gene tree. This makes orthology hard to gauge using standard approaches with a rooted gene tree. You might instead look for a long branch separating two clades of genes. If you're going to do this though, I would make sure that you have something to gain from removing these genes since they don't seem to affect the presence/absence information.

All the best David

soungalo commented 4 years ago

Hi David, and sorry for the late reply. Thanks for your interesting reply. I just want to update that I ended up implementing the "Minimum weight orthogonal partition" (MWOP) algorithm (explained in the study I mentioned above) based on OrthoFinder output. After adding a few variants to the algorithm, I am now able to break orthogroups into smaller clusters in which all genes are orthologous to all genes. I believe this is a necessary step when using OrthoFinder for pan-genome construction and hope to publish it (as part of a broader research) in the future. Just for reference, here is a comparison of the distributions of orthogroup sizes before and after applying MWOP on proteins of four Arabidopsis accessions:

hist I am closing this issue for now, but would be happy to discuss it further if you are interested.