WormBase / genedesc_generator

Automated gene descriptions generator for model organism databases

Create orthology module #16

Closed valearna closed 5 years ago

valearna commented 6 years ago
  1. Put the orthology sentence at the beginning of the description if an ortholog is present in the orthology file
  2. Species list and orthology rule:
    • Caenorhabditis elegans --use orthology to human
    • Caenorhabditis briggsae --use orthology to elegans
    • Caenorhabditis japonica --use orthology to elegans
    • Caenorhabditis remanei --use orthology to elegans
    • Caenorhabditis brenneri --use orthology to elegans
    • Brugia malayi --use orthology to elegans, if not present use orthology to Onchocerca
    • Onchocerca volvulus --use orthology to elegans, if not present, then use orthology to Brugia
    • Pristionchus pacificus --use orthology to elegans
    • Strongyloides ratti --use orthology to elegans, if not present, use orthology to Brugia and Onchocerca
    • Trichuris muris --use orthology to elegans, if not present, use orthology to Brugia
  3. For orthology to human genes, use only those human genes that have been predicted by more than one orthology prediction method
  4. Template of an orthology sentence
    • C. elegans:
      (i) is an ortholog of human \<human gene symbol> (\<human gene name>)
      (ii) is an ortholog of human \<human gene1 symbol> (\<human gene1 name>), \<human gene2 symbol> (\<human gene2 name>), and \<human gene3 symbol> (\<human gene3 name>)
      (iii) is an ortholog of members of the human \<human gene family name> gene family including \<human gene symbol1>
    • Non-elegans species:
      (i) is an ortholog of \<worm gene symbol>
      (ii) is an ortholog of \<worm gene1 symbol>, \<worm gene2 symbol>, and \<worm gene3 symbol>
      (iii) is an ortholog of members of the C. elegans \<gene class name> gene class including \<worm gene symbol1>; up to three genes sorted by decreasing popularity (using Textpresso paper score)
      (iv) is an ortholog of members of the C. elegans \<gene class1 name>, \<gene class2 name>, and \<gene class3 name> gene classes including \<worm gene symbol1>; up to three genes sorted by decreasing popularity; also limit the number of gene classes to 3, based on member Textpresso paper popularity score

  5. Template rules
    • take ensembl_gene_id from col 3 in the orthology file and query https://rest.genenames.org/fetch/ensembl_gene_id/\<ID> to get the gene name and symbol.
    • take the gene(s) with the highest number of methods (up to 3 genes) and apply template (i) or (ii)
    • if more than 3 genes have the same number of methods, identify human gene families (C. elegans only) through https://rest.genenames.org/fetch/ensembl_gene_id/\<ID> and show one gene per family using template (iii); for all other species, use the gene class in the same way as the gene family.
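
A rough sketch of how the species rule, the "highest number of methods" rule, and templates (i)/(ii) could fit together. All function names and the abbreviated species labels are made up for illustration, and reading "Brugia and Onchocerca" for Strongyloides ratti as one combined fallback tier is an assumption:

```python
from typing import Dict, List, Tuple

# Fallback tiers per species, transcribed from the species list in this issue.
# The abbreviated labels are illustrative, not identifiers used by the project.
ORTHOLOGY_TIERS: Dict[str, List[Tuple[str, ...]]] = {
    "C. elegans": [("human",)],
    "C. briggsae": [("C. elegans",)],
    "C. japonica": [("C. elegans",)],
    "C. remanei": [("C. elegans",)],
    "C. brenneri": [("C. elegans",)],
    "B. malayi": [("C. elegans",), ("O. volvulus",)],
    "O. volvulus": [("C. elegans",), ("B. malayi",)],
    "P. pacificus": [("C. elegans",)],
    # Assumption: "Brugia and Onchocerca" is read as a single combined tier.
    "S. ratti": [("C. elegans",), ("B. malayi", "O. volvulus")],
    "T. muris": [("C. elegans",), ("B. malayi",)],
}


def orthologs_for(species: str, by_target: Dict[str, List[str]]) -> List[str]:
    """Return orthologs from the first tier that has any, or an empty list."""
    for tier in ORTHOLOGY_TIERS[species]:
        found = [gene for target in tier for gene in by_target.get(target, [])]
        if found:
            return found
    return []


def best_by_methods(method_counts: Dict[str, int]) -> List[str]:
    """Return the genes supported by the highest number of prediction methods."""
    if not method_counts:
        return []
    top = max(method_counts.values())
    return sorted(gene for gene, n in method_counts.items() if n == top)


def join_items(items: List[str]) -> str:
    """Join as 'a', 'a and b', or 'a, b, and c'."""
    if len(items) <= 2:
        return " and ".join(items)
    return ", ".join(items[:-1]) + ", and " + items[-1]


def human_ortholog_sentence(genes: List[Tuple[str, str]]) -> str:
    """Templates (i) and (ii) for C. elegans; genes are (symbol, name) pairs."""
    if not 1 <= len(genes) <= 3:
        raise ValueError("templates (i)/(ii) cover one to three genes")
    return "is an ortholog of human " + join_items([f"{s} ({n})" for s, n in genes])
```

If `best_by_methods` returns more than 3 genes, the tie-breaker rules further down would apply instead of templates (i)/(ii).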

How to pick orthologs for non-elegans species when tied for orthology methods (tie-breaker rules): Some non-elegans genes have too many C. elegans genes listed as orthologs; these will be pruned using popularity (number of publications) and gene class name:

  1. If there are more than 3 orthologs in the form of gene names (e.g., abu-6, abu-7, abu-8), use popularity and gene class to prune
  2. Genes that don't have any other members of their class are picked and mentioned first, ordered by popularity; if tied on popularity, order alphabetically
  3. If there is only one gene class, use popularity to pick the top 3 genes and order them by popularity; if tied, use numerical order
  4. If there is more than one gene class, use popularity to pick the top gene in each class, meaning you would name the leading (in popularity) gene class first; if tied, order alphabetically; in both cases
  5. If the orthologs are cosmid names (e.g., C54D10.9), list up to 3, as gene class cannot be used
  6. If both genes and cosmids are present, use the gene class and popularity rules for the genes and leave the cosmids as is (up to 3 in total)
  7. If genes without gene classes and genes with gene classes are both present, use popularity to pick either/both genes and gene classes (up to 3 in total). Order the genes first and then the gene classes, and mention a single gene for each gene class, picked by popularity.
  8. If neither popularity nor gene class can be used, leave as is, up to 3.
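
One possible reading of the tie-breaker rules above, as a sketch only. The `popularity` mapping (e.g. publication counts per gene) is a hypothetical input, and the regular expression used to tell gene names from cosmid names is an assumption about the naming pattern, not something taken from the project:

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumed pattern for named genes such as abu-6: class prefix, hyphen, number.
GENE_NAME = re.compile(r"([a-z]{2,4})-(\d+)")


def prune_orthologs(names: List[str], popularity: Dict[str, int], limit: int = 3) -> List[str]:
    """Pick up to `limit` orthologs using gene class and popularity."""
    by_class: Dict[str, List[str]] = defaultdict(list)
    cosmids: List[str] = []
    for name in names:
        match = GENE_NAME.fullmatch(name)
        if match:
            by_class[match.group(1)].append(name)
        else:
            cosmids.append(name)  # e.g. C54D10.9: no gene class available

    def pop(name: str) -> int:
        return popularity.get(name, 0)

    def number(name: str) -> int:
        return int(GENE_NAME.fullmatch(name).group(2))

    if len(by_class) == 1:
        # Rule 3: single gene class, rank members by popularity, then number.
        members = next(iter(by_class.values()))
        picked = sorted(members, key=lambda n: (-pop(n), number(n)))
    else:
        # Rule 2: genes alone in their class come first, by popularity then name.
        singles = sorted(
            (m[0] for m in by_class.values() if len(m) == 1),
            key=lambda n: (-pop(n), n),
        )
        # Rule 4: one leading gene per remaining class, most popular class first.
        leaders = [
            min(m, key=lambda n: (-pop(n), number(n)))
            for m in by_class.values()
            if len(m) > 1
        ]
        leaders.sort(key=lambda n: (-pop(n), n))
        picked = singles + leaders

    # Rules 5, 6, 8: cosmids are left in their given order; cap the total.
    return (picked + cosmids)[:limit]
```

For example, with `abu-6`, `abu-7`, `abu-8` and `abu-10` as the only orthologs, the three most popular abu genes are kept, with ties resolved by the numeric suffix.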

How to pick human orthologs for C. elegans when tied for orthology methods (tie-breaker rules):

  1. When 3 or more human orthologs are present, group the human genes by their family names and mention the first member of the family
  2. If more than one family is present, include the first member from each family
  3. If a gene does not fall into any human gene family, leave as is, and mention this gene first, with the word 'human' before it.
  4. If a gene is the only member of a human gene family, mention the gene with the word 'human' before it, and not the family.
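
A minimal sketch of the human tie-breaker rules, assuming each ortholog comes as a (symbol, family name) pair with `None` when no family is known. The function name is hypothetical, "first member" is taken to mean the first one listed in the input, and "only member of a family" is interpreted as the only member among the orthologs at hand:

```python
from typing import Dict, List, Optional, Tuple


def human_ortholog_phrases(genes: List[Tuple[str, Optional[str]]]) -> List[str]:
    """Turn (symbol, family name or None) pairs into phrases, ungrouped genes first."""
    families: Dict[str, List[str]] = {}
    unfamilied: List[str] = []
    for symbol, family in genes:
        if family is None:
            unfamilied.append(symbol)
        else:
            families.setdefault(family, []).append(symbol)

    # Rule 3: genes outside any family are mentioned first, prefixed with "human".
    phrases = [f"human {symbol}" for symbol in unfamilied]
    # Rule 4: a gene that is alone in its family is named directly, not via the family.
    phrases += [f"human {m[0]}" for m in families.values() if len(m) == 1]
    # Rules 1-2: each remaining family is represented by its first listed member.
    phrases += [
        f"members of the human {family} gene family including {m[0]}"
        for family, m in families.items()
        if len(m) > 1
    ]
    return phrases
```

The family phrase follows the wording of the non-elegans gene class template, since the human version of template (iii) is only partly legible in the issue text.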
valearna commented 6 years ago

Instead of creating a request to Human gene names API for each gene, we can download a file containing the data we need for all genes at the same time. This will save A LOT of time.

https://www.genenames.org/cgi-bin/download?col=gd_hgnc_id&col=gd_app_sym&col=gd_app_name&col=gd_pub_ensembl_id&col=family.id&col=family.name&status=Approved&status=Entry+Withdrawn&status_opt=2&where=&order_by=gd_app_sym_sort&format=text&limit=&hgnc_dbtag=on&submit=submit
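
Something like the following could index the downloaded file by Ensembl gene ID. This is only a sketch: it assumes the file is tab-separated with one header row and that the columns arrive in the same order as the `col=` parameters in the URL above (HGNC ID, symbol, name, Ensembl ID, family ID, family name). `HgncRecord` and `load_hgnc_table` are hypothetical names:

```python
import csv
from typing import Dict, NamedTuple, TextIO


class HgncRecord(NamedTuple):
    hgnc_id: str
    symbol: str
    name: str
    family_id: str
    family_name: str


def load_hgnc_table(handle: TextIO) -> Dict[str, HgncRecord]:
    """Index HGNC rows by Ensembl gene ID, skipping rows without one."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row (assumed to be present)
    table: Dict[str, HgncRecord] = {}
    for row in reader:
        row = row + [""] * (6 - len(row))  # pad rows with missing trailing columns
        hgnc_id, symbol, name, ensembl_id, family_id, family_name = row[:6]
        if not ensembl_id:
            continue  # nothing to match against the orthology file
        table[ensembl_id] = HgncRecord(hgnc_id, symbol, name, family_id, family_name)
    return table
```

A lookup such as `table.get(ensembl_gene_id)` would then replace the per-gene REST request.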

valearna commented 5 years ago

We decided to remove rule 3, and we now include all human orthologs, including those predicted by only 1 method.