NathanSkene / EWCE

Expression Weighted Celltype Enrichment. See the package website for up-to-date instructions on usage.
https://nathanskene.github.io/EWCE/index.html

1:1 ortholog mapping using orthogene and One2One yields different results #61

Closed KittyMurphy closed 2 years ago

KittyMurphy commented 2 years ago

1. Bug description

I have used one2one and orthogene to get mouse orthologs from a list of human genes (n=119). Using orthogene retains 47/119 genes, whereas one2one retains 67/119.

2. Reproducible example

Code


# orthogene
or_genes_mouse <- row.names(orthogene::convert_orthologs(or_genes_human,input_species = "HUMAN",output_species = "mouse"))
#> Preparing gene_df.
#> character format detected.
#> Converting to data.frame
#> Extracting genes from input_gene.
#> 119 genes extracted.
#> Converting HUMAN ==> mouse orthologs using: gprofiler
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: HUMAN
#> Common name mapping found for human
#> 1 organism identified from search: hsapiens
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: mouse
#> Common name mapping found for mouse
#> 1 organism identified from search: mmusculus
#> Checking for genes without orthologs in mouse.
#> Extracting genes from input_gene.
#> 239 genes extracted.
#> Extracting genes from ortholog_gene.
#> 239 genes extracted.
#> Dropping 24 NAs of all kinds from ortholog_gene.
#> Checking for genes without 1:1 orthologs.
#> Dropping 120 genes that have multiple input_gene per ortholog_gene.
#> Dropping 23 genes that have multiple ortholog_gene per input_gene.
#> Filtering gene_df with gene_map
#> Setting ortholog_gene to rownames.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   72 / 119 (61%)
Total genes remaining after convert_orthologs :
   47 / 119 (39%)

# one2one
length(or_genes_human[or_genes_human %in% One2One::ortholog_data_Mouse_Human$orthologs_one2one$human.symbol])
#> [1] 67

Data

I have attached the human gene list (n=119).

3. Session info

```
# Paste utils::sessionInfo() output
utils::sessionInfo()
#> R version 4.1.0 (2021-05-18)
#> Platform: x86_64-apple-darwin17.0 (64-bit)
#> Running under: macOS Catalina 10.15.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.29 withr_2.5.0 magrittr_2.0.3 reprex_2.0.1
#> [5] evaluate_0.15 highr_0.9 stringi_1.7.6 rlang_1.0.2
#> [9] cli_3.2.0 rstudioapi_0.13 fs_1.5.2 rmarkdown_2.13
#> [13] tools_4.1.0 stringr_1.4.0 glue_1.6.2 xfun_0.30
#> [17] yaml_2.3.5 fastmap_1.1.0 compiler_4.1.0 htmltools_0.5.2
#> [21] knitr_1.38
```

OR_genes_human.csv

bschilder commented 2 years ago

Hey @KittyMurphy,

These differences are due to the different databases that orthogene can pull from. The default when using convert_orthologs by itself is method="gprofiler", which queries the g:Profiler web service (via the gprofiler2 package). The default method for all orthogene functions within EWCE, however, is "homologene", which is not only faster but also has better mappings between mouse and human (though it covers fewer species).

When you run using method="homologene", you see it returns 69 genes (2 more than one2one). This is because one2one actually uses an old static version of the NCBI HomoloGene database, whereas orthogene uses a periodically updated version of the same database.
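To see which genes account for the gap between two mapping sources, it can help to compare the retained gene sets directly rather than only their sizes. Below is a minimal base R sketch. `compare_retained` is a hypothetical helper (it is not part of orthogene or One2One), and the gene symbols are placeholders, not genes from the attached list.

```r
# Hypothetical helper: compare the input genes retained by two ortholog sources.
compare_retained <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  list(
    both   = intersect(a, b),  # retained by both sources
    only_a = setdiff(a, b),    # retained only by the first source
    only_b = setdiff(b, a)     # retained only by the second source
  )
}

# Placeholder symbols for illustration only.
kept_one2one    <- c("GENE1", "GENE2", "GENE3")
kept_homologene <- c("GENE2", "GENE3", "GENE4", "GENE5")

cmp <- compare_retained(kept_one2one, kept_homologene)
lengths(cmp)  # both = 2, only_a = 1, only_b = 2
```

In practice, `a` could be the input symbols found in `One2One::ortholog_data_Mouse_Human$orthologs_one2one$human.symbol`, and `b` the human genes retained by `convert_orthologs(..., method = "homologene")`, assuming the returned data.frame keeps the original symbols in an `input_gene` column.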

method="homologene"

or_genes_mouse <- orthogene::convert_orthologs(or_genes_human$V1,input_species = "HUMAN",output_species = "mouse", method="homologene")
Preparing gene_df.
character format detected.
Converting to data.frame
Extracting genes from input_gene.
119 genes extracted.
Converting HUMAN ==> mouse orthologs using: homologene
Retrieving all organisms available in homologene.
Mapping species name: HUMAN
Common name mapping found for human
1 organism identified from search: 9606
Retrieving all organisms available in homologene.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: 10090
Checking for genes without orthologs in mouse.
Extracting genes from input_gene.
92 genes extracted.
Extracting genes from ortholog_gene.
92 genes extracted.
Checking for genes without 1:1 orthologs.
Dropping 10 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 3 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Setting ortholog_gene to rownames.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   50 / 119 (42%)
Total genes remaining after convert_orthologs :
   69 / 119 (58%)

method="gprofiler"

or_genes_mouse <- orthogene::convert_orthologs(or_genes_human$V1,input_species = "HUMAN",output_species = "mouse", method="gprofiler")
Preparing gene_df.
character format detected.
Converting to data.frame
Extracting genes from input_gene.
119 genes extracted.
Converting HUMAN ==> mouse orthologs using: gprofiler
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: HUMAN
Common name mapping found for human
1 organism identified from search: hsapiens
Retrieving all organisms available in gprofiler.
Using stored `gprofiler_orgs`.
Mapping species name: mouse
Common name mapping found for mouse
1 organism identified from search: mmusculus
Checking for genes without orthologs in mouse.
Extracting genes from input_gene.
239 genes extracted.
Extracting genes from ortholog_gene.
239 genes extracted.
Dropping 24 NAs of all kinds from ortholog_gene.
Checking for genes without 1:1 orthologs.
Dropping 120 genes that have multiple input_gene per ortholog_gene (many:1).
Dropping 23 genes that have multiple ortholog_gene per input_gene (1:many).
Filtering gene_df with gene_map
Setting ortholog_gene to rownames.

=========== REPORT SUMMARY ===========

Total genes dropped after convert_orthologs :
   72 / 119 (61%)
Total genes remaining after convert_orthologs :
   47 / 119 (39%)
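
The "many:1" and "1:many" drops reported in the logs above come from restricting the mapping to strict 1:1 pairs. A rough base R sketch of that kind of filter on a made-up mapping table follows. The function `filter_one2one` and the gene symbols are hypothetical, and this is an illustration of the idea, not orthogene's actual implementation.

```r
# Hypothetical sketch: keep only strict 1:1 pairs from an ortholog mapping table.
filter_one2one <- function(map) {
  # Drop rows with no ortholog, then drop exact duplicate rows.
  map <- unique(map[!is.na(map$ortholog_gene), , drop = FALSE])
  # Flag input genes that map to more than one ortholog (1:many).
  multi_in <- duplicated(map$input_gene) |
    duplicated(map$input_gene, fromLast = TRUE)
  # Flag orthologs that are shared by more than one input gene (many:1).
  multi_out <- duplicated(map$ortholog_gene) |
    duplicated(map$ortholog_gene, fromLast = TRUE)
  map[!(multi_in | multi_out), , drop = FALSE]
}

# Made-up mapping: B is 1:many, C and D are many:1, E has no ortholog.
toy_map <- data.frame(
  input_gene    = c("A", "B", "B", "C", "D", "E"),
  ortholog_gene = c("a", "b1", "b2", "cd", "cd", NA),
  stringsAsFactors = FALSE
)

kept <- filter_one2one(toy_map)
kept$input_gene  # "A"
```

Because every gene involved in a non-1:1 relationship is removed, a database that reports more candidate orthologs per gene can end up with fewer genes after filtering, which is consistent with the gprofiler run extracting 239 rows yet retaining only 47 genes.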

KittyMurphy commented 2 years ago

Great, thank you, Brian!

NathanSkene commented 1 year ago

Interesting! If One2One says they are orthologs then I'd lean towards trusting that. Orthogene has numerous methods, right? Are some more conservative?


bschilder commented 1 year ago

@NathanSkene please see my previous explanation of the source of differences: https://github.com/NathanSkene/EWCE/issues/61#issuecomment-1137620249

When you run using method="homologene", you see it returns 69 genes (2 more than one2one). This is because one2one actually uses an old static version of the NCBI HomoloGene database, whereas orthogene uses a periodically updated version of the same database.