drostlab / myTAI

Evolutionary Transcriptomics with R
https://drostlab.github.io/myTAI/
GNU General Public License v2.0

Motivation - method bias #22

Closed kullrich closed 1 year ago

kullrich commented 1 year ago

Hi,

I can understand that, due to obligations to co-authors, you changed the Motivation section to name only GenEra as the gene-age classifier in the first place.

However, in my opinion this is not valid scientific practice. Earlier versions mentioned a whole range of software tools, all of which produce a gene-age map that can be used with myTAI. These can still be found here: https://drostlab.github.io/myTAI/articles/Introduction.html#retrieval-of-phylogenetic-or-taxonomic-information

Since myTAI is not restricted to GenERA gene age maps, please change accordingly.

E.g. I recently extracted gene age maps for the whole eggNOG 6 and PLAZA databases; should we now mention all 5000x species as pre-calculated phylomaps?

Best regards

Kristian

HajkD commented 1 year ago

Hi @kullrich

Thank you very much for reaching out and for notifying me about the unfortunate phrasing in the README.

As you correctly stated, the Introduction Vignette of myTAI clearly lists all existing software (at least the ones I am aware of and I am always grateful to anyone who is adding new software and sending me a pull request). Just to clarify, this listing is not only available in earlier versions of myTAI, but also in the current version (https://drostlab.github.io/myTAI/articles/Introduction.html#retrieval-of-phylogenetic-or-taxonomic-information) and this won't change in future versions.

What we meant to state in the Motivation section of the README is that, from the list of existing gene age inference software, we recommend using GenEra, because it addresses the homology detection issue and other shortcomings that were previously published as critiques of gene age inference (and which were not addressed in most of the listed software). In fact, the entire debate on the shortcomings of gene age inference can be followed here (https://drostlab.github.io/myTAI/articles/Phylostratigraphy.html) and we prominently advertise this discussion as part of the myTAI documentation.

Obviously, our recommendation is just our personal view based on our personal expert opinion and for that reason, we started to repopulate and resurrect our documentation page for all previously published phylostratigraphic maps, which contains detailed descriptions on which of the listed tools was used to generate the respective phylostratigraphic maps (see all details here: https://github.com/HajkD/published_phylomaps). Furthermore, @LotharukpongJS started to develop the data package phylomapr to make access to the various previously published phylostratigraphic maps as easy as possible for use with myTAI.

Hence, your generous offer to either add the 5000x phylostratigraphic maps you generated to the existing phylomaps collection (https://github.com/HajkD/published_phylomaps) or contribute to phylomapr or develop an analogous data package where these 5000x phylomaps can easily be accessed by users based on the scientific names of the respective organism would be greatly appreciated. Would it be possible to let me know with which software (was it orthomap?) and based on which approach these maps were computed and where they can be found currently?

As for the text in the Motivation section, I now rewrote it from:

To overcome this limitation, the myTAI package introduces procedures summarized under the term evolutionary transcriptomics to integrate gene age information inferred with GenEra into classical gene expression analysis. Previously inferred gene age information can be found here, of which recent precomputed gene age information can be retrieved via phylomapr.

To:

To overcome this limitation, the myTAI package introduces procedures summarized under the term evolutionary transcriptomics to integrate gene age information into classical gene expression analysis. Gene age inference can be performed with various existing software, but we recommend using GenEra or orthomap, since they address published shortcomings of gene age inference (see detailed discussion here). In addition, users can easily retrieve previously precomputed gene age information via our data package phylomapr.

I hope this clarifies our scientific motivation and personal views.

Best, Hajk

kullrich commented 1 year ago

Hi @HajkD, thank you for the in-depth response.

And yes, the pre-calculated phylomaps were created with orthomap, but of course the main work was done by Ana Hernández-Plaza and colleagues (see here for their latest update of the EggNOG database resource: https://academic.oup.com/nar/article/51/D1/D389/6833261) and by Michiel Van Bel and colleagues (see here for their latest update of the PLAZA database resource: https://academic.oup.com/nar/article/50/D1/D1468/6423187).

The 1322 eukaryotic species from EggNOG as well as all the PLAZA dicot and monocot species can be found here (https://zenodo.org/record/7803262). The prokaryotic species from EggNOG are another story: as you know, classical gene age classification is more complicated in prokaryotic species, subspecies and strains, which have a small number of core genes and a large number of accessory genes.

Since orthomap only considers orthologous groups, species-specific genes are by definition not classified directly (unless duplicated within the species). The experimenter should decide whether all unclassified genes should fall into the "oldest" or the "youngest" phylostratum, or not be considered at all.
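The three options for unclassified genes can be sketched in base R as follows. This is only an illustration on toy data: `assign_unclassified()`, `toy_map` and `toy_genes` are hypothetical names that are not part of orthomap or myTAI, and the sketch assumes that the smallest phylostratum number in the map is the oldest and the largest is the youngest.

```r
# Hypothetical helper (not part of orthomap or myTAI): decide what happens to
# genes that are absent from the gene age map.
assign_unclassified <- function(map, all_genes,
                                strategy = c("drop", "oldest", "youngest")) {
  strategy <- match.arg(strategy)
  missing_genes <- setdiff(all_genes, map$GeneID)
  # "drop" leaves the map untouched, so unclassified genes are simply ignored.
  if (strategy == "drop" || length(missing_genes) == 0) {
    return(map)
  }
  # Assumption: the smallest phylostratum in the map is the oldest,
  # the largest one is the youngest.
  ps <- if (strategy == "oldest") min(map$Phylostratum) else max(map$Phylostratum)
  rbind(map, data.frame(GeneID = missing_genes, Phylostratum = ps))
}

# Toy data, invented for illustration only.
toy_map   <- data.frame(GeneID = c("g1", "g2", "g3"), Phylostratum = c(1, 4, 7))
toy_genes <- c("g1", "g2", "g3", "g4", "g5")

# Put the two unclassified genes (g4, g5) into the youngest phylostratum.
assign_unclassified(toy_map, toy_genes, strategy = "youngest")
```

Whichever strategy is chosen, it seems worth reporting it alongside the results, since it changes how many genes sit in the outermost phylostrata.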

An individual species and its map can be extracted, e.g., like this:

# download the map for the eggNOG 6 database
download.file( url      = "https://zenodo.org/record/7803262/files/eggnog6_eukaryota_orthomaps.tsv.zip",
               destfile = "eggnog6_eukaryota_orthomaps.tsv.zip")
utils::unzip( zipfile = "eggnog6_eukaryota_orthomaps.tsv.zip",
              files = "eggnog6_eukaryota_orthomaps.tsv")

# install the readr package
install.packages("readr")

# install the dplyr package
install.packages("dplyr")

# load package readr
library(readr)

# load package dplyr
library(dplyr)

# read
eggnogg6 <- readr::read_tsv("eggnog6_eukaryota_orthomaps.tsv")

# show all available species by name
unique(eggnogg6$name)

# show all available species by taxID
unique(eggnogg6$taxID)

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by name
eggnogg6 %>% filter(name=="Arabidopsis thaliana")

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by taxID
eggnogg6 %>% filter(taxID==3702)

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by name to be directly used with myTAI
eggnogg6 %>% filter(name=="Arabidopsis thaliana") %>% select(GeneID=seqID, Phylostratum=PSnum)

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by taxID to be directly used with myTAI
eggnogg6 %>% filter(taxID==3702) %>% select(GeneID=seqID, Phylostratum=PSnum)
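As a rough sketch of the next step, a filtered map like the one above can be joined to an expression table by gene ID. The example uses only base R and invented toy data (`toy_map`, `toy_expr` and the stage columns are hypothetical). It assumes the layout described in the myTAI documentation for a PhyloExpressionSet: Phylostratum in the first column, gene ID in the second, followed by the expression columns. myTAI also provides a `MatchMap()` function for this kind of join; its documentation should be checked for the column order it expects.

```r
# Toy stand-in for a filtered map; the column names GeneID and Phylostratum
# follow the select() calls above, the values are invented.
toy_map <- data.frame(GeneID = c("gene_a", "gene_b", "gene_c"),
                      Phylostratum = c(3, 1, 12))

# Hypothetical expression table: gene IDs plus one column per stage.
toy_expr <- data.frame(GeneID = c("gene_b", "gene_c", "gene_d"),
                       stage1 = c(10.5, 2.0, 7.1),
                       stage2 = c(11.0, 3.5, 6.8))

# Inner join on GeneID: genes missing from either table are dropped.
joined <- merge(toy_map, toy_expr, by = "GeneID")

# Reorder so that Phylostratum comes first and GeneID second.
expr_cols <- setdiff(names(joined), c("Phylostratum", "GeneID"))
pes <- joined[, c("Phylostratum", "GeneID", expr_cols)]
pes
```

Genes without a phylostratum (here `gene_d`) silently disappear in the inner join, so it can be useful to compare `nrow(pes)` with the number of rows in the expression table.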

Best regards

Kristian

kullrich commented 1 year ago

And of course, not to forget the PLAZA data, again using Arabidopsis thaliana:

download.file( url      = "https://zenodo.org/record/7803262/files/plaza_v5_dicots_ORTHOFAM_orthomaps.tsv.zip",
               destfile = "plaza_v5_dicots_ORTHOFAM_orthomaps.tsv.zip")

utils::unzip( zipfile = "plaza_v5_dicots_ORTHOFAM_orthomaps.tsv.zip",
              files = "plaza_v5_dicots_ORTHOFAM_orthomaps.tsv")

# read
plaza_v5_dicots <- readr::read_tsv("plaza_v5_dicots_ORTHOFAM_orthomaps.tsv")

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by name
plaza_v5_dicots %>% filter(common_name=="Arabidopsis_thaliana")

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by taxID
plaza_v5_dicots %>% filter(taxID==3702)

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by name to be directly used with myTAI
plaza_v5_dicots %>% filter(common_name=="Arabidopsis_thaliana") %>% select(GeneID=seqID, Phylostratum=PSnum)

# restrict e.g. to Arabidopsis thaliana (taxID:3702) by taxID to be directly used with myTAI
plaza_v5_dicots %>% filter(taxID==3702) %>% select(GeneID=seqID, Phylostratum=PSnum)

or Physcomitrium patens (not Physcomitrella patens anymore)

plaza_v5_dicots %>% filter(common_name=="Physcomitrium_patens")
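Note that the EggNOG table above is filtered with a space-separated name ("Arabidopsis thaliana"), while the PLAZA table uses underscores in `common_name`. A small helper along these lines could bridge the two; `to_plaza_name()` is a hypothetical function, not part of orthomap, myTAI or PLAZA, and the only synonym it knows is the one mentioned above.

```r
# Hypothetical helper: convert a scientific name into the underscore form
# used in the common_name column of the PLAZA map.
to_plaza_name <- function(species) {
  # Synonym from the note above: Physcomitrella patens is now Physcomitrium patens.
  synonyms <- c("Physcomitrella patens" = "Physcomitrium patens")
  renamed <- ifelse(species %in% names(synonyms), synonyms[species], species)
  gsub(" ", "_", unname(renamed), fixed = TRUE)
}

to_plaza_name(c("Arabidopsis thaliana", "Physcomitrella patens"))
```

The result can then be passed to the filter, e.g. `filter(common_name == to_plaza_name("Physcomitrella patens"))`.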

see list of included species:

unique(plaza_v5_dicots$common_name)
 [1] "Anthoceros_agrestis"        "Aethionema_arabicum"       
 [3] "Acer_truncatum"             "Actinidia_chinensis"       
 [5] "Arabidopsis_lyrata"         "Avicennia_marina"          
 [7] "Amaranthus_hybridus"        "Aquilegia_oxysepala"       
 [9] "Arachis_hypogaea"           "Arabidopsis_thaliana"      
[11] "Amborella_trichopoda"       "Brassica_carinata"         
[13] "Brassica_napus"             "Brassica_oleracea"         
[15] "Brassica_rapa"              "Beta_vulgaris"             
[17] "Camellia_sinensis_var"      "Capsicum_annuum"           
[19] "Cannabis_sativa"            "Cicer_arietinum_L"         
[21] "Corylus_avellana"           "Chara_braunii"             
[23] "Coffea_canephora"           "Citrus_clementina"         
[25] "Ceratophyllum_demersum"     "Carpinus_fangiana"         
[27] "Cardamine_hirsuta"          "Carya_illinoinensis"       
[29] "Citrullus_lanatus"          "Cucumis_melo"              
[31] "Corchorus_olitorius"        "Carica_papaya"             
[33] "Chenopodium_quinoa"         "Chlamydomonas_reinhardtii" 
[35] "Capsella_rubella"           "Cucumis_sativus_L"         
[37] "Daucus_carota"              "Davidia_involucrata"       
[39] "Durio_zibethinus"           "Erigeron_canadensis"       
[41] "Eucalyptus_grandis"         "Erythranthe_guttata"       
[43] "Eutrema_salsugineum"        "Fragaria_x_ananassa"       
[45] "Fragaria_vesca"             "Gossypium_hirsutum"        
[47] "Glycine_max"                "Gossypium_raimondii"       
[49] "Helianthus_annuus"          "Hydrangea_macrophylla"     
[51] "Lupinus_albus"              "Lotus_japonicus"           
[53] "Lonicera_japonica"          "Lactuca_sativa"            
[55] "Magnolia_biondii"           "Micromonas_commoda"        
[57] "Malus_domestica"            "Manihot_esculenta"         
[59] "Marchantia_polymorpha"      "Medicago_truncatula"       
[61] "Nelumbo_nucifera"           "Nicotiana_tabacum"         
[63] "Olea_europaea"              "Oryza_sativa_ssp"          
[65] "Petunia_axillaris"          "Prasinoderma_coloniale"    
[67] "Punica_granatum"            "Physcomitrium_patens"      
[69] "Prunus_persica"             "Pisum_sativum"             
[71] "Papaver_somniferum"         "Populus_trichocarpa"       
[73] "Quercus_lobata"             "Rosa_chinensis"            
[75] "Rhododendron_simsii"        "Striga_asiatica"           
[77] "Salvia_bowleyana"           "Salix_brachista"           
[79] "Simmondsia_chinensis"       "Sechium_edule"             
[81] "Sequoiadendron_giganteum"   "Sapria_himalayana"         
[83] "Solanum_lycopersicum"       "Selaginella_moellendorffii"
[85] "Schrenkiella_parvula"       "Solanum_pennellii"         
[87] "Solanum_tuberosum"          "Selenicereus_undatus"      
[89] "Trochodendron_aralioides"   "Tarenaya_hassleriana"      
[91] "Trifolium_pratense"         "Tripterygium_wilfordii"    
[93] "Utricularia_gibba"          "Vaccinium_macrocarpon"     
[95] "Vigna_mungo"                "Vanilla_planifolia"        
[97] "Vitis_vinifera"             "Zea_mays"

HajkD commented 1 year ago

Dear @kullrich

Wow, this is absolutely wonderful!! Thank you so so much for this!

Would you like to send this as PR to the phylomaps repo or should @LotharukpongJS or I add it there (or into phylomapr)?

I am so happy to see that the Gene Age Inference community and datasets are gaining traction again and I hope together with myTAI a lot of wonderful new studies can be enabled.

With many thanks and very best wishes, Hajk