FlyBase / GO-curation

For projects related to GO curation in FlyBase

Checklist for new DRSC "Gene Set Enrichment Tool" #37

Closed gantonazzo closed 1 year ago

gantonazzo commented 1 year ago

The tool can be found here: https://www.flyrnai.org/tools/enrichment_tool/web/home/7227

This ticket is to keep track of the various tasks needed before signing off on the tool:

hattrill commented 1 year ago

Subsets from GO release 2022-07-01. Contains the fly and GOC subsets: 2022_SUBSET.xlsx

I think that for the fly enrichment, we could offer both the generic and the fly subsets as options - might as well have the choice to tailor it for flies.

Also, we should think about excluding any direct annotations to GO:0070062 extracellular exosome as this is causing massive problems with CC enrichment in the human set when enriching using the relationships in the ontology to "map up":

[Image: Exosome_issue]

hattrill commented 1 year ago

D.mel complexes in Complex Portal:

https://docs.google.com/spreadsheets/d/1TbRJ9Xv7fhP_6cR0OAxzWnwuQvBSCgNyoCsW-YvYDFk/edit#gid=0

https://docs.google.com/spreadsheets/d/1YkDmmQLezHmq7_yEns3ei4EP4qdJZM5nALgBsauK1yo/edit?pli=1#gid=0

https://www.ebi.ac.uk/complexportal/complex/search?query=*&species=Drosophila%20melanogaster&page=1

So, it looks like it is worth using as a gene set for D.mel

hattrill commented 1 year ago

Phenotypes: Giulia emailed Arzu. Central message: harmonisation efforts for phenotypic data are currently on hold at the Alliance.

hattrill commented 1 year ago

EMAIL CORRESPONDENCE WITH CH:

from HA: While I am thinking about it, I compared the generic GOC subset (formerly known as slims) and the fly subset, and I think that there is some value in having the generic one for all species, but for fly having both the generic and the fly-tailored subsets as a choice.

I also think that it might be worthwhile excluding annotations made directly to GO:0070062 extracellular exosome - this is causing massive problems with CC enrichment in the human set, as UniProt did a huge project annotating exosome datasets - it was never appropriate to use the GO for this! - I have had discussions with the GOC on how to solve this - it always comes out as the top enriched term and, because of the structure of the ontology, it also makes it look like all these proteins are extracellular! I've had a couple of cases where I've had to help human researchers interpret their enrichment results because of this. This is not a problem for most other species - but it causes real confusion with human CC data.
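As a rough illustration of the exclusion described above, here is a minimal Python sketch of dropping direct annotations to GO:0070062 before "mapping up" through the ontology. The parent links, gene names and function names are all made up for the example (the toy graph is deliberately simplified and is not the real GO structure), and this is not how the DRSC tool is implemented.

```python
# Direct annotations to these terms are dropped before propagation.
EXCLUDED_DIRECT = {"GO:0070062"}  # extracellular exosome

# Toy child -> parents links (simplified for illustration, NOT the real GO graph).
PARENTS = {
    "GO:0070062": ["GO:0005576"],
    "GO:0005739": ["GO:0005737"],
    "GO:0005576": [],
    "GO:0005737": [],
}

def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def map_up(direct, parents, excluded=EXCLUDED_DIRECT):
    """Build term -> gene set, propagating each kept annotation to its ancestors."""
    gene_sets = {}
    for gene, term in direct:
        if term in excluded:
            continue  # skip the direct annotation so it never reaches the parents
        for t in {term} | ancestors(term, parents):
            gene_sets.setdefault(t, set()).add(gene)
    return gene_sets

# Hypothetical direct annotations (gene, term).
DIRECT = [
    ("geneA", "GO:0070062"),
    ("geneB", "GO:0005739"),
    ("geneC", "GO:0005576"),
]

filtered = map_up(DIRECT, PARENTS)
unfiltered = map_up(DIRECT, PARENTS, excluded=set())
```

In this toy case, geneA shows up under the parent term only in `unfiltered`, which mirrors the "everything looks extracellular" effect; genes annotated to the parent term by other routes (geneC) are kept either way.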

We also discussed the ComplexPortal as a geneset and we think that this would be worthwhile. I am having a look over the fly data in CP at the moment to cross-check it with our complexes.

For phenotypes: Giulia emailed Arzu who is on the Alliance Working group for this. The central message is that harmonisation efforts for phenotypic data are currently on hold at the Alliance. So perhaps this is a "for the future" thing and "not a now" thing.

Reply: CH

1.) SLIM: we will keep 2 SLIM sets for fly

2.) I just looked at our database and confirmed that "GO:0070062" is indeed very different for human than other species. There are about 2k human genes associated with this GO while only a few hundred for mouse and 2 digits for fly and zebrafish. We will remove the human gene sets associated with this term.

3.) We will add the complex annotation from ComplexPortal.

4.) Phenotype annotation: adding phenotype gene sets for other species will happen in the future when AGR provides such annotation. Shall we consider adding the fly phenotype sets? If so, we will need some help getting the info from FlyBase.

Reply: HA

Thank you so much for the message!

1.) SLIM: we will keep 2 SLIM sets for fly

Great. Attaching xls sheet with current slims for the generic GOC subset and our fly one.

The .obo files can be found on the GO download page and are kept up to date with changes in the GO: http://geneontology.org/docs/download-ontology/
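For anyone wanting to pull a subset straight from an .obo file rather than the spreadsheet, the following is a minimal Python sketch that collects the [Term] stanzas carrying a given `subset:` tag. The sample text, the `EX:` IDs and the function name are placeholders, and the subset tag names (`goslim_generic`, `goslim_drosophila`) are an assumption that should be checked against the `subsetdef` lines in the downloaded file.

```python
def terms_in_subset(obo_text, subset):
    """Return {term_id: name} for non-obsolete [Term] stanzas tagged with `subset`."""
    found = {}
    for stanza in obo_text.split("\n["):
        lines = [ln.strip() for ln in stanza.splitlines()]
        # Skip the file header and non-Term stanzas such as [Typedef].
        if not lines or lines[0].lstrip("[") != "Term]":
            continue
        term_id, name, subsets, obsolete = None, None, set(), False
        for ln in lines[1:]:
            if ": " not in ln:
                continue
            tag, value = ln.split(": ", 1)
            if tag == "id":
                term_id = value
            elif tag == "name":
                name = value
            elif tag == "subset":
                subsets.add(value.split()[0])  # ignore any trailing comment
            elif tag == "is_obsolete" and value.startswith("true"):
                obsolete = True
        if term_id and not obsolete and subset in subsets:
            found[term_id] = name
    return found

# Placeholder OBO text; IDs and names are invented for the example.
SAMPLE = """format-version: 1.2
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_generic "Generic GO slim"

[Term]
id: EX:0000001
name: example term one
subset: goslim_generic

[Term]
id: EX:0000002
name: example term two
subset: goslim_drosophila
subset: goslim_generic

[Term]
id: EX:0000003
name: example term three

[Typedef]
id: part_of
"""

generic = terms_in_subset(SAMPLE, "goslim_generic")
fly = terms_in_subset(SAMPLE, "goslim_drosophila")
```

With a real download, `obo_text` would be the contents of the .obo file read from disk; a dedicated OBO parser would be a more robust choice than this hand-rolled loop.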

2.) I just looked at our database and confirmed that "GO:0070062" is indeed very different for human than other species. There are about 2k human genes associated with this GO while only a few hundred for mouse and 2 digits for fly and zebrafish. We will remove the human gene sets associated with this term.

I will try to make a case for the GOC to do something about this, but it might take some time, so this sounds like the best solution.

3.) We will add the complex annotation from ComplexPortal.

Great. This set has 109 complexes for flies at the moment, and this is growing as Sandra Orchard adds more of the FlyBase complexes. Other species are better populated, but we've only just got round to working with them: https://www.ebi.ac.uk/complexportal/complex/organisms

4.) Phenotype annotation: adding phenotype gene sets for other species will happen in the future when AGR provides such annotation. Shall we consider adding the fly phenotype sets? If so, we will need some help getting the info from FlyBase.

I think that we should park the phenotypes work for the moment - think that there will be a solution produced by the various ontology harmonization projects and we'd be better off waiting for them.

One thing that did occur to me, is a Disease Ontology enrichment. Not sure if it would be worth it, but we could look into this.

All the MODs are using the DO: https://www.alliancegenome.org/disease/DOID:4 ; 'All disease associations' download can be found here: https://www.alliancegenome.org/downloads

And there is a common subset used for ribbon displays:

e.g. http://flybase.org/reports/FBgn0000490#hdm

gantonazzo commented 1 year ago

Jira ticket in place for Jim/IU for the pre-population of the enrichment tool input gene list from FB: https://flybase.atlassian.net/browse/WEB-2028

hattrill commented 1 year ago

Linking to first round of improvements sent to Claire https://docs.google.com/document/d/1U6l7OWG-5t3st-i2HZNDG0ObH5MBhh6LDS67MS4tC98/edit

hattrill commented 1 year ago

Linking to Cam group meeting discussing barriers to using the pheno file for us and pheno file reforms in general

hattrill commented 1 year ago

Gil has made a new pheno file as specified in https://docs.google.com/presentation/d/1Ad8eX_KMf7U5wqzzOaYZX5-DvTVUcJ-bmS5aCVK4EUs/edit#slide=id.g16ce49be32b_0_2 and on Jira DB-813: genotype_phenotype_report_fb_2022_06_reporting_2.tsv.gz

hattrill commented 1 year ago

What FBcv terms should be used for the enrichment tool?

File of 'single allele Dmel FBcv' only ("single_nobal_noFBti_v2_NoTn_NoANAT_Dmel"): single_nobal_noFBti_v2_NoTn_NoANAT_dmel.txt

Only terms under phenotype FBcv:0001347

hattrill commented 1 year ago

Flattened list, but only to 95 terms: phenotypic_class_nameid_subset.txt

Give this set to DRSC. They can try using only terms under phenotype FBcv:0001347 and this subset.
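To make "only terms under phenotype FBcv:0001347" concrete, here is a minimal Python sketch that collects every descendant of a root term from child-to-parent links and then keeps only annotations to those terms. Apart from the root ID, all IDs, allele names and function names are invented for the example; the real links would come from the FBcv ontology file.

```python
ROOT = "FBcv:0001347"  # phenotype

# Toy child -> parents links; the EX: IDs are placeholders, not real FBcv terms.
IS_A = {
    "FBcv:0001347": [],
    "EX:0000010": ["FBcv:0001347"],
    "EX:0000011": ["EX:0000010"],
    "EX:0000012": ["EX:0000010", "EX:0000020"],
    "EX:0000020": [],
    "EX:0000021": ["EX:0000020"],
}

def descendants(root, is_a):
    """Return all terms below `root` (excluding `root` itself)."""
    # Invert child -> parents into parent -> children.
    children = {}
    for child, parents in is_a.items():
        for parent in parents:
            children.setdefault(parent, set()).add(child)
    seen = set()
    stack = [root]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def restrict(annotations, allowed):
    """Keep only (allele, term) pairs whose term is in `allowed`."""
    return [(allele, term) for allele, term in annotations if term in allowed]

under_phenotype = descendants(ROOT, IS_A)

# Hypothetical (allele, term) annotations.
ANNOTATIONS = [
    ("alleleA", "EX:0000011"),
    ("alleleB", "EX:0000021"),
    ("alleleC", "EX:0000012"),
]
kept = restrict(ANNOTATIONS, under_phenotype)
```

The same `restrict` step could be applied a second time with the 95-term flattened subset as `allowed`, if DRSC want to compare the two options.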