PMID:12485443 and others used for lipid metabolism (annotating readouts)

Reviewing the annotation intersection violations alerted us to:

https://amigo.geneontology.org/amigo/reference/PMID:12485443 This paper describes plasma membrane lipid profiling, and as such isn't suitable for making an annotation to "lipid metabolism". There is no evidence for a role in lipid metabolism; the paper is really only looking at plasma membrane lipid composition.

The assayed gene Bloc1s6 (HSP6) has a role in lysosomal transport, and in the formation of specialized organelles of the endosomal-lysosomal system, such as melanosomes and platelet dense granules. Bloc1s6 appears to be involved in protein sorting from the endosome, and the annotations below are indirect (readouts):

Bloc1s6 protein transmembrane transport PMID:28576874 IMP 20231205
Bloc1s6 glutamate metabolic process PMID:28701731 IMP 20231212
Bloc1s6 phospholipid metabolic process PMID:28701731 IMP 20231212
Bloc1s6 glutamine metabolic process PMID:28701731 IMP 20231212
Bloc1s6 gene expression PMID:28701731 IMP 20231212
Bloc1s6 adenosine metabolic process PMID:28701731 IMP 20231212
Bloc1s6 amino acid metabolic process PMID:28701731 IMP 20231212
Bloc1s6 gene expression PMID:30498428 IMP PMID:30498428 IMP 20231212
Bloc1s6 lipid metabolic process PMID:30710063 IMP 20231213
Bloc1s6 multicellular organism growth PMID:19381131 IMP 20231204
Bloc1s6 response to activity PMID:19381131 IMP 20231204
Bloc1s6 gene expression PMID:16439805 IMP 20231129
Bloc1s6 protein targeting PMID:16760431 IMP 20231129
Bloc1s6 bradykinin biosynthetic process PMID:12485443 IMP 20231129
Bloc1s6 intracellular nitric oxide homeostasis PMID:12485443 IMP 20231129
Bloc1s6 ATP metabolic process PMID:12485443 IMP 20231129

These would be OK to make phenotype annotations in your local system, but are not suitable for GO annotation where we are trying to capture the direct roles of gene products.

@LiNiMGI could you assign the ticket to the originating curator, and refer them to the guidelines for making GO annotations using the IMP evidence code and annotating phenotypes: https://wiki.geneontology.org/Annotating_from_phenotypes. I also find the book chapter written by Sylvain and Pascale in the GO handbook very useful: https://link.springer.com/protocol/10.1007/978-1-4939-3743-1_4, particularly section 2.4.3, Phenotypes.

Please let us know if you have any concerns.

CC @pgaudet

Doesn't the use of the qualifier 'acts upstream of or within' mean that the causal influence of these genes' functions on the annotated process can be direct or indirect?

OK, I did not see that qualifier in AmiGO. But I was under the impression that we did not make indirect annotations if we already know the role of the protein, and that the "causally upstream" qualifier was more of a placeholder (at least, that is what I have been told when I have asked). So maybe the problem here is imprecise guidelines for annotation best practice?

The problem, as I see it, is that any mutant in the trafficking system is going to perturb lipid and amino acid levels (the trafficking system is responsible for trafficking amino acid transporters to the surface). So theoretically almost any part of the trafficking system is going to affect lipid and amino acid metabolism indirectly. If we make all of these annotations we lose any annotation specificity, and we aren't really using GO to describe the process that a gene product is part_of. Of course, we could filter out all of the causally upstream annotations, but who would know to do that? I don't see these relations displayed in AmiGO or in UniProt.
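To make the filtering point concrete, here is a minimal sketch of what a user would have to do themselves, assuming they work from a GAF 2.2 file, where column 4 holds the qualifier/relation and column 5 the GO ID. The rows, gene symbols and references below are made up for illustration (they are not real annotations), and `direct_annotations` is a hypothetical helper, not part of any GO tool.

```python
# Made-up GAF 2.2-style rows (tab-separated, first 9 columns only).
# Gene symbols, IDs and PMIDs are placeholders, not real annotations.
SAMPLE_GAF = "\n".join([
    "!gaf-version: 2.2",
    "\t".join(["MGI", "MGI:0000001", "GeneA", "involved_in",
               "GO:0006629", "PMID:0000001", "IMP", "", "P"]),
    "\t".join(["MGI", "MGI:0000002", "GeneB", "acts_upstream_of_or_within",
               "GO:0006629", "PMID:0000002", "IMP", "", "P"]),
    "\t".join(["MGI", "MGI:0000003", "GeneC", "acts_upstream_of_or_within",
               "GO:0010467", "PMID:0000003", "IMP", "", "P"]),
    "\t".join(["MGI", "MGI:0000004", "GeneD", "NOT|involved_in",
               "GO:0006629", "PMID:0000004", "IMP", "", "P"]),
])


def direct_annotations(gaf_text, keep=("involved_in",)):
    """Return (symbol, GO ID) pairs whose qualifier is in `keep`.

    Header lines (starting with '!') and negated (NOT) annotations
    are skipped.
    """
    pairs = []
    for line in gaf_text.splitlines():
        if not line or line.startswith("!"):
            continue
        cols = line.split("\t")
        qualifiers = cols[3].split("|")  # column 4: qualifier/relation
        if "NOT" in qualifiers:
            continue
        if any(q in keep for q in qualifiers):
            pairs.append((cols[2], cols[4]))  # symbol, GO ID
    return pairs


direct = direct_annotations(SAMPLE_GAF)
print(direct)
```

With these made-up rows only the GeneA annotation survives the filter. The point is that this step is invisible to anyone who only sees the annotations through a browser that does not display the relation.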

It seems that, for a comprehensive view of biology, it would be better to annotate processes as causally upstream or downstream of each other rather than random individual genes (which could have a negative effect on analysis). It isn't only the indirect nature; there is also good evidence that the "over-annotation" problem obscures enrichments (@pgaudet is going to talk about this at the GO meeting).

I won't be at the GOC meeting, but food for thought. Several years ago I did an 'experiment' with expression sets and enrichment. I don't have those data any more, but I used sets reported in manuscripts for things like genes up-regulated during limb development and muscle differentiation (C2C12 cells, I think it was). I did the enrichment analysis (VLAD) using IMP annotations and then dropping them out. When I included the IMPs, I got significant enrichment in processes that I know are associated with the developmental biology of the tissues used for the over-expression study. When I excluded the IMPs, some of those processes dropped out and some others that I knew should be associated were of lower significance. My conclusion was that the IMP annotations for mouse genes were quite valuable in the enrichment process, specifically identifying processes from real-world expression data, when trying to discover processes associated with over-expressed gene sets.

I'm definitely not making a case for dropping phenotype annotations; detailed phenotypes are very useful to infer normal processes. We just need to be careful about when we apply them (and that we are actually annotating normal processes).

It is also a good point that phenotype annotations can improve enrichments. If a user had a list of genes spread across 2 tightly coupled processes, they are more likely to get an enrichment if the 2 processes have a common label (although one would hope that, if the annotation was comprehensive and the gene set was significant, they would also enrich the 2 processes independently).

You will notice that I did not question any of the developmental terms (like endothelial development or lung alveolar development), or terms which were related to the vesicle transport pathway in a multicellular context (melanocyte formation, cilium formation, neurogenesis etc). Developmental terms are a bit different because clearly many cellular processes are involved in multicellular development. Most cellular pathways probably contribute to these and I am not sure what the boundaries of these processes would be.

All of the terms I queried are cellular processes and the annotations seem to be quite indirect (truly phenotypic readouts). For example, going by its final sentence, the paper used to annotate “lipid metabolism” seems to be largely about “lipid homeostasis” in different compartments (this annotation is also made, and seems more appropriate). I could not see any experiment related to metabolic processes, but maybe I missed it.

PMID:16439805 is used to make an annotation to “gene expression”, which doesn’t seem justified (this is from a single western blot and it is difficult to figure out how they established a 3.5% increase in expression, or why 3.5% is considered significant). There is no evidence that in a "normal" cell this gene would contribute to gene expression, but it is highly likely that in a pathogenic situation the level could be perturbed by 3.5%.

I agree that annotating upstream processes could improve enrichment in some cases, but it depends on what the researcher is trying to establish. To illustrate, fission yeast has ~100 proteins involved in cytokinesis. However, endocytosis affects the late stages of cytokinesis because it is required to deliver the components for septum assembly. ~10 gene products that are multifunctional (regulators of both processes, and some actin cytoskeleton proteins which are components of both the cytokinetic ring and the endocytic machinery) are annotated to both processes. However, I would not annotate the remaining endocytosis components to cytokinesis even though they display a cytokinesis phenotype. If I included all of the genes which affect cytokinesis in the 'cytokinesis bin' it would increase the cytokinesis bin size to over 500. In this case, if you had a set of genes that contained only part of the true cytokinesis set (say 10 out of the true 100), you would likely lose any true enrichment for cytokinesis. This situation would only become worse over time as screens became more saturating.
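As a rough illustration of the bin-size effect, here is a small sketch using a one-sided hypergeometric test, which is one common way enrichment is scored (I am not claiming it is what any particular tool does). The bin sizes of 100 and 500 and the overlap of 10 come from the example above; the genome size of 5000 genes and the study set of 50 genes are assumptions I picked for the sketch, and it also assumes the overlap stays at 10 when the bin is enlarged.

```python
from math import comb


def enrichment_p(population, annotated, sample, overlap):
    """One-sided hypergeometric p-value, P(X >= overlap).

    population: genes in the background
    annotated:  genes annotated to the term (the 'bin' size)
    sample:     genes in the study set
    overlap:    study-set genes that are annotated to the term
    """
    total = comb(population, sample)
    upper = min(sample, annotated)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / total


GENOME = 5000    # assumed background size, for illustration only
STUDY_SET = 50   # assumed size of the user's gene list
OVERLAP = 10     # cytokinesis genes in that list (from the example)

p_strict = enrichment_p(GENOME, 100, STUDY_SET, OVERLAP)  # true ~100-gene bin
p_broad = enrichment_p(GENOME, 500, STUDY_SET, OVERLAP)   # bin inflated to ~500

print(f"bin of 100: p = {p_strict:.3g}")
print(f"bin of 500: p = {p_broad:.3g}")
```

Under these assumptions the same 10 genes give a far weaker p-value against the 500-gene bin than against the 100-gene bin, and after multiple-testing correction the broader bin could easily drop below the reporting threshold.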

A question one could ask: suppose the authors of the Bloc1s6 papers, instead of analysing the effect of 4 genes in the lysosome-endosome transport pathway, had knocked out every gene in the pathway between the lysosome and the other compartments (probably 300-500 proteins), and most of them affected plasma lipid composition in some way (which is highly likely, because that is what membrane trafficking does). Would all of these genes be annotated to lipid metabolism? If the answer is yes in this larger scenario, then I guess these annotations are OK, but if the answer is no, then how does the smaller gene number in this particular experiment change the decision?

A second issue is that because these few genes would only be a small fraction of the number that could be annotated, the "causally upstream or within" annotation is not likely to ever be comprehensive (which is an additional problem for analysis). Whether an annotation is annotated as "causally upstream" seems somewhat arbitrary.

Thirdly, a major use case for GO is retrieving process-based gene lists and querying. If a researcher wanted to retrieve the list of genes involved in cytokinesis, would they want the 100 genes involved in this process, or the 500 genes affecting this process in some way?

The reason I am interested in this problem is that these annotations get picked up by the function prediction community and used to make new predictions that are not part of a process, and it's difficult to explain why this is a problem when the annotations exist in the GO database.

Sorry for the long rambling comment. I think it will be very interesting to look into this in more depth at the upcoming meeting.

geneontology / go-annotation

PMID:12485443 and others used for lipid metabolism (annotating readouts) #5160