Review HTP annotations - Githubissues

pgaudet commented 7 years ago

Hello,

As an action of the Oct 2017 GOC meeting, ECO has created evidence codes for HTP experiments.

We (@hattrill and the HTP working group) will provide annotation guidelines; once these are available, we will ask each group to review papers that we have flagged as potentially HTP:

https://docs.google.com/spreadsheets/d/11xExGJfj_39xPQUGkam3Xvtd6dtZ5DfANXhM2ZtDYB0/edit#gid=0

More details on this task to come later.

Thanks, Pascale

hattrill commented 7 years ago

I have made a new sheet on the HTP paper google spreadsheet for potential HTP papers annotated using the GO. https://docs.google.com/spreadsheets/d/11xExGJfj_39xPQUGkam3Xvtd6dtZ5DfANXhM2ZtDYB0/edit?ts=58d39700#gid=2144791301 This was generated by splitting out the experimental evidence codes and collating the papers with >40 lines of annotation per paper/per evidence code. For most, I have been able to do this directly from the GAF they submit to the GOC so that any DB-specific refs are included. For a few others, indicated on the sheet, I had to download them from QuickGO - I think most of these use Protein2GO, so won't be amplified by the many:1 gene mapping.
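The collation described above could be sketched roughly as follows. This is only an illustration of the idea, not the script that was actually used: the function name `flag_candidate_papers` is hypothetical, the column positions follow the GAF 2.x layout (reference in column 6, evidence code in column 7), and counting each pipe-separated reference separately is an assumption.

```python
from collections import Counter

# Experimental evidence codes as they appear in GAF column 7.
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}


def flag_candidate_papers(gaf_lines, threshold=40):
    """Return {(reference, evidence_code): count} for pairs with more than
    `threshold` annotation lines. Hypothetical helper, for illustration only."""
    counts = Counter()
    for line in gaf_lines:
        # Skip blank lines and GAF header/comment lines.
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue
        evidence = cols[6]
        if evidence not in EXPERIMENTAL_CODES:
            continue  # e.g. IEA lines are ignored
        # Assumption: column 6 may hold several pipe-separated references
        # (PMID plus a DB-specific ref); each one is counted.
        for ref in cols[5].split("|"):
            counts[(ref, evidence)] += 1
    return {key: n for key, n in counts.items() if n > threshold}
```

Calling `flag_candidate_papers(open("some_group.gaf"))` (file name hypothetical) would give one entry per paper and evidence code above the cut-off, which is the shape of the rows on the sheet.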

The contributors looked at are: AgBase, ARUK-UCL, BHF-UCL, dictyBase, EcoliWiki, FlyBase, MGI, MTBBASE, ParkinsonsUK-UCL, PomBase, RGD, SGD, TAIR, UniProt, WB, ZFIN (these were selected from what had already been seen when looking at HTP data in other ways). I've included columns for adding whether these papers are HTP and if they have been reviewed.

If these are ok, we can contact the groups and ask them to review the annotations.

Added a condensed version of guidelines cutting out some of the background, etc. https://docs.google.com/document/d/1_T5FarM7eddFqO7DWooP5UOx1vw4H6lBMktGc62uFIY/edit#

pgaudet commented 7 years ago

Thanks @hattrill, this is great! I made a few edits (in suggestion mode) and made a few comments.

What's the next step? Should we review this with the HTP group?

ValWood commented 7 years ago

PomBase: Checking these, I think for most we are happy that the experiments are hypothesis driven and the controls are adequate (these papers just have lots of data). For most of these it's a large number of annotations to a single paper, but most are to different terms, or variable extensions. We will use the new HTP codes for the HTP localization study. For a couple of more recent studies we need to double-check the methods; these will probably migrate to HTP codes too.

PomBase PMID:27984725 191 EXP https://www.pombase.org/reference/PMID:27984725
PomBase PMID:16823372 6892 IDA will migrate to HTP
PomBase PMID:19040720 89 IDA https://www.pombase.org/reference/PMID:19040720
PomBase PMID:20970342 88 IDA https://www.pombase.org/reference/PMID:20970342
PomBase PMID:20838651 77 IDA https://www.pombase.org/reference/PMID:20838651
PomBase PMID:21386897 72 IDA https://www.pombase.org/reference/PMID:21386897
PomBase PMID:10759889 61 IDA need to check methodology
PomBase PMID:22146723 61 IDA need to check methodology
PomBase NA NA IEP
PomBase NA NA IGI

sabrinatoro commented 7 years ago

@hattrill I have reviewed the annotations from ZFIN. None of these are from HTP, therefore these annotations should remain as such. [note: 3 of these pubs are from data loads] Please let me know if you have any questions.

hattrill commented 7 years ago

Thanks for taking the time to look! Sorry about the IEAs creeping through, I somehow missed a step in the processing of your GAF that was supposed to have stopped that!

Helen

ValWood commented 7 years ago

We have changed our local setup to accept the new evidence codes. One question before we do the migration: the codes are not yet documented on the GO website. Will they be accepted if we use them?

pgaudet commented 7 years ago

Good question! There is 1 annotation in QuickGO (!) and I cannot find the evidence codes in AmiGO. @tonysawfordebi @kltm is everything in place ?

Thanks, Pascale

tonysawfordebi commented 7 years ago

There are lots more than that; it looks like the most recent set of annotations hasn't been indexed yet in QuickGO. They should be available either later today or tomorrow.

cmungall commented 7 years ago

"I cannot find the evidence codes in AmiGO"

It doesn't display in the evidence facet because there are no annotations that use it yet. Once there are, you'll see it there. No actual changes need to be made in AmiGO; it is entirely driven by ECO.

pgaudet commented 7 years ago

OK great - so we can go ahead and change all annotations.

Thanks, Pascale

ValWood commented 7 years ago

evidence codes need adding to current docs too

pgaudet commented 7 years ago

This is in progress: http://wiki.geneontology.org/index.php/Guide_to_GO_Evidence_Codes

hattrill commented 7 years ago

@cmungall @kltm I think that the filter-gene-association.pl script will need to be updated as well.

kltm commented 7 years ago

Re: https://github.com/geneontology/go-annotation/issues/1655#issuecomment-346587315 @cmungall Or, I assume, that we will just worry about any needs in the new pipeline?

pfey03 commented 6 years ago

All confirmed Dicty HTP annotations have been updated with new ECO in P2GO thanks to Tony's help. Thanks Helen for the list!

srengel commented 6 years ago

Mike was working on updating the filter-gene-association.pl script on November 13.

hattrill commented 6 years ago

Brilliant. Thanks, @srengel! Checked svn and it is in the script.

mah11 commented 6 years ago

PomBase annotations reviewed, and updated where appropriate in GAF committed today.

rachhuntley commented 6 years ago

Hi, Can I get opinions on a HTP paper I'm checking, please?

The paper (PMID:23979707) identifies proteins in the extracellular matrix by SILAC-based proteomics. FDR was set to 1% (acceptable in the guidelines). 128 proteins were identified; 9 of these can be discounted because they have only 0 or 1 unique peptide (the guidelines require 2 or more). However, 25 proteins are ribosomal (only 1 can be discounted by <1 unique peptide), 2 are elongation factors, one is a histone and one is a mitochondrial import receptor subunit. In our correspondence with the author, she said “Regarding the endothelial cell matrixome, there are 128 proteins which constitute 90% of the matrix proteome (See column H in Table S6). I think it is fair to annotate those to the GO:0031012 extracellular matrix. Although I would like to highlight that some of the identified proteins, e.g. ribosomal proteins, are not strictly extracellular matrix components, but are proteins that can be deposited in the extracellular matrix by the cells.” (In the paper the author refers to known secreted proteins, including HIST1H4A).

The experiment seems robust, but we're not very happy about cherry-picking the data and only curating those that we expect to be in the ECM, i.e. not annotating the ribosomal proteins etc. to this term, but if we include them they are likely to be flagged as incorrect somewhere down the line.

What would you suggest?

Thanks, Rachael.

hattrill commented 6 years ago

Looking at the set with >2 unique peptides that are marked as a "matrixome" component, there are a lot of intracellular proteins.

From the experimental protocol, I would describe this as an enrichment rather than a purification. Although, I am not quite sure how they managed to get quite so many intracellular proteins......perhaps the cell removal or matrix washing steps?

I would not annotate this set. Cherry picking would just be circular reasoning, and they don't provide any experiments that would validate the results.

ValWood commented 6 years ago

I agree: if a discovery-driven experiment clearly has a high number of false positives, and no way can be identified to clean them up, the annotations will probably cause more problems than they add value.

But maybe the problem in this particular case is that common contaminants (highly expressed genes) have not been excluded? For instance, I am reliably informed by the proteomics community that some proteins are expressed at such a high level that, with modern sensitive techniques, they appear in every experiment. These are routinely screened (point 6 above). Maybe a similar situation exists here, and if so, applying a filter for these highly expressed proteins in some situations might not really be cherry picking, but just good practice?

A couple of our community have mentioned to me that shared lists of such common contaminants would be valuable, and I know most proteomics groups have their own lists (although these may need to be developed on a species-by-species basis). This might be something the QC group could develop in the longer term.

I wanted to mention this because in some situations there will be a fine line between cherry-picking and judicious application of filters.....but it all comes down to curator judgement....
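As a rough illustration of what applying such a filter could look like, here is a minimal sketch. It assumes a shared contaminant list is available as a plain collection of protein identifiers; the function name and the identifiers are hypothetical, and no particular contaminant resource or file format is implied.

```python
def split_by_contaminants(identified, contaminants):
    """Partition identified protein IDs into (kept, flagged).

    `contaminants` is assumed to be a shared list of common contaminant
    identifiers; flagged hits are returned rather than silently dropped so
    a curator can still review them.
    """
    contaminant_set = set(contaminants)
    kept = [p for p in identified if p not in contaminant_set]
    flagged = [p for p in identified if p in contaminant_set]
    return kept, flagged


# Placeholder identifiers, for illustration only.
kept, flagged = split_by_contaminants(
    ["PROT_A", "PROT_B", "PROT_C"],
    ["PROT_B", "PROT_Z"],
)
```

Returning the flagged set separately keeps the decision with the curator, which fits the point that this comes down to judgement rather than an automatic rule.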

hattrill commented 6 years ago

Hi Val, the "CRAPome" https://reprint-apms.org/?q=chooseworkflow is, I think, probably close to what you are thinking of.

ValWood commented 6 years ago

yes! ...there is an 'ome for everything!

hdrabkin commented 6 years ago

Just an update: I have looked at the papers and, outside of questions for a couple, will be converting. Since these involve many annotations at once, I will need our SEs to run scripts to change them en masse. This will take a while depending on ticket triage. FYI, there are at least two annotations from GOA that have these codes, but are not in AmiGO yet as they are recent. They should be in after my Friday commit.
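A bulk conversion of this kind could be sketched along the following lines. This is a hedged illustration working on GAF-formatted lines, not the script any group actually ran: `migrate_line` is a hypothetical helper, and the set of reviewed HTP references is assumed to come from the review spreadsheet. The mapping covers EXP, IDA, IMP, IGI and IEP; IPI is left alone because it has no high-throughput counterpart.

```python
# Experimental evidence codes and their high-throughput counterparts.
# IPI is deliberately absent: there is no HTP equivalent for it.
HTP_EQUIVALENT = {
    "EXP": "HTP",
    "IDA": "HDA",
    "IMP": "HMP",
    "IGI": "HGI",
    "IEP": "HEP",
}


def migrate_line(line, htp_references):
    """Return a GAF line with its evidence code switched to the HTP
    equivalent when the line cites a paper confirmed as HTP.

    `htp_references` is an assumed set of reference IDs (e.g. PMIDs) that
    curators have marked as HTP during review.
    """
    if line.startswith("!"):
        return line  # header/comment lines pass through unchanged
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 7:
        return line
    refs = set(cols[5].split("|"))  # column 6: pipe-separated references
    if refs & htp_references and cols[6] in HTP_EQUIVALENT:
        cols[6] = HTP_EQUIVALENT[cols[6]]  # column 7: evidence code
        return "\t".join(cols) + ("\n" if line.endswith("\n") else "")
    return line
```

Applied over a whole file, e.g. `[migrate_line(l, reviewed) for l in gaf_lines]` with `reviewed` holding the confirmed references, lines citing other papers or using codes outside the mapping are returned untouched.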

rachhuntley commented 6 years ago

Regarding the paper for the ECM proteome (PMID:23979707) discussed above, I contacted the author of this paper again to see if she had any additional evidence that would either confirm or discount the ribosomal proteins in the ECM.

She replied saying that they have seen ribosomal proteins in the ECM and secreted medium a few times, so it seems she doesn’t consider this artefactual.

Quote: "We have analysed by MS also ECM produced by human normal and cancer-associated fibroblasts in culture and found ribosomal proteins in there too. You can find the data in Supplementary Data 1 in Hernandez-Fernaud et al. Nat Comms 2017 (https://www.nature.com/articles/ncomms14206). Similarly we found ribosomal proteins in the ECM produced by fibroblast lines in collaborative unpublished works."

I also looked in PubMed for more evidence of extracellular ribosomal proteins and found this paper (http://onlinelibrary.wiley.com/doi/10.1002/jcp.25898/full) that says "Interestingly, many ribosomal proteins were detected in the nonmin-ECM, meaning that a high translational activity was ongoing in the MSCs that were producing the ECM and these proteins stick to the ECM despite the extensive washings.”

And also this one (https://www.ncbi.nlm.nih.gov/pubmed/28196878) "Several ribosomal proteins were highly abundant in the Extracellular Vesicle fraction upon infection, and our data strongly suggest that secretion of translational machinery and concomitant inhibition of translation are important parts of host response against Gram-negative bacteria sensing caspase-4/5 inflammasome."

So there is evidence of secreted ribosomal proteins from other proteomic studies.

Additionally, exosomes have been shown to contain several of the ribosomal proteins found in the original paper and https://www.nature.com/articles/ncomms8164 describes how secretion of exosomes is required for cell movement "at any given time there may be a pool of matrix-carrying exosomes in multivesicular late endosomes that could be rapidly secreted and used for migration.” So it is possible that ribosomal proteins get outside of the cell in this manner.

Having discussed with @RLovering, we are loath to take these annotations out. I will obviously convert them to the HDA code and I can attach a comment in Protein2GO detailing all that I’ve found out. We can review this periodically and revise the annotations if newer, better evidence is published.

ValWood commented 6 years ago

It still seems odd to me to annotate to "extracellular matrix" (even if they are secreted, that doesn't make them a matrix component, does it?)

There is plenty of opportunity for ribosomal proteins to end up outside the cell (with 10,000,000 ribosomes per cell and 200 proteins, that's 2000 million opportunities per cell!)

Has anybody ever not seen ribosomal proteins in a proteomics experiment?

rachhuntley commented 6 years ago

The experiment in the paper used an ECM preparation, so they aren't just looking at secreted proteins; these were just other examples I found. And yes, I've seen proteomics experiments without ribosomal proteins.

RLovering commented 6 years ago

Philosophically speaking I guess we should be careful about having too many preconceived ideas about how cells work, scientists are not supposed to ignore data just because it doesn't fit with their expectations - haha - have a good weekend

ValWood commented 6 years ago

"I've seen proteomics experiments without ribosomal proteins."

Pre-processed or post-processed? All of the "unfiltered" results I have seen for pombe have included ribosomal proteins and other highly expressed proteins (admittedly this is not ECM, but... a small number of lysed cells would be enough...)

"Philosophically speaking I guess we should be careful about having too many preconceived ideas about how cells work, scientists are not supposed to ignore data just because it doesn't fit with their expectations - haha - have a good weekend"

Yes, but GO is about "knowledge" not "data".....biologists may provide data, and even detailed models, but that does not mean that every piece of data is "GO worthy".

To me the role of a curator is (partly) to validate, synthesize and integrate (otherwise I wouldn't be doing it!). We are not peer reviewers, but we really need to consider whether a HTP dataset is useful to GO in its entirety, or if filters need to be applied. This is something we need to do more for GO as time progresses and is probably worthy of discussion at a future meeting.

Maybe ribosomal proteins do have a role outside of the cell, but they are nearly always present in a proteomics experiment, and there are no papers suggesting a functional role (is there any evidence beyond speculation? If so we can revise for sure).

Blimey now I'm really confused ... I'll try to have a good weekend though ;)

ValWood commented 6 years ago

I just asked Kathy Gould my go-to proteomics person https://medschool.vanderbilt.edu/visp/person/kathy-gould-phd

==

Happy New Year to you as well! It is a quick question. We always have some level of ribosomal protein contaminants. Cheers, Kathy

== Re: Quick question (really!)

On 1/12/18, 1:50 PM, "Valerie Wood" vw253@cam.ac.uk wrote:

Happy New Year!
Do you ever do proteomics experiments and not have ribosomal protein contaminants? Best, Val

ValWood commented 6 years ago

Kathy has done masses of small scale and HTP proteomics experiments on pombe...

hattrill commented 6 years ago

Looking at the protocol, I still think that this was an enrichment, rather than a purification. If there was some further validation or cross-correlation with another approach, then I would say fine.

Ribosomal proteins are frequently seen in mass spec experiments, so I don't think that their presence should be a criterion for exclusion but, if they are excessively represented in the sample, that should ring alarm bells. Looking at table 6 http://www.mcponline.org/content/12/12/3599/suppl/DC1 there are also many other cytosolic proteins in the mix. Do these represent exosome contents or contaminants?

Bear in mind that the definition of GO:0044420 "extracellular matrix component" specifies that it is a constituent of the ECM.

Speaking for what we have done at FlyBase, I have dropped many enrichment datasets that were labelled with GO terms, including all the plasma membrane preps.

ValWood commented 6 years ago

I agree...this dataset has 1300 proteins and it includes all of the usual contaminants (most universally highly expressed proteins including ribosome, proteasome, translation elongation factors and glyceraldehyde-3-phosphate dehydrogenase to name a few....).

Kathryn Lilley, director of the Cambridge Centre for Proteomics, also says mass spec studies rarely have no ribosomal contaminants.

rachhuntley commented 6 years ago

So, firstly we have only annotated 118 proteins with ECM, I don’t know what table you are looking at Val, with 1300 proteins.

To answer Helen’s question. There are 25 proteins out of the 118 that have been annotated to ECM from another paper and 99 proteins that have been annotated to exosome/extracellular space or region.

The fact that only ~20% of the proteins overlap with other proteins annotated to the ECM is not surprising, since the ECM is dynamic and can vary depending on cell/tissue type. Quote: "On the basis of the relative amounts and organization of the different ECM components, this molecular scaffold is peculiar for each tissue and reflects the specific functions required for the cells present in that tissue." from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4081568/

It is because of this fact that we are so eager to capture these different studies. Not only are we annotating to ECM, we are also including the cell or tissue type that the ECM is part of. It is these details that cardiovascular scientists have requested us to capture from the literature to help them in their analyses. For example, not all blood vessels develop atherosclerosis, and it is thought that the ECM may influence which vessels do. Our annotations will capture the type of blood vessels that these proteins are associated with. Would it help if we added the colocalizes_with qualifier (although I guess this is a whole other ongoing discussion!)?

If ribosomal proteins are a common contaminant, we are happy to remove these from this set, but there should be a statement in the guidelines that ribosomal proteins should not be annotated from any proteomics experiment unless there is good justification for it.

As for enrichment vs. purification, we are unsure of the definition for each of these – when does an enrichment become a purification? We would argue that if the authors' preparation is not pure enough, we would expect to see a lot more than 118 proteins in this extract.

Finally, I am certainly no expert in proteomic techniques, whereas the lab this paper has come from are (M. Mann), so I am not willing to go back to them to say that we don’t trust their data (or the reviewers of their papers).

ValWood commented 6 years ago

"So, firstly we have only annotated 118 proteins with ECM, I don’t know what table you are looking at Val, with 1300 proteins."

I was looking at table 6 http://www.mcponline.org/content/12/12/3599/suppl/DC1

I now see they select the top 128 based on "LFQ Median/Total Sum (%)", but it is not clear whether this takes account of likely contamination by highly abundant proteins.

I'm not saying that there is a problem with the experiment, or the review process, but that this type of dataset might not be "GO worthy". Especially if it represents pathological processes? If it represents a normal process, it would be nice to see some ribosomal proteins localized to ECM by an alternative method. To me this does seem like a purely "discovery" type experiment.

RLovering commented 6 years ago

Hi Val

Sorry, I have 3 deadlines that involve papers and grants this week so I don't have time to look at this myself in detail. However, I would have thought that all purification experiments that provide information about cellular components are "discovery" type experiments. I am assuming that purification of a CC followed by western with specific antibodies to what is known to be in the CC is not discovery, but if this is followed by other westerns with antibodies to proteins not previously known to be in the CC then would this be discovery or not? If someone cuts multiple protein bands out of a gel, following a purification step, and sequences each one, is this discovery or not? And is this HTP or small scale?

Of the papers that have been agreed to be annotated using the HTP code are any of these capturing CC data? And if so are any based on purification followed by sequencing or western?

Thanks

Ruth

ValWood commented 6 years ago

It just seems odd to me as a curator to use a dataset where the threshold is positioned at a point which seems quite arbitrary and most likely includes many false positives. Take a look at the scores just above and below the cut-off. Using a different metric could dramatically change what was included.

For small scale, there is more confidence, as there is almost always some triangulation of other data types used to make a GO annotation from a mass spec experiment (reciprocal experiments, homology and known complex membership etc.). These are usually hypothesis driven, and the presented results are usually conservative. We would only annotate the very high confidence results.

We would (and often do) ask the authors for a threshold that excludes likely false positives, or not include the data for GO.

I'm not sure what the solution is.... We have a similar problem with our HTP localization dataset. However the error rate is pretty low (less than 5%), and we then filter known false positives once we are aware of them. https://curation.pombase.org/dumps/latest_build/pombe-embl/external_data/external-go-data/GO_ORFeome_localizations2_deleted.txt

Longer term for GO, I believe we need a mechanism to deal with things which are known to be incorrect or unlikely to be correct: either not annotating for GO, improved thresholds, or known FP filtering.

ValWood commented 6 years ago

So we had a PhD student in the Oliver lab trying to improve compartmental proteomics as part of a collaboration with the core proteomics facility. These experiments have a very large FP rate even for organelles that can be separated well. I wouldn't include them for GO. I know SGD have included a mitochondrial HTP one, but if I remember correctly this was high confidence because it included 3 orthogonal methods. It seems that these types of experiment on their own might not be great for GO (even taking into account the HTP evidence code). To me these belong only in a proteomics database.

It also worries me when we are referring to the secretome of cancer cells. Aren't we annotating normal processes any more? Apart from the fact that a few lysed cells would provide lots of ribosomal contamination..... If the authors genuinely believe that the extracellular matrix has ribosomal proteins as normal components, wouldn't they try to demonstrate this by another method?

hattrill commented 6 years ago

Trying to tie up some loose ends. Had a look over the HTP spreadsheet https://docs.google.com/spreadsheets/d/11xExGJfj_39xPQUGkam3Xvtd6dtZ5DfANXhM2ZtDYB0/edit#gid=2144791301 (important page is the first, “Over_40_per_evidencecode”). Looks like these groups are done:

ARUK-UCL BHF-UCL dictyBase FlyBase ParkinsonsUK-UCL PomBase UniProt WB ZFIN

These groups still have papers that are not marked as reviewed:

MGI SGD TAIR AgBase EcoliWiki MTBBASE RGD

AgBase, EcoliWiki, MTBBASE and RGD might not have been aware of the HTP review, so I am tagging @jinhuiz, @jimhu-tamu, @ggeorghiou and @slaulederkind, who are down as contacts for these groups. If you would like to take part in the review, the relevant documentation can be found here http://wiki.geneontology.org/index.php/Guide_to_GO_Evidence_Codes. Please ask if you have any questions.

BTW: I am going to try to write a short paper for the ISB DATABASE Journal Biocuration issue about GO curation of HTP data (to be submitted end of Oct). I intend that anyone who has done any of the retrofits or been on the HTP calls, and the ontologists, will be listed as an author. I will get a version up for your comments in Oct.

srengel commented 6 years ago

I have just now completed marking the remaining SGD papers as reviewed.

pgarmiri commented 6 years ago

Hi, the MTBBASE annotations have been reviewed and the spreadsheet has been updated with the changes. Penelope

hattrill commented 6 years ago

Hello all!

Thought here would be the easiest way to catch as many as possible-

First draft of ISB DATABASE Journal Biocuration HTP paper can be found here https://docs.google.com/document/d/1UGXWy38134Lvq22qb2tVizIgNUhZKQC1x5-docMb-ro/edit in the GO google drive. Bit later than I anticipated (insert excuse). If you feel like beating it with a stick, that would be appreciated. Deadline for submission is 31st, so I would appreciate it if you could look at it as soon as possible.

Doc adding your name as author is here - I will go through the various lists, etc and add folk, but I don't want to miss anyone. If you had anything to do with the working group, review, ECO codes, devs put your name down - https://docs.google.com/document/d/16XW-iKZauIN9wisQDX77_fRt5adSyuGRoIF19eSjmWM/edit

If you need access, just shout.

hattrill commented 6 years ago

Hi all,

Just been and had a tidy up of the HTP review sheet and updated it - I've sorted them into those that have been done/finished (reviewed and, if needed removed/code updated). Here's the tally of those left:

https://docs.google.com/spreadsheets/d/11xExGJfj_39xPQUGkam3Xvtd6dtZ5DfANXhM2ZtDYB0/edit?ts=58d39700#gid=2144791301

geneontology / go-annotation

Review HTP annotations #1655