PROconsortium / PRoteinOntology

S. pombe protein complex terms (many) #106

Open nataled opened 9 years ago

nataled commented 9 years ago

Hello,

For GO annotations and network representation (we're using esyN - www.esyn.org), we would find it very useful to have a set of PRO entries for S. pombe complexes. We maintain a list of GO cellular component complex terms and annotated genes that we hope is a good starting point.

May we have PRO terms/ids/etc. for the pombe versions of the complexes in this list (one per GO term)? For us, the ones with PMID references in the "source" column are higher priority than the rest.

ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Complexes/Complex_annotation

(with explanation ftp://ftp.ebi.ac.uk/pub/databases/pombase/pombe/Complexes/README)

If you need us to attach or email a copy of the file, or if you have any problems or questions, please let us know.

Thanks! Midori (and the rest of the PomBase curators)

Reported by: mah11

nataled commented 9 years ago

Hi Midori,

This can be done, even automated, but I see several questionable cases that lead me to think the list isn't complete in terms of complex components, and this is not even counting the lack of cardinality or modification information. For example, how likely is it that GO:0071014 "post-mRNA release spliceosomal complex" contains only a single type of protein? I see other complexes with similar issues. Technically speaking, we don't need to even indicate the subunits, and don't even need to indicate all the components, so the terms can be made, but probably you want more than just "protein complex X (S. pombe)" and probably want to avoid the misleading look of failing to indicate all.

Please let me know how you'd like to proceed.

Darren

Original comment by: nataled

nataled commented 9 years ago

Hi Darren,

Thanks for commenting so promptly on this request, and for the excellent questions. I'll have to ask Val to address your concerns, because the complex inventory file is something she's been curating sort-of-manually from PomBase's GO cellular component annotation set. I therefore don't have nearly as good a sense as she would of how complete the inventory is, and in particular of the balance between incompleteness that reflects incomplete curation and incompleteness that reflects a lack of knowledge available for us to curate.

We may be able to provide a shorter list of complexes that are more nearly completely characterized, and that we most want to see represented in PRO (for example, I have a GO annotation I could hang on an S. pombe RFC complex ID from a paper I was just reading an hour ago).

Midori

Original comment by: mah11

nataled commented 9 years ago

Sorry, we should have sent you only the experimental ones for starters, that is, anything in the file with a "PMID". So ignore anything with PomBase GO_REF:0000002/IEA or GO_REF:0000024/ISO.
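
A minimal sketch of that filter, assuming the file is tab-separated with a header row and that the reference column is named `source`. The column names, the `experimental_rows` helper, and the sample rows (apart from hcr1, GO:0070993 and PMID:16079914, which appear in this thread) are made up for illustration and are not the real file contents:

```python
import csv
import io

def experimental_rows(tsv_text, source_column="source"):
    """Keep only the rows whose source cell cites at least one PMID."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if "PMID:" in (row.get(source_column) or "")]

# Hypothetical sample; the real column layout may differ.
sample = (
    "go_id\tgene\tevidence\tsource\n"
    "GO:0070993\tsubunit_x\tIDA\tPMID:16079914\n"
    "GO:0070993\thcr1\tIEA\tGO_REF:0000002\n"
    "GO:0070993\tsubunit_y\tISO\tGO_REF:0000024\n"
)

kept = experimental_rows(sample)
for row in kept:
    print(row["go_id"], row["gene"], row["source"])
```

Matching on the `PMID:` prefix rather than on the evidence code keeps the rule identical to the one stated above: anything with a PMID stays, anything that cites only a GO_REF is dropped.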

All of the ISO data are manually curated, and I usually try to identify every subunit in a complex when I do ISO from SGD. However, for some things (like spliceosomal subcomplexes) I have not done this thoroughly if there is a splicing complex grouping term. I will annotate the other subunits of the 'spliceosomal disassembly complex' tomorrow.

I also just noticed that the final 2 column headers are incorrect: "xref_dbname source" should be "source xref_dbname". I thought that this was corrected, so we need to check that the file was correctly updated. I am pretty certain that it wasn't, as the new version of the file should have "|"-separated PMIDs if there are multiple papers, and I don't detect any pipes in the file on our ftp site....

Apologies....

Val

Original comment by: ValWood

nataled commented 9 years ago

The version of the file is correct; the PMIDs are comma separated, not pipe separated, e.g. PMID:16079914,PMID:16079914. It is just the headers that are incorrect. I will get this fixed.
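
For anyone parsing the reference column, a small sketch of splitting a comma-separated cell and dropping repeats. The `split_pmids` helper is hypothetical; the only input taken from the file is the example cell quoted above:

```python
def split_pmids(cell):
    """Split a comma-separated reference cell, dropping repeats but keeping order."""
    unique = []
    for token in cell.split(","):
        token = token.strip()
        if token and token not in unique:
            unique.append(token)
    return unique

print(split_pmids("PMID:16079914,PMID:16079914"))
```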

Val

Original comment by: ValWood

nataled commented 9 years ago

I'm pretty sure the file headers are correct; they're just a bit cryptic. "Source" means the reference.

Original comment by: mah11

nataled commented 9 years ago

You are correct. Will clarify the column headers. Sorry for the confusion. Val

Original comment by: ValWood

nataled commented 9 years ago

Val, I'm not sure I understood your message about which to ignore. On the one hand you say to ignore GO_REF:0000024/ISO, but then you say you manually verified these (by the way, I also see GO_REF:0000024 associated with ISM). If they are verified, should they not be included?

Another question: how should I handle cases where, say, one component is IEA but all others have "good" codes? For example, hcr1 in GO:0070993 is IEA while others are experimental (note: I'm ignoring the IEA part of those that have multiple codes; not sure why there are things like 'IEA,IEA').
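
To make the question concrete, here is a rough sketch of how such mixed cases could be flagged. It assumes each row reduces to a (GO ID, gene, evidence cell) triple with comma-separated evidence codes. The `mixed_complexes` helper and every row other than hcr1 in GO:0070993 are made up for illustration:

```python
from collections import defaultdict

def mixed_complexes(rows):
    """Map each GO ID to its IEA-only genes, but only for complexes in which
    at least one other member carries a non-IEA evidence code."""
    members = defaultdict(dict)
    for go_id, gene, evidence_cell in rows:
        # A set collapses repeated codes such as 'IEA,IEA'.
        codes = {c.strip() for c in evidence_cell.split(",") if c.strip()}
        members[go_id].setdefault(gene, set()).update(codes)

    mixed = {}
    for go_id, genes in members.items():
        iea_only = sorted(g for g, codes in genes.items() if codes == {"IEA"})
        if iea_only and len(iea_only) < len(genes):
            mixed[go_id] = iea_only
    return mixed

# Hypothetical rows; only hcr1 / GO:0070993 comes from the actual file.
rows = [
    ("GO:0070993", "hcr1", "IEA"),
    ("GO:0070993", "subunit_x", "IDA,IEA"),  # IEA part ignored: IDA is present
    ("GO:0000000", "subunit_y", "IEA,IEA"),  # IEA-only complex, so not "mixed"
]
flagged = mixed_complexes(rows)
print(flagged)
```

A gene with multiple codes counts as IEA-only here only when IEA is its sole code, which matches ignoring the IEA part of multi-code entries.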

Original comment by: nataled

nataled commented 9 years ago

Some numbers: There are 446 GO complexes listed. Of these, if we ignore those that have any IEA or ISO component, we lose about one-fourth. If we ignore those that do not have a PMID for all components, we lose half.
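
A sketch of how counts like these can be produced, not the actual script used. It assumes one row per (complex, gene) with a comma-separated evidence cell and a source cell, and it treats a component as IEA/ISO only when it has no other code. The `summarize` helper and the sample rows are hypothetical:

```python
from collections import defaultdict

INFERRED = {"IEA", "ISO"}

def summarize(rows):
    """Count complexes overall, those with an IEA/ISO-only component,
    and those where at least one component lacks a PMID."""
    by_complex = defaultdict(list)
    for complex_id, gene, evidence_cell, source_cell in rows:
        codes = {c.strip() for c in evidence_cell.split(",") if c.strip()}
        inferred_only = bool(codes) and codes <= INFERRED
        by_complex[complex_id].append((inferred_only, "PMID:" in source_cell))

    return {
        "total": len(by_complex),
        "with_inferred_component": sum(
            1 for m in by_complex.values() if any(inf for inf, _ in m)
        ),
        "missing_pmid": sum(
            1 for m in by_complex.values() if not all(pmid for _, pmid in m)
        ),
    }

# Hypothetical rows, for illustration only.
rows = [
    ("complex_1", "g1", "IDA", "PMID:16079914"),
    ("complex_1", "g2", "IEA", "GO_REF:0000002"),
    ("complex_2", "g3", "IDA,IEA", "PMID:16079914"),
    ("complex_3", "g4", "ISO", "GO_REF:0000024"),
]
counts = summarize(rows)
print(counts)
```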

Original comment by: nataled

nataled commented 9 years ago

Hi Darren,

Sorry I wasn't clear. I hope the following clears up some confusions.

  1. I had spotted the duplicated evidence codes and have already reported this.

  2. All of the ISO/ISM annotations are manually curated, but inferred from sequence similarity. There is likely no experimental data for these as yet, but so far, for complexes in S. cerevisiae where the members are conserved 1:1 in pombe, the complexes have had identical composition.

  3. For some EXP described complexes we also have cardinality data, but we have not exported this. We can make this available to you too.

  4. There are only 112 IEAs. We will try to resolve these over the next couple of months by manually annotating the ones which are split between experimental and IEA codes, and suppressing some which are to 'generic' grouping complex or component terms.

  5. What type of modification data do you include in the complex entries?

There is no hurry for this, we just wanted to get this in motion so we could make annotations to complexes and create complex pages in PomBase.

We can clean up this file over the next couple of months and take it from there.

Best

Val

Original comment by: ValWood

nataled commented 9 years ago

Cross posted, my comments should address this one too

Original comment by: ValWood

nataled commented 9 years ago

Some comments on your list of comments:

1) The duplicate codes don't bother me; already wrote a script to clean them.

2) IMO anything manually verified is good to go. In PRO we do have complexes that are inferred by comparison with those in other organisms.

3) Cardinality would be excellent!

4) Not sure what you mean by "generic grouping complex or component terms." Of the 112 IEAs, only 33 are associated with complexes with otherwise 'better' evidences.

5) We can include all kinds of modifications. For example, you might want to specify that a particular complex contains a phosphorylated form of a protein. See PR:000037300 for an example. It would be no problem to make the complexes first, then change the components to something more specific later, if you'd like.

Consider it in motion! I won't make a further move on it until I get the word from you. My preference is to do all the eligible ones at once rather than something like "only those with PMIDs first, then ISOs later." However, if the need for a specific complex (or limited set of them) arises before the bulk are ready, we'll make them right away.

Original comment by: nataled

nataled commented 9 years ago

Re 4)

Too general / not a ‘specific complex’ / not sure that they are a complex in pombe / or will be replaced by a more specific annotation in the term set below; will filter:

GO:0000015 phosphopyruvate hydratase complex
GO:0000148 1,3-beta-D-glucan synthase complex
GO:0000159 protein phosphatase type 2A complex
GO:0000786 nucleosome
GO:0002178 palmitoyltransferase complex
GO:0005891 voltage-gated calcium channel complex
GO:0005952 cAMP-dependent protein kinase complex
GO:0030118 clathrin coat
GO:0030119 AP-type membrane coat adaptor complex
GO:0030130 clathrin coat of trans-Golgi network vesicle
GO:0030131 clathrin adaptor complex
GO:0031515 tRNA (m1A) methyltransferase complex
GO:0032300 mismatch repair complex
GO:0032301 MutSalpha complex
GO:0032302 MutSbeta complex
GO:0033573 high-affinity iron permease complex
GO:0034703 cation channel complex
GO:0034704 calcium channel complex
GO:0034707 chloride channel complex
GO:0042765 GPI-anchor transamidase complex
GO:0043527 tRNA methyltransferase complex
GO:0071010 prespliceosome
GO:1902562 H4 histone acetyltransferase complex
GO:0097346 INO80-type complex
GO:0043189 H4/H2A histone acetyltransferase complex
GO:0031332 RNAi effector complex

Will manually annotate (more specifically in some cases), or remove:

GO:0000930 gamma-tubulin complex
GO:0000932 cytoplasmic mRNA processing body
GO:0000346 transcription export complex
GO:0000347 THO complex
GO:0000444 MIS12/MIND type complex
GO:0000930 gamma-tubulin complex
GO:0000932 cytoplasmic mRNA processing body
GO:0005643 nuclear pore
GO:0005663 DNA replication factor C complex
GO:0005665 DNA-directed RNA polymerase II, core complex
GO:0005680 anaphase-promoting complex
GO:0005684 U2-type spliceosomal complex
GO:0005760 gamma DNA polymerase complex
GO:0005852 eukaryotic translation initiation factor 3 complex
GO:0005960 glycine cleavage complex
GO:0008180 COP9 signalosome
GO:0008280 cohesin core heterodimer
GO:0008622 epsilon DNA polymerase complex
GO:0016282 eukaryotic 43S preinitiation complex
GO:0016442 RISC complex
GO:0016591 DNA-directed RNA polymerase II, holoenzyme
GO:0016602 CCAAT-binding factor complex
GO:0022627 cytosolic small ribosomal subunit
GO:0030119 AP-type membrane coat adaptor complex
GO:0030688 preribosome, small subunit precursor
GO:0030870 Mre11 complex
GO:0031011 Ino80 complex
GO:0031515 tRNA (m1A) methyltransferase complex
GO:0032040 small-subunit processome
GO:0033290 eukaryotic 48S preinitiation complex
GO:0035267 NuA4 histone acetyltransferase complex
GO:0043564 Ku70:Ku80 complex
GO:0043599 nuclear DNA replication factor C complex
GO:0070390 transcription export complex 2
GO:0070993 translation preinitiation complex
GO:0071004 U2-type prespliceosome
GO:1990077 primosome complex
GO:0000812 Swr1 complex
GO:0042575 DNA polymerase complex

Unsure, but one of the above will happen with these:

GO:0009316 3-isopropylmalate dehydratase complex
GO:0009331 glycerol-3-phosphate dehydrogenase complex
GO:0009349 riboflavin synthase complex
GO:0032777 Piccolo NuA4 histone acetyltransferase complex
GO:0032797 SMN complex
GO:0035339 SPOTS complex (I don’t know what this is)
GO:0097361 CIA complex

Original comment by: ValWood

nataled commented 9 years ago

Re 1) The duplicate codes don't bother me; already wrote a script to clean them.

This should now be fixed in our next release

Re 3) Cardinality

What we have is along these lines: "heteromeric(2) Thakurta AG et al. (2004)". So we do not say explicitly which units these apply to, but the info is there, and at least we know which publication it is in. We will extend this so we also know which complex it applies to if there are multiple complexes. A small oversight!
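
A rough sketch of reading a value in that shape, assuming the cardinality term always looks like a word followed by a count in parentheses. That is an assumption based on the single example above, and `parse_cardinality` is a hypothetical helper, not part of any PomBase export:

```python
import re

# Assumed shape: a word, then an integer count in parentheses, e.g. "heteromeric(2)".
_CARDINALITY = re.compile(r"^\s*([A-Za-z]+)\((\d+)\)\s*$")

def parse_cardinality(text):
    """Return (kind, count) for strings like 'heteromeric(2)', else None."""
    match = _CARDINALITY.match(text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))

print(parse_cardinality("heteromeric(2)"))
```

Returning `None` for anything that does not match keeps unexpected values visible instead of silently guessing a count.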

Re 5) We will arrange for the modification data to be exported, this has been on the to do list for a while. It will be in this format: http://www.pombase.org/submit-data/modification-bulk-upload-file-format

Give us a while to tidy the IEAs and we will send you a new version of the file.

Thanks for your speed and attention, as always!

Val

Original comment by: ValWood

nataled commented 6 years ago

Original comment by: nataled

nataled commented 6 years ago

Going through old requests and closing those that were finished long ago or marking as "pending" those that await input from the requester. If your request is marked Pending, please advise as to whether the request has been satisfactorily addressed or is no longer needed.

Original comment by: nataled

nataled commented 6 years ago

I think we would still like to have the requested terms eventually, but it isn't urgent for us. "Pending" status fits our situation just fine.

Original comment by: mah11