WormBase GPI is splitting some lines causing neo pipeline to break

cmungall commented 6 years ago

File: ftp://ftp.wormbase.org/pub/wormbase/species/c_elegans/PRJNA13758/annotation/gene_product_info/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz

Some of the description lines include newlines, causing a single line to be split over two lines, breaking parsing. For example:

$ gzip -dc mirror/c_elegans.PRJNA13758.current.gene_product_info.gpi.gz | grep -n -B1 CELE_C33A11 
24676-WB        WBGene00007877  nfki-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24677:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    protein_coding_gene     taxon:6239              UniProtKB:G5EDE9
--
24678-WB        C33A11.1        nfki-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24679:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    transcript      taxon:6239      WB:WBGene00007877       
--
24680-WB        WP:CE24824      NFKI-1  Nuclear Factor of Kappa light polypeptide gene enhancer in b(B)-cells 
24681:Inhibitor, delta and zeta related nbid-1|CELE_C33A11.1    protein taxon:6239      WB:C33A11.1     UniProtKB_GCRP:G5EDE9|UniProtKB:G5EDE9
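
A minimal sketch of how a consumer might work around the split lines until the file is fixed. It assumes that every real record starts with the database name in column 1 (`WB` here) followed by a tab, and that header lines start with `!`; any other physical line is treated as a continuation of the previous record. The function name `rejoin_split_records` is hypothetical, and the heuristic would fail if a description fragment itself began with `WB` plus a tab.

```python
def rejoin_split_records(lines, db_prefix="WB"):
    """Merge physical lines that look like continuations of the previous record.

    Assumption: a genuine GPI record starts with `db_prefix` followed by a tab,
    and header/comment lines start with '!'.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        is_header = line.startswith("!")
        starts_record = line.startswith(db_prefix + "\t")
        previous_is_header = bool(records) and records[-1].startswith("!")
        if is_header or starts_record or not records or previous_is_header:
            records.append(line)
        else:
            # Continuation: glue onto the previous record with a single space.
            records[-1] = records[-1].rstrip(" ") + " " + line.lstrip(" ")
    return records
```

Joining with a single space restores the description as one field; whether the original text had a space at the break is not recoverable from the file, so this is a guess.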

cc @rankishore

vanaukenk commented 6 years ago

Okay, thanks @cmungall I'll pass this along to the Hinxton group who generate the file with each WB release.

vanaukenk commented 6 years ago

This issue has been fixed for the next WB release, which should be available on our ftp site later next week. @cmungall - shall I close this ticket?

cmungall commented 6 years ago

Let's close it when it percolates through - looks like it's still there

cmungall commented 6 years ago

Another issue is the variable number of columns. There should always be 10, even if the last one or two are null.

e.g. this one has 8:

WB       ZC247.1 ZC247.1         CELE_ZC247.1    transcript      taxon:6239      WB:WBGene00013859

9:

WB       WP:CE43614      ZC247.1         CELE_ZC247.1    protein taxon:6239      WB:ZC247.1      UniProtKB_GCRP:G5EBP5|UniProtKB:G5EBP5

7:

WBGene00271791  W03D2.15                CELE_W03D2.15   ncRNA_gene      taxon:6239
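
The column check described above can be sketched as follows. This assumes tab-separated data lines and `!`-prefixed header lines; `check_column_counts` and `pad_columns` are hypothetical helper names, not part of any existing GO tooling.

```python
EXPECTED_COLUMNS = 10  # per the comment above: always 10, trailing ones may be empty


def check_column_counts(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, column_count) for every data line with the wrong width."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        n_columns = len(line.split("\t"))
        if n_columns != expected:
            problems.append((lineno, n_columns))
    return problems


def pad_columns(line, expected=EXPECTED_COLUMNS):
    """Append empty trailing fields so a short line has `expected` columns."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (expected - len(fields))  # no-op if already wide enough
    return "\t".join(fields)
```

Padding only repairs lines whose missing columns are the trailing ones; a line that has lost a leading or interior column still needs to be fixed at the source.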

cmungall commented 6 years ago

Another issue:

UniProtKB_GCRP is not a prefix we have registered in db-xrefs.yaml:

WB      WP:CE10938      F53F1.4         CELE_F53F1.4    protein taxon:6239      WB:F53F1.4      UniProtKB_GCRP:Q9XVM6|UniProtKB:Q9XVM6

The xrefs should just be UniProtKB:Q9XVM6
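
A rough sketch of the prefix check. The set `REGISTERED_PREFIXES` is a hypothetical stand-in for the prefixes declared in db-xrefs.yaml (a real check would load them from that file), and the function name is illustrative. It assumes the xref column is a pipe-separated list of `prefix:local_id` entries, as in the line above.

```python
# Hypothetical subset; in practice this would be read from db-xrefs.yaml.
REGISTERED_PREFIXES = {"UniProtKB", "WB"}


def unregistered_xrefs(xref_field, registered=REGISTERED_PREFIXES):
    """Return the xrefs in a pipe-separated field whose prefix is not registered."""
    bad = []
    for xref in filter(None, xref_field.split("|")):
        prefix, _, _local_id = xref.partition(":")
        if prefix not in registered:
            bad.append(xref)
    return bad
```

For the example line, this flags `UniProtKB_GCRP:Q9XVM6` and accepts `UniProtKB:Q9XVM6`.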

vanaukenk commented 6 years ago

@cmungall We'll fix the columns issue. Wrt the GCRP, we had thought it might be useful to indicate in the file which UniProtKB accessions corresponded to the GCRP for a given WB gene. The different prefix might not have been the best approach, but perhaps we could indicate this information some way in the properties field, i.e. column 10. I'm not sure what the best property name and value would be, maybe something like UniProtKB_accession_type:GCRP. I'm open to suggestions on that part.

cmungall commented 6 years ago

@tonysawfordebi - any suggestions on indicating GCRP membership?

cmungall commented 6 years ago

Another issue I'm afraid:

Each entity ID should be present only once. The following has a dupe with a different symbol in each:

WB      WP:CE52235      C08E8.6         CELE_C08E8.6    protein taxon:6239      WB:C08E8.6      UniProtKB_GCRP:Q7YX97|UniProtKB:Q7YX97
WB      WP:CE52235      C08E8.9         CELE_C08E8.9    protein taxon:6239      WB:C08E8.9      UniProtKB_GCRP:Q7YX97|UniProtKB:Q7YX97
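
A minimal sketch for finding such duplicates, assuming tab-separated lines with the database in column 1, the object ID in column 2 and the symbol in column 3. The function name `find_duplicate_ids` is hypothetical.

```python
from collections import defaultdict


def find_duplicate_ids(lines):
    """Map (db, object_id) to the list of symbols for IDs that occur more than once."""
    symbols_by_id = defaultdict(list)
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # too short to carry an ID and a symbol
        symbols_by_id[(fields[0], fields[1])].append(fields[2])
    return {key: syms for key, syms in symbols_by_id.items() if len(syms) > 1}
```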

tonysawfordebi commented 6 years ago

@cmungall - in the gpi file that we generate for indexing protein metadata in QuickGO, we have a property - reference_proteome - in the properties column to indicate whether the protein is part of the reference proteome or not (actually, the value of the property is the internal identifier of the proteome, rather than a simple boolean flag). If the protein is part of the reference proteome for the species, we also have another property - is_isoform - that indicates whether the protein is an isoform or the canonical form. Another property that we set is db_subset, which indicates whether the entry is Swiss-Prot or TrEMBL.

vanaukenk commented 6 years ago

@cmungall - the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID for annotation or might need to select the protein if the specific genetic locus was not known.

@tonysawfordebi - yes, I had remembered these existing properties over the weekend :-). Here are the gpi properties (and values that I'm aware of) relating to sequences, including the ones you mention above:

 db_subset=TrEMBL or Swiss-Prot
 uniprot_proteome=UP000001940 (C. elegans, for example)
 is_isoform=?
 reference_proteome=?

Wrt the gpi files submitted by MODs, we (WB) thought it might be useful to indicate which of the UniProtKB accessions we reference were part of the GCRP. Looking at these property tags I'm not sure which one we should use and what makes most sense for the value. Would something like this work for the MOD files:

uniprot_gcrp=YES

Also, for properties like db_subset and reference_proteome, is it implicit that these properties refer to the source in column 1 or should we make the property names more explicit wrt the database?

Will need to update: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

tonysawfordebi commented 6 years ago

@vanaukenk - I'd forgotten about the gpi file that we generate for WB (and FB and dicty and SGD!)

Unlike the one that we generate for QuickGO indexing purposes, they don't include the reference_proteome and is_isoform properties, but they could if you feel that information would be useful. And yes, if we do include such properties in these files then we should probably do it by having a property called uniprot_gcrp that takes the values 'canonical' or 'isoform', and is omitted if the protein is not in the GCRP. Or something like that.
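
To make the proposal concrete, here is a sketch of reading and writing the properties column. It assumes the column holds pipe-separated `name=value` pairs, matching the `db_subset=...` style listed earlier in the thread. Note that `uniprot_gcrp` is only a proposal from this discussion, not an established property, and the helper names are hypothetical.

```python
def parse_properties(field):
    """Parse 'a=1|b=2' into {'a': ['1'], 'b': ['2']}; a name may repeat."""
    props = {}
    for item in filter(None, field.split("|")):
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"property without '=': {item!r}")
        props.setdefault(name, []).append(value)
    return props


def format_properties(props):
    """Inverse of parse_properties: serialise back to 'a=1|b=2'."""
    return "|".join(
        f"{name}={value}" for name, values in props.items() for value in values
    )
```

Under the proposal, a protein outside the GCRP would simply have no `uniprot_gcrp` key after parsing, so consumers could test membership with `"uniprot_gcrp" in props`.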

cmungall commented 6 years ago

> the duplicate WP:CE52235 protein ids are actually cases where two genes encode the same protein sequence. We have other cases like this, e.g. histones. For annotation, depending on the data being annotated, the curator could either select the unique gene ID for annotation or might need to select the protein if the specific genetic locus was not known.

You could specify multiple gene parents, as the parent field has cardinality>1.
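
One way this could look in practice, as a sketch only: rows that share the same database and object ID are collapsed into one, with the parent values combined into a pipe-separated list. The column positions (synonyms in column 5, parent in column 8) are assumed from the example lines in this thread, and `merge_by_id` is a hypothetical name. The symbol of the first row is kept, which is an arbitrary choice.

```python
SYNONYM_COL = 4  # column 5, zero-based (assumed position)
PARENT_COL = 7   # column 8, zero-based (assumed position)


def merge_by_id(rows):
    """Collapse rows sharing (db, object_id), unioning synonyms and parents."""
    merged = {}
    for fields in rows:
        key = (fields[0], fields[1])
        if key not in merged:
            merged[key] = list(fields)  # first occurrence wins for other columns
            continue
        kept = merged[key]
        for idx in (SYNONYM_COL, PARENT_COL):
            values = [v for v in kept[idx].split("|") if v]
            for v in fields[idx].split("|"):
                if v and v not in values:
                    values.append(v)  # preserve first-seen order, no duplicates
            kept[idx] = "|".join(values)
    return list(merged.values())
```

Applied to the two WP:CE52235 lines above, this yields a single row whose parent column reads `WB:C08E8.6|WB:C08E8.9`.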

Aside from this ticket, I'm wondering what our annotation policy is for histone genes and other analogous cases. I assume we just duplicate annotations to the identical genes?

btw, the db-xrefs yaml file seems to not have a way of resolving protein entries, and I can't find this in wormbase https://wormbase.org/search/protein/CE52235

khowe commented 6 years ago

I will make sure all these issues get fixed.

For the resolution of protein entries, the correct local id is actually WP:CE52235 (https://wormbase.org/search/protein/WP:CE52235). These additional prefixes are an anachronism and confuse things when forming CURIEs (i.e. should the CURIE be WB:WP:CE52235? Or is "WP" a resource?).

In the next release of WB (which we will start preparing in a few weeks), we will drop these prefixes. The local id for the above will then be simply CE52235.

As for the global_id / CURIE, that depends on how we choose to solve the bigger picture of making sure that all front-line ids in WormBase resolve. One way of doing this (proposed by @cmungall) is for us to write our own resolver that will recognise all of our local ids and resolve them to the correct page. This would allow us to make all of our CURIEs have the form "WB:XXXXX". For a small number of specific data types though, this will not be possible (e.g. "JC8.10a" is an identifier for distinct CDS and Transcript objects in WB).
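
For illustration only, the "drop the legacy prefix, then form a WB CURIE" idea could be sketched like this. Nothing here reflects a decided WormBase scheme: `to_curie` and `LEGACY_PREFIXES` are hypothetical names, and ambiguous ids such as "JC8.10a" would still need the object type to resolve.

```python
# Legacy local-id prefixes to strip; only "WP:" is mentioned in this thread.
LEGACY_PREFIXES = ("WP:",)


def to_curie(local_id, db="WB"):
    """Strip a known legacy prefix and return a CURIE of the form 'WB:XXXXX'."""
    for prefix in LEGACY_PREFIXES:
        if local_id.startswith(prefix):
            local_id = local_id[len(prefix):]
            break
    return f"{db}:{local_id}"
```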

khowe commented 6 years ago

@cmungall @tonysawfordebi Regarding the duplication issue, the GPI format spec states that the Parent field is cardinality "0 or 1".

Also, we are trying to represent the central dogma using the Parent column, GFF3-style. That is, for a protein line, we are populating the Parent column with id of the transcript from which it is translated. Reading the spec though, it seems that this is not really what the Parent field was intended to represent. It seems very UniProt-centric (perhaps unsurprisingly).

vanaukenk commented 6 years ago

Thanks @khowe

@cmungall Wrt annotation, in WB at least, we typically associate GO annotations with WBGenes and many of our experiments are genetically based, so that works fairly well. However, there may certainly be cases where an experiment demonstrates something about a protein sequence that is shared amongst different genes. In that case, we probably would not annotate anything to a WBGene ID because we couldn't be certain that the annotation would be correct for all of the genes.

The way our WB protein IDs work right now, though, if we ever needed to specifically indicate, for example, the histone protein encoded by the his-2 locus, I don't believe we could do it since that histone protein sequence is shared amongst 15 different loci. In practice, this hasn't happened that much for WB GO curation, but maybe it's worth thinking about if/how that could be handled in the future. @khowe what are your thoughts?

cmungall commented 6 years ago

Which GPI docs are you referring to? I regard the formal spec in markdown as canonical: https://github.com/geneontology/go-annotation/blob/master/specs/gpad-gpi-1_2.md

Multiple parents are allowed. There is shockingly little documentation on the semantics of this field, but the intent was analogous to GFF3/Chado.

Somehow the docs also spread to the wiki and Drupal, and these may be out of sync. We have not done a good job of coordinating this.

khowe commented 6 years ago

@cmungall this one: http://www.geneontology.org/page/gene-product-information-gpi-format

cmungall commented 6 years ago

Thanks. We need to unify these

vanaukenk commented 6 years ago

@cmungall Unfortunately, I'm not sure if any groups have referred to the md version of the gpad/gpi documentation as the official documentation. We really do need to sort this all out before onboarding more groups. Let me know if/how I can help.

ukemi commented 6 years ago

When I wrote our requirements doc for the GPI file, I used this: http://www.geneontology.org/page/gene-product-information-gpi-format I'm not even sure the md file was available at that point. At any rate, we assumed the one on the GO web site was the official spec.

ukemi commented 6 years ago

But it appears that we also updated this page:

http://wiki.geneontology.org/index.php/Proposed_GPI1.2_format

khowe commented 6 years ago

Okay, most of these issues have already been fixed in the latest WormBase GPI:

ftp://ftp.ebi.ac.uk/pub/databases/wormbase/releases/WS265/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WS265.gene_product_info.gpi.gz

This will be propagated to ftp.wormbase.org with a release-neutral URL in the next few weeks.

The duplication issue is still present. I hear @cmungall's assertion that the GitHub version of the spec is authoritative, and will make the change to have one line for each protein, with multiple Parents where appropriate. However, I am somewhat confused by multiple (different) versions of the spec floating around that all call themselves version "1.2".

pgaudet commented 4 years ago

Can this be closed?

kltm commented 4 years ago

I have no memory of this. Closing.

geneontology / go-site

WormBase GPI is splitting some lines causing neo pipeline to break #595