geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.

Xenbase GPI file contains exotic characters, prevents processing #4642

Closed kltm closed 1 year ago

kltm commented 1 year ago

The Xenbase GPI file at https://ftp.xenbase.org/pub/GenePageReports/xenbase.gpi.gz contains exotic characters:

    Xenbase XB-GENE-29077855        or16n1  olfactory receptor family 16 subfamily N member 1       LOC100485521^M|LOC100485521     gene    taxon:8364              NCBI_Gene:100485521
    Xenbase XB-GENE-29077856        or16n1.L        olfactory receptor family 16 subfamily N member 1 L homeolog    LOC100485521^M|LOC100485521     gene    taxon:8355
    Xenbase XB-GENE-29077857        or16n1.L        olfactory receptor family 16 subfamily N member 1 S homeolog    LOC100485521^M|LOC100485521     gene    taxon:8355

It looks like some kind of newline trimming did not occur.

This currently prevents the processing of xenbase for the NEO data load (tagging @vanaukenk @suzialeksander ).

@malcolmfisher103 I wanted to see if this was on your radar.
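
The `^M` shown in the synonym column above is caret notation for a carriage return (`\r`). A minimal sketch of how such characters could be located before a load is attempted is given below. The function names and the choice to flag every control character except tab are assumptions made for illustration, not part of the GO or Xenbase pipelines.

```python
import gzip
import re

# Control characters other than tab (tab is the GPI column separator).
# A stray carriage return, displayed as ^M, falls in this set.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0a-\x1f\x7f]")


def find_control_chars(lines):
    """Yield (line_number, column_number, repr(field)) for each tab-separated
    field that contains a control character. Header lines ('!') are skipped."""
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("!"):
            continue
        record = line.rstrip("\n")
        for col, field in enumerate(record.split("\t"), start=1):
            if CONTROL_CHARS.search(field):
                yield lineno, col, repr(field)


def scan_gpi(path):
    """Hypothetical helper: scan a gzipped GPI file on disk.

    newline="\n" keeps Python from treating an embedded '\r' as a line
    ending, which would otherwise split the record and hide the problem."""
    with gzip.open(path, "rt", encoding="utf-8", newline="\n") as handle:
        return list(find_control_chars(handle))


# Usage on an in-memory record shaped like the ones quoted above.
sample = [
    "Xenbase\tXB-GENE-29077855\tor16n1\tname\t"
    "LOC100485521\r|LOC100485521\tgene\ttaxon:8364\n"
]
hits = list(find_control_chars(sample))
```

If the file itself used Windows line endings, the final field of every record would also be flagged, so the report would need to be read with that in mind.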

malcolmfisher103 commented 1 year ago

Thanks for flagging this up, @kltm. We are rectifying this in the database and will add some new data hygiene checks to remove these characters in the future.

kltm commented 1 year ago

@malcolmfisher103 No problem! If you could tag us back when the new files are out, we'll retry to get a new version of NEO (i.e. noctua autocomplete) out.

malcolmfisher103 commented 1 year ago

@kltm I believe that exotic character issue has been fixed now.

kltm commented 1 year ago

Terrific---thank you! Testing now.

malcolmfisher103 commented 1 year ago

@kltm I'm afraid we might break your load again, some extra columns seem to have sneaked in from the reworked script for generating the file.
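
A check for unexpected extra columns could be sketched as follows. The expected count is a parameter; the default of 10 is an assumption based on the GPI 1.2 layout and would need to match whichever GPI version the file declares. The function name is hypothetical.

```python
def find_bad_column_counts(lines, expected=10):
    """Yield (line_number, column_count) for each data line whose number of
    tab-separated columns differs from `expected`.

    `expected=10` is an assumption (GPI 1.2); adjust for other versions.
    Header lines starting with '!' and blank lines are skipped."""
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("!") or not line.strip():
            continue
        count = len(line.rstrip("\n").split("\t"))
        if count != expected:
            yield lineno, count


# Usage: one well-formed record and one with two extra columns.
example = [
    "!gpi-version: 1.2\n",
    "\t".join(["x"] * 10) + "\n",
    "\t".join(["x"] * 12) + "\n",
]
bad = list(find_bad_column_counts(example))
```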

kltm commented 1 year ago

@malcolmfisher103 No worries--thank you for letting me know. I'll suspend the test load for now.

kltm commented 1 year ago

No-action update; test load still indicating issues.

    08:14:54  LINENO: 687060 - Clause: synonym; expected an xref list, or at least an empty list '[]' at pos: 23
    08:14:54  LINE: synonym: "LOC100485521        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)

kltm commented 1 year ago

@malcolmfisher103 I was wondering if there was any update on this? We are unable to build the NEO updates unless this is fixed, we drop the Xenbase file, or we create our own fixed file as the input. (Tagging @vanaukenk and @pgaudet )
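
The third option, creating a fixed copy of the file as input, could look roughly like the sketch below. This is a hypothetical stopgap shown for illustration only; the thread does not say such a script was written, and the function names and file paths are assumptions.

```python
import gzip
import re

# Control characters other than tab, which must survive as the column separator.
_CONTROL = re.compile(r"[\x00-\x08\x0a-\x1f\x7f]")


def clean_gpi_line(line):
    """Return the line with control characters removed from its body and a
    single trailing newline restored."""
    body = line.rstrip("\n")
    return _CONTROL.sub("", body) + "\n"


def write_clean_copy(src_path, dst_path):
    """Hypothetical helper: write a sanitized copy of a gzipped GPI file.

    newline="\n" on both handles prevents embedded '\r' characters from
    being interpreted as line endings on read or rewritten on output."""
    with gzip.open(src_path, "rt", encoding="utf-8", newline="\n") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8", newline="\n") as dst:
        for line in src:
            dst.write(clean_gpi_line(line))


# Usage on a single in-memory record.
fixed = clean_gpi_line("Xenbase\tor16n1\tLOC100485521\r|LOC100485521\tgene\n")
```

Stripping the character leaves the duplicated synonym in place; whether to collapse duplicates as well is a separate decision for whoever owns the file.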

malcolmfisher103 commented 1 year ago

@kltm I believe this issue is now resolved. I'm afraid Xenbase has been down for the last few days, so I wasn't able to reverify until now.

kltm commented 1 year ago

@malcolmfisher103 Terrific! I'll start a retest now.

kltm commented 1 year ago

@malcolmfisher103 We're still seeing the same issue as the one originally reported:

    16:26:41  LINENO: 687060 - Clause: synonym; expected an xref list, or at least an empty list '[]' at pos: 23
    16:26:41  LINE: synonym: "LOC100485521        org.semanticweb.owlapi.oboformat.OBOFormatOWLAPIParser.parse(OBOFormatOWLAPIParser.java:60)

from GPI lines like:

    Xenbase XB-GENE-29077855        or16n1  olfactory receptor family 16 subfamily N member 1       LOC100485521^M|LOC100485521     gene    taxon:8364              NCBI_Gene:100485521|UniprotKB:A0A8J0QLX0

malcolmfisher103 commented 1 year ago

@kltm My apologies. I checked that our script and data were updated and producing the correct GPI, but I didn't check that we had run the script to update the GPI file on our download site. I'll let you know when this has been done.

malcolmfisher103 commented 1 year ago

@kltm We have rerun the scripts to generate the file on the download site. Please try again with this updated file.

kltm commented 1 year ago

Cheers! Retrying now.

kltm commented 1 year ago

@malcolmfisher103 Thank you--NEO production is now proceeding.