agroportal / project-management

Repository used to consolidate documentation about the AgroPortal project and track content related issues.
http://agroportal.lirmm.fr

AGROVOC does not pull automatically and does not parse anymore #178

Closed jonquet closed 2 years ago

jonquet commented 2 years ago

Since the July 2021 release, AGROVOC no longer parses.

Screenshot 2022-01-11 at 17 58 51

In addition, we have a pullLocation set for AGROVOC: http://data.agroportal.lirmm.fr/ontologies/AGROVOC/submissions/16?display=pullLocation However, the ontology never gets updated automatically (I have to do it manually each month).

The January or February release is expected soon, which will let us run new tests.

syphax-bouazzouni commented 2 years ago

AGROVOC issue diagnosis

State of last submission

See status

image

See logs

The log file for the last AGROVOC submission can be found at this path: /srv/ontoportal/data/repository/AGROVOC/16/parsing.log

Conclusion from logs

The error is a java.lang.OutOfMemoryError, as seen in the screenshot above (where we restarted the parsing of the last submission). image
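To confirm this kind of conclusion without reading the whole log by hand, the file can be scanned for the error name. The following is a minimal sketch, not part of the AgroPortal tooling: the function names are hypothetical, and the path is simply the one quoted above.

```python
from pathlib import Path

# Path quoted in the comment above; adjust it for another submission or install.
LOG_PATH = Path("/srv/ontoportal/data/repository/AGROVOC/16/parsing.log")


def find_oom_lines(lines):
    """Return (line_number, text) pairs for lines mentioning OutOfMemoryError."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if "OutOfMemoryError" in line
    ]


def scan_log_file(path=LOG_PATH):
    """Open a parsing log (hypothetical helper) and return its matching lines."""
    # errors="replace" keeps the scan going even if the log has odd bytes.
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_oom_lines(handle)
```

Calling `scan_log_file()` on the server would list every line of the log that mentions the error, together with its line number.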

syphax-bouazzouni commented 2 years ago

AGROVOC issue diagnosis (continued)

The problem

AGROVOC can't be parsed because of an OutOfMemoryError

Solutions

1 - It's an OWL API issue, so it needs to be reported to their repository (to do).

2 - I tried changing the format of the file from nq to nt, and it seems to work.

It doesn't work with http://agrovoc.uniroma2.it/latestAgrovoc/agrovoc_lod.nq.zip but works with http://agrovoc.uniroma2.it/latestAgrovoc/agrovoc_lod.nt.zip image

But we now have a new problem: image

jonquet commented 2 years ago

I recall we had an issue parsing the AGROVOC files with the OWL API (see the email discussion from September 2020) because of an import of the RDFS triples. This was certainly fixed in the nq version that we parsed until July 2021. I am not sure we ever parsed the nt file.

I am focusing on the OutOfMemory error for now.

jonquet commented 2 years ago

Relevant post about the issue: https://stackoverflow.com/questions/52712321/outofmemoryerror-when-joining-a-list-of-strings-in-java

It seems the OWL API tries to create a string that is too large.

jonquet commented 2 years ago

Error (OutOfMemoryError) reproduced by @jvendetti when parsing the nq file "outside" of the AgroPortal stack. Note: the nt file parses.

jonquet commented 2 years ago

An update:

rapper: Serializing with serializer ntriples
rapper: Error -  - XML parser error: Char 0xFFFF out of allowed range
rapper: Error -  - XML parser error: PCDATA invalid Char value 65535
rapper: Failed to parse file /srv/ncbo/repository/AGROVOC/1/owlapi.xrdf rdfxml content
rapper: Parsing returned 8673139 triples
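Both rapper errors point at U+FFFF (decimal 65535), a code point that the XML 1.0 "Char" production excludes. A short sketch like the one below could help locate such characters in a text file before handing it to a parser. It is only an illustration: the function name is hypothetical, and it checks characters, not RDF/XML validity.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
INVALID_XML_CHAR = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(lines):
    """Yield (line_number, column, code_point) for each disallowed character.

    Line and column numbers are 1-based; the code point is formatted as U+XXXX.
    """
    for line_number, line in enumerate(lines, start=1):
        for match in INVALID_XML_CHAR.finditer(line):
            yield line_number, match.start() + 1, f"U+{ord(match.group()):04X}"
```

Passing an open file handle (for example one opened with `encoding="utf-8"`) works because the function only iterates over lines, so a large file is not loaded into memory at once.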

When generating an RDF/XML file with Protégé and re-opening that same file with Protégé, the error shows up again, but this time with a line number:

image

Which brings us to the URI: http://aims.fao.org/aos/agrovoc/xDef_8f48da66

image

jonquet commented 2 years ago

Fixing the character allows parsing. We then encounter another issue, described in the log:

image

Probably linked to the recent changes on indexing fields.

syphax-bouazzouni commented 2 years ago

Indexing error fixed here: https://github.com/ontoportal-lirmm/goo/commit/ba27011fd2b093ff04d522477010d146602d0b62

AGROVOC is now parsed and indexed, but we hit this issue in the diff process: https://github.com/agroportal/documentation/issues/246

So I don't think the automatic pull will work. To be followed up in future releases here: https://github.com/agroportal/documentation/issues/251