freme-project / basic-services

Apache License 2.0

[Pipelines] pipeline with nif-version=2.1 fails #113

Open jnehring opened 8 years ago

jnehring commented 8 years ago

This curl

curl -X POST -H "Content-Type: application/json" -H "Cache-Control: no-cache" -H "Postman-Token: cc799c16-5d39-accf-b81d-1aa4a48fb5c9" -d '[
{
  "method": "POST",
  "endpoint": "https://api-dev.freme-project.eu/current/e-entity/freme-ner/documents",
  "parameters": {
    "language": "en",
    "dataset": "dbpedia",
    "nif-version": "2.1"
  },
  "headers": {
    "content-type": "text/html",
    "accept": "text/turtle"
  },
  "body": "<p>This summer there is the Zomerbar in Antwerp, one of the most beautiful cities in Belgium.</p>"
},
{
  "method": "POST",
  "endpoint": "https://api-dev.freme-project.eu/current/e-terminology/tilde",
  "parameters": {
    "source-lang": "en",
    "target-lang": "de",
    "nif-version": "2.1"
  },
  "headers": {
    "content-type": "text/turtle",
    "accept": "text/html"
  }
}
]
' "http://api-dev.freme-project.eu/current/pipelining/chain"

fails with this error message:

{
  "exception": "eu.freme.common.exception.InternalServerErrorException",
  "path": "/pipelining/chain",
  "message": "For input string: \"//freme-project.eu/#offset_38_45\"",
  "error": "Internal Server Error",
  "status": 500,
  "timestamp": 1475653048652
}

It works when I remove nif-version=2.1 from both API calls.

jnehring commented 8 years ago

The error seems to originate from within internationalization.

Error log:

ERROR   2016-10-05 11:06:28,805 [http-nio-8089-exec-1] eu.freme.bservices.controllers.pipelines.PipelinesController  - For input string: "//freme-project.eu/#offset_38_45"
java.lang.NumberFormatException: For input string: "//freme-project.eu/#offset_38_45"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:569)
        at java.lang.Integer.valueOf(Integer.java:766)
        at eu.freme.bservices.internationalization.okapi.nif.converter.HTMLBackConverter$TextUnitResource.<init>(HTMLBackConverter.java:450)
        at eu.freme.bservices.internationalization.okapi.nif.converter.HTMLBackConverter.listTextUnitResources(HTMLBackConverter.java:359)
        at eu.freme.bservices.internationalization.okapi.nif.converter.HTMLBackConverter.convertBack(HTMLBackConverter.java:165)
        at eu.freme.bservices.internationalization.okapi.nif.converter.HTMLBackConverter.convertBack(HTMLBackConverter.java:115)
        at eu.freme.bservices.internationalization.okapi.nif.converter.HTMLBackConverter.convertBack(HTMLBackConverter.java:82)
        at eu.freme.bservices.internationalization.api.InternationalizationAPI.convertBack(InternationalizationAPI.java:131)
        at eu.freme.bservices.controllers.pipelines.core.Conversion.convertBack(Conversion.java:62)
        at eu.freme.bservices.controllers.pipelines.core.PipelineService.chain(PipelineService.java:115)
        at eu.freme.bservices.controllers.pipelines.PipelinesController.pipeline(PipelinesController.java:92)
        at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.springframework.web.method.support.InvocableHandlerMethod.doInvok

and

ERROR   2016-10-05 11:06:28,807 [http-nio-8089-exec-1] eu.freme.common.exception.ExceptionHandlerService  - Request: http://rv1443.1blu.de:8089/pipelining/chain raised
eu.freme.common.exception.InternalServerErrorException: For input string: "//freme-project.eu/#offset_38_45"
        at eu.freme.bservices.controllers.pipelines.PipelinesController.pipeline(PipelinesController.java:120)
        at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)

katia-vistatec commented 8 years ago

Hi, I tested using these two requests in sequence:

curl -X POST --header "Content-Type: text/html" --header "Accept: text/html" --header "Cache-Control: no-cache" --data "@input.txt" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=spot%2Clink&nif-version=2.1" > output.txt

curl -X POST --header "Content-Type: text/html" --header "Accept: text/html" --header "Cache-Control: no-cache" --data "@output.txt" "http://api-dev.freme-project.eu/current/e-terminology/tilde?source-lang=en&target-lang=de&nif-version=2.1" > out-output.txt

Here input.txt contains the text/html input, and output.txt is the text/html output of the first request, which is then sent as input to the second request. The files are attached below. I do not get the error. Can you try again now?

katia-vistatec commented 8 years ago

input.txt output.txt out-output.txt

jnehring commented 8 years ago

The error happens when executing the pipeline. I could not reproduce it using individual curl commands. The pipeline does not convert html -> turtle -> html in every step. It converts from html -> turtle at the beginning, performs all pipeline steps in turtle, and converts back to html at the end. So the equivalent curl commands are:

curl -X POST --header "Content-Type: text/html" --header "Accept: text/turtle" --header "Cache-Control: no-cache" --data "@input.txt" "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=spot%2Clink&nif-version=2.1" > output.txt

curl -X POST --header "Content-Type: text/turtle" --header "Accept: text/turtle" --header "Cache-Control: no-cache" --data "@output.txt" "http://api-dev.freme-project.eu/current/e-terminology/tilde?source-lang=en&target-lang=de&nif-version=2.1"

Executing these two API requests one after another works. Neither of them converts the result back to HTML, so I think the problem occurs when the output HTML is created at the end of the pipeline.

katia-vistatec commented 8 years ago

Based on the log, a java.lang.NumberFormatException occurs in HTMLBackConverter.java, so in the second step of the pipeline (the terminology call), specifically when extracting the begin index from "//freme-project.eu/#offset_38_45". I think there is a problem with the nif-version parameter: even though nif-version=2.1 is set, the parameter is not received correctly and the version defaults to 2.0. The converter then parses a NIF 2.1 document as if it were NIF 2.0 and fails with a NumberFormatException when it tries to read the begin index, because it expects the wrong identifier format. Maybe we could add some logging to verify the NIF version.

katia-vistatec commented 8 years ago

I debugged locally using this curl:

curl -X POST -H "Content-Type: application/json" -H "Cache-Control: no-cache" -H "Postman-Token: cc799c16-5d39-accf-b81d-1aa4a48fb5c9" --data "@json.txt" "http://localhost:8080/pipelining/chain"

and with the attached json.txt (see the attachment below), in which I use http://localhost:8080/e-terminology/tilde as the endpoint. I found that the nifVersion parameter that arrives at the InternationalizationAPI.java method Reader convertBack(InputStream markupsFile, InputStream enrichedFile, String nifVersion) is null. This causes the problem described above: a NIF 2.1 document is handled as if it were NIF 2.0 (when no value is set for the nif-version parameter, the version defaults to 2.0), so the string freme-project.eu/#offset_38_45 is not parsed correctly, which causes the error.
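
To illustrate the failure mode, here is a minimal sketch of how a NIF 2.0-style offset parser can produce exactly this message. It is a guess at the mechanism, not the actual HTMLBackConverter code: the class and method names are hypothetical, and it assumes the begin index is located by searching for the "#char=" marker. With a "#offset_38_45" URI the marker is missing, indexOf returns -1, and the substring starts right after "http:".

```java
final class OffsetParsingSketch {

    // Hypothetical NIF 2.0-style parser: expects URIs ending in "#char=<begin>,<end>".
    static int beginIndexNif20(String uri) {
        String marker = "#char=";
        // If the marker is absent, indexOf returns -1, so the substring starts at
        // index 5, i.e. right after "http:".
        String offsets = uri.substring(uri.indexOf(marker) + marker.length());
        return Integer.valueOf(offsets.split(",")[0]);
    }

    // Version-aware variant: picks marker and separator from the NIF version.
    // Any version other than "2.1" (including null) is treated as 2.0.
    static int beginIndex(String uri, String nifVersion) {
        boolean nif21 = "2.1".equals(nifVersion);
        String marker = nif21 ? "#offset_" : "#char=";
        String separator = nif21 ? "_" : ",";
        int pos = uri.indexOf(marker);
        if (pos < 0) {
            throw new IllegalArgumentException(
                "URI " + uri + " does not look like NIF " + (nif21 ? "2.1" : "2.0"));
        }
        return Integer.parseInt(uri.substring(pos + marker.length()).split(separator)[0]);
    }

    public static void main(String[] args) {
        System.out.println(beginIndex("http://freme-project.eu/#char=38,45", "2.0"));
        System.out.println(beginIndex("http://freme-project.eu/#offset_38_45", "2.1"));
        try {
            beginIndexNif20("http://freme-project.eu/#offset_38_45");
        } catch (NumberFormatException e) {
            // Reproduces the message from the log under the assumption stated above.
            System.out.println(e.getMessage());
        }
    }
}
```

The version-aware variant fails early with a descriptive message when the URI scheme and the NIF version do not match, instead of surfacing a NumberFormatException from deep inside the conversion.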

katia-vistatec commented 8 years ago

json.txt

jnehring commented 8 years ago

Thanks for the investigation. This is a tough bug. The pipeline itself has no idea of the NIF version. We can only guess the NIF version by analyzing the NIF content. I chose another solution: I scan all pipeline requests, and if one of the requests contains a parameter "nif-version", I submit this NIF version to e-internationalization. This implementation does not fix the bug yet; I need to debug it again. I will do that on Monday.

But I do not like this solution. Guessing the nif version from the content might be better. @m1ci do you know of an implementation that guesses the nif version that we can reuse here?

m1ci commented 8 years ago

@m1ci do you know of an implementation that guesses the nif version that we can reuse here?

yes, see https://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=all&outformat=turtle&informat=text&input=Diego%20Maradona%20is%20from%20Argentina.&nif-version=2.1

In the RDF you can see

<http://freme-project.eu/#collection>
        a               nif:ContextCollection ;
        nif:hasContext  <http://freme-project.eu/#offset_0_33> ;
        <http://purl.org/dc/terms/conformsTo>
                <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/2.1> .

which says that the collection containing the context http://freme-project.eu/#offset_0_33 conforms to NIF 2.1. This should help. Also, we agreed that the default version is 2.0, and that the nif-version parameter can be used to set the version to 2.0 or 2.1.
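
As a rough illustration of how the version could be read from this annotation, the sketch below scans a Turtle string for the conformsTo triple with a regular expression and falls back to 2.0 otherwise. The class name and the regex approach are assumptions for illustration only. A real implementation would more likely parse the RDF with a proper library, which would also cover prefixed forms such as dcterms:conformsTo that this pattern does not match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NifVersionGuesser {

    static final String DEFAULT_NIF_VERSION = "2.0";

    // Matches the full-URI form of the annotation shown above; whitespace
    // (including line breaks) may separate predicate and object.
    private static final Pattern CONFORMS_TO = Pattern.compile(
        "<http://purl\\.org/dc/terms/conformsTo>\\s*"
        + "<http://persistence\\.uni-leipzig\\.org/nlp2rdf/ontologies/nif-core/"
        + "([0-9]+\\.[0-9]+)>");

    // Returns the annotated version, or the agreed default when none is found.
    static String guess(String turtle) {
        Matcher matcher = CONFORMS_TO.matcher(turtle);
        return matcher.find() ? matcher.group(1) : DEFAULT_NIF_VERSION;
    }

    public static void main(String[] args) {
        String turtle =
            "<http://freme-project.eu/#collection>\n"
            + "        a               nif:ContextCollection ;\n"
            + "        <http://purl.org/dc/terms/conformsTo>\n"
            + "                <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core/2.1> .\n";
        System.out.println(guess(turtle));
    }
}
```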

jnehring commented 8 years ago

I added source code to guess the NIF version. It checks whether the version is annotated in the NIF document. I also added this code to the pipelines module. It still does not work, although the error message has changed. A debug message says that it detects NIF version 2.1, so it hands the right version to internationalizationApi.convertBack(). The new version of pipelines is already merged into master and installed on freme-dev. The problem can be reproduced with the curl request above.

I think that the problem is now within e-internationalization.

The error stack trace is here: stacktrace.txt.

katia-vistatec commented 8 years ago

Hi Jan, debugging locally, I found that the nifConvertedFile-skeleton, which is parsed to obtain the HTML file as a string, contains "#char=" and not "#offset_" as expected. The nifVersion parameter of the InternationalizationAPI method convertToTurtleWithMarkups(InputStream is, String mimeType, String nifVersion) throws ConversionException is null. So it is not set, and the converted NIF files are produced as NIF 2.0.

jnehring commented 8 years ago

Thank you for investigating this. I think we should create a nif-version parameter for pipelines, so that we do not guess the version but set it explicitly. Therefore I created #115.

jnehring commented 8 years ago

I am putting the solution here and closing #115.

We need the nif-version parameter in pipelines as well. It determines the NIF version that is submitted to e-Internationalization at the beginning and at the end of the pipeline. The nif-version parameters of the individual pipeline requests are not influenced by the nif-version parameter of the pipeline. This will be a parameter similar to visibility or persist, which gets its own field in the database. Currently it can take the values 2.0 and 2.1. This requires changes in

jnehring commented 8 years ago

For now we will not fix the bug.

katia-vistatec commented 8 years ago

Ok.

ArneBinder commented 8 years ago

@jnehring if this is implemented and the pipeline model is changed, I think it would be really useful to also put useI18n into the pipeline.

But to fix this bug in general, I don't think that many changes are needed; just three files of the Pipelines service need minor changes. The parameter nif-version has to be added to the endpoints POST /pipelining/chain and POST /pipelining/chain/{id}, which just forward it to PipelineService.chain. In the roundtripping case, this method should default it to 2.0 if necessary (and then possibly put it into every single PipelineRequest), and the methods convertToNif and convertBack have to use it. convertToNif needs a minor modification: it just has to forward the parameter to convertToTurtleWithMarkups and convertToTurtle (by analogy to convertBack), which are called with null at the moment. That way the same NIF version is used for conversion and back conversion, and no guessing is necessary. I don't know whether it should be possible, or whether it makes sense at all, to allow different NIF versions within one single pipeline that does roundtripping, so I put that part in brackets above. Am I missing something?
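
A minimal sketch of this idea, using hypothetical stand-ins for the real classes: the Converter interface and the chain signature below are simplified for illustration and are not the actual PipelineService or InternationalizationAPI signatures. It only shows the one point that matters here, namely that the version is resolved once and the same value is passed to both conversions.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.UnaryOperator;

final class RoundTripSketch {

    static final String DEFAULT_NIF_VERSION = "2.0";

    // Hypothetical stand-in for the e-Internationalization conversions.
    interface Converter {
        String convertToNif(String html, String nifVersion);
        String convertBack(String nif, String nifVersion);
    }

    // Defaults a missing parameter to 2.0 and rejects unsupported values.
    static String resolveNifVersion(String requested) {
        if (requested == null || requested.isEmpty()) {
            return DEFAULT_NIF_VERSION;
        }
        if (!requested.equals("2.0") && !requested.equals("2.1")) {
            throw new IllegalArgumentException("Unsupported nif-version: " + requested);
        }
        return requested;
    }

    // Converts once at the start, runs every step on NIF, converts back once
    // at the end, always with the same resolved version.
    static String chain(String html, String requestedNifVersion,
                        Converter converter, List<UnaryOperator<String>> steps) {
        String nifVersion = resolveNifVersion(requestedNifVersion);
        String nif = converter.convertToNif(html, nifVersion);
        for (UnaryOperator<String> step : steps) {
            nif = step.apply(nif);
        }
        return converter.convertBack(nif, nifVersion);
    }

    public static void main(String[] args) {
        // Toy converter that only records which version it was called with.
        Converter demo = new Converter() {
            public String convertToNif(String html, String v) { return "nif-" + v + "(" + html + ")"; }
            public String convertBack(String nif, String v) { return "html-" + v + "(" + nif + ")"; }
        };
        List<UnaryOperator<String>> noSteps = Collections.emptyList();
        System.out.println(chain("<p>Antwerp</p>", "2.1", demo, noSteps));
    }
}
```

Whether individual pipeline requests should also receive the resolved version is left open here, as in the comment above.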