pipeline breaks with longer html

freme-project / basic-services

Apache License 2.0

pipeline breaks with longer html #93

Open fsasaki opened 8 years ago

fsasaki commented 8 years ago

See http://api.freme-project.eu/current/pipelining/chain/37 and the two attached requests. The only difference between them is the length of the files processed. The longer file is less than 100K - is this still an issue?

request-with-short-html-doc.txt request-with-long-html-doc.txt

jnehring commented 8 years ago

The first pipeline step detects 1052 named entities, the second creates 1052 sparql queries and sends them to dbpedia. This takes a long time. There is a timeout of 60 seconds configured.

I transferred the pipeline to freme-dev and changed the timeouts to 600 seconds in three places on freme-dev:

apache mod_proxy
timeout of the rest controller in application.properties
timeout of requests in pipelines in the source code of PipelineService.java

Now the requests fail after 10 minutes. I am not sure how to deal with this. These timeouts make sense, but on the other hand it should still be possible to process large files.

The problematic service here is e-Link with the slow dbpedia endpoint. A client side solution that I have not explored yet is to download the entities via freme-ner and then send them in smaller batches to e-Link. A server side solution would be to set the timeouts to 1 hour, or to load dbpedia into our own triple store and hope that this improves response times.
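A minimal sketch of the client side idea, assuming the entities are already available as a list and that some callable sends one batch to e-Link. The names `chunked`, `enrich_in_batches`, `enrich_batch` and the batch size of 50 are hypothetical and not part of FREME; the actual HTTP call is left to the caller.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enrich_in_batches(
    entities: Sequence[T],
    enrich_batch: Callable[[Sequence[T]], List[R]],
    batch_size: int = 50,  # assumed value, tune so one batch stays below the timeout
) -> List[R]:
    """Call `enrich_batch` once per batch and concatenate the results.

    `enrich_batch` is a placeholder for whatever sends one batch to e-Link.
    """
    results: List[R] = []
    for batch in chunked(entities, batch_size):
        results.extend(enrich_batch(batch))
    return results
```

With 1052 entities and a batch size of 50 this results in 22 requests, each of which only has to finish within the timeout on its own.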

jnehring commented 8 years ago

In the last developers call @m1ci said he will check whether the implementation of e-Link can be sped up somehow. Possibilities to explore, from what I recall from the discussion:

1) reduce the number of sparql queries, by fetching information about multiple links in one go
2) implement caching / avoid redundant calls about the same link
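A rough sketch of both ideas, not the e-Link implementation. For 1) it uses the SPARQL 1.1 `VALUES` clause to ask about several resources in one query; the `rdfs:label` pattern is only a stand-in, since the real e-Link query templates are not shown in this thread. For 2) it uses a plain in-memory dictionary. `build_values_query` and `CachedLookup` are hypothetical names.

```python
from typing import Callable, Dict, Iterable


def build_values_query(uris: Iterable[str]) -> str:
    """Build one SPARQL 1.1 query covering several resources at once.

    The URIs are inserted as-is, so they must be trusted, well-formed IRIs.
    The rdfs:label triple pattern is an illustrative assumption.
    """
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entity ?label WHERE {\n"
        f"  VALUES ?entity {{ {values} }}\n"
        "  ?entity rdfs:label ?label .\n"
        "}"
    )


class CachedLookup:
    """Remember results per URI so the same link is fetched only once."""

    def __init__(self, fetch: Callable[[str], object]) -> None:
        self._fetch = fetch  # placeholder for the real endpoint call
        self._cache: Dict[str, object] = {}

    def get(self, uri: str) -> object:
        if uri not in self._cache:
            self._cache[uri] = self._fetch(uri)
        return self._cache[uri]
```

Whether the dbpedia endpoint answers one large `VALUES` query faster than many small queries would have to be measured; very long queries may also need to be split into chunks.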

jnehring commented 8 years ago

any update here?

jnehring commented 7 years ago

Pipeline 37 does not exist anymore. But one can reproduce the problem using this curl request.

The problem still occurs.

m1ci commented 7 years ago

Just did an optimization update at e-link to perform enrichment only on unique entities. In other words, if there are multiple occurrences of the same entity, the enrichment will be performed only once. @jnehring can you please test now?
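The idea of enriching each unique entity only once can be sketched as follows. This is an illustration and not the actual e-link code; `enrich_unique` and the `enrich` callable are hypothetical names, and entities are assumed to be identified by their URI string.

```python
from typing import Callable, Dict, List, Sequence, TypeVar

R = TypeVar("R")


def enrich_unique(mentions: Sequence[str], enrich: Callable[[str], R]) -> List[R]:
    """Enrich every distinct URI once and return one result per mention.

    `enrich` stands in for the expensive lookup (e.g. one SPARQL query).
    """
    # dict.fromkeys removes duplicates while keeping first-seen order
    unique = list(dict.fromkeys(mentions))
    enriched: Dict[str, R] = {uri: enrich(uri) for uri in unique}
    # Map the results back so each occurrence gets its enrichment
    return [enriched[uri] for uri in mentions]
```

The number of lookups then depends on the number of distinct entities in the document rather than on the number of mentions.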

jnehring commented 7 years ago

I had issues executing the curl request I posted earlier. Therefore I created pipeline 56 on freme-dev, which executes freme-ner first and then e-Link.

It still fails on the long document:

curl -X POST -H "Content-Type: text/html" -d '@long-document.html' "http://api-dev.freme-project.eu/current/pipelining/chain/56"

@m1ci could you process the long document successfully?

Your update makes sense, even if it cannot process the long document. Since the HTTP requests time out after a while, there has to be a maximum length of text / maximum number of entities that the service can process.
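A back-of-the-envelope sketch of that limit, under the assumption that the queries are sent one after another and that everything else takes negligible time. The earlier test (1052 queries not finished within 600 seconds) then implies an average of more than 600 / 1052 ≈ 0.57 seconds per query. The function names are hypothetical.

```python
import math


def min_seconds_per_query(timeout_s: float, n_queries: int) -> float:
    """Lower bound on the average query time if n_queries did NOT finish in timeout_s."""
    if n_queries < 1:
        raise ValueError("n_queries must be at least 1")
    return timeout_s / n_queries


def max_entities(timeout_s: float, seconds_per_query: float) -> int:
    """Upper bound on sequential queries that fit into timeout_s."""
    if seconds_per_query <= 0:
        raise ValueError("seconds_per_query must be positive")
    return math.floor(timeout_s / seconds_per_query)


# Assumed per-query time of 0.57 s, derived from the failed 600 s run
print(max_entities(60, 0.57))   # 105 with the 60 second timeout
print(max_entities(600, 0.57))  # 1052 with the 600 second timeout
```

Since 0.57 seconds is only a lower bound on the average, the real limits are lower than these numbers, and they apply to unique entities now that duplicates are enriched only once.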