fsasaki opened this issue 8 years ago
The first pipeline step detects 1052 named entities; the second creates 1052 SPARQL queries and sends them to DBpedia. This takes a long time. There is a timeout of 60 seconds configured.
I transferred the pipeline to freme-dev and changed the timeouts to 600 seconds in three places on freme-dev.
Now the requests fail after 10 minutes. I am not sure how to deal with this. These timeouts make sense, but on the other hand it should still be possible to process large files.
The problematic service here is e-Link with the slow DBpedia endpoint. A client-side solution that I have not explored yet is to download the entities via FREME NER and then send them in smaller batches to e-Link (see the sketch below). A server-side solution would be to set the timeouts to 1 hour, or to load the DBpedia dataset into our own triple store and hope that this improves response times.
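A rough sketch of the client-side batching idea, assuming the usual FREME NER and e-Link endpoints; the exact paths, parameters, `templateid` value, and the batch-splitting step are assumptions to be adapted:

```
# Step 1: annotate the document once with FREME NER to obtain the entities.
# Endpoint path and parameters are assumptions based on the FREME API layout.
curl -s -X POST -H "Content-Type: text/html" \
  -d '@long-document.html' \
  "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia" \
  > entities.ttl

# Step 2: split entities.ttl into smaller NIF files (batch-*.ttl; the
# splitting tool is left out here) and enrich each batch separately, so
# that no single e-Link request has to fire 1052 SPARQL queries.
for batch in batch-*.ttl; do
  curl -s -X POST -H "Content-Type: text/turtle" \
    --data-binary "@$batch" \
    "http://api-dev.freme-project.eu/current/e-link/documents?templateid=3" \
    >> enriched.ttl
done
```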
In the last developers call @m1ci said he would check whether the implementation of e-Link can be sped up somehow. Possibilities to explore, from what I recall of the discussion:
1) reduce the number of SPARQL queries by fetching information about multiple links in one go (see the sketch after this list)
2) implement caching / avoid redundant calls for the same link
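To illustrate option 1): a single SPARQL query with a VALUES clause can fetch data for several linked resources in one request. The resources and properties below are only examples, not what e-Link actually requests:

```
# One query for three entities instead of three separate queries.
curl -s "http://dbpedia.org/sparql" \
  -H "Accept: application/sparql-results+json" \
  --data-urlencode 'query=
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label WHERE {
      VALUES ?entity { dbr:Berlin dbr:Munich dbr:Hamburg }
      ?entity rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }'
```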
any update here?
Pipeline 37 does not exist anymore, but one can reproduce the problem using this curl request.
The problem still occurs.
I just did an optimization update to e-Link so that enrichment is performed only on unique entities. In other words, if there are multiple occurrences of the same entity, the enrichment is performed only once. @jnehring can you please test now?
I had issues executing the curl request I posted earlier, so I created pipeline id 56 on freme-dev, which executes FREME NER first and then e-Link (a sketch of such a pipeline definition is below).
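For reference, pipeline 56 could have been defined roughly like this; the template endpoint and the shape of the request body are assumptions about the pipelining API, not the actual definition:

```
curl -X POST -H "Content-Type: application/json" \
  "http://api-dev.freme-project.eu/current/pipelining/templates" \
  -d '[
    {
      "method": "POST",
      "endpoint": "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents",
      "parameters": {"language": "en", "dataset": "dbpedia"}
    },
    {
      "method": "POST",
      "endpoint": "http://api-dev.freme-project.eu/current/e-link/documents",
      "parameters": {"templateid": "3"}
    }
  ]'
```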
It still fails on the long document:

```
curl -X POST -H "Content-Type: text/html" -d '@long-document.html' "http://api-dev.freme-project.eu/current/pipelining/chain/56"
```
@m1ci could you process the long document successfully?
Your update makes sense, even if it cannot process the long document. Since the HTTP requests time out after a while, there has to be a maximum length of text / maximum number of entities that the service can process.
See http://api.freme-project.eu/current/pipelining/chain/37 and the two attached requests. The only difference between them is the length of the files processed. The longer file is less than 100K; is this still an issue?
request-with-short-html-doc.txt request-with-long-html-doc.txt