buda-base / lds-pdi

http://purl.bdrc.io BDRC Linked Data Server
Apache License 2.0

error searching in Etexts #123

Closed. berger-n closed this issue 5 years ago

berger-n commented 5 years ago

http://purl.bdrc.io/lib/chunksFacetGraph?LG_NAME=bo-x-ewts&I_LIM=500&L_NAME=%22%27od%20zer%22

returns error 500

MarcAgate commented 5 years ago

Precisely, the following templates are timing out: _Res_allTypes, ResoneType, chunksFacetGraph, chunksByEtextGraph, etextFacetGraph. All of these use jena:text, and all templates that do not use jena:text are responding. Only two templates using jena:text are responding: _Etexts_contents, Etextscount.

@xristy if the issue turns out to come from jena:text, is it possible that we have some outdated indexes (on some properties)?

eroux commented 5 years ago

I think if jena:text was broken no query would work at all...

berger-n commented 5 years ago

http://purl.bdrc.io/lib/rootSearchGraph?LG_NAME=bo-x-ewts&I_LIM=500&L_NAME=%22%27od%20zer%22 doesn't work anymore; it returns the same error.

MarcAgate commented 5 years ago

http://purl.bdrc.io/lib/rootSearchGraph?LG_NAME=bo-x-ewts&I_LIM=500&L_NAME=%22%27od%20zer%22 is back up again

MarcAgate commented 5 years ago

@eroux I just tested jena:text in chunksFacetGraph and it works fine. This property path seems to be the one causing the issue:

    ?s ((:workHasItemEtext/:itemHasVolume)/:volumeHasEtext)/:eTextResource ?etext .

This simplified query hangs forever:

CONSTRUCT
  {
    ?chunk rdf:type :EtextChunk .
    ?chunk :eTextHasChunk ?lit .
    ?chunk :seqNum ?seqNum .
    ?chunk :sliceStartChunk ?startChunk .
    ?chunk :sliceEndChunk ?endChunk .
    ?chunk tmp:forEtext ?etext .
    ?chunk tmp:forWork ?s .
    ?chunk tmp:workLabel ?workLabel .
    ?chunk :creatorMainAuthor ?author .
    ?chunk tmp:authorName ?author_name .
    ?chunk tmp:etextAbout ?about .
    ?chunk tmp:etextGenre ?genre .
    ?chunk :eTextTitle ?etextTitle .
    ?chunk :eTextVolumeIndex ?volIndex .
    ?chunk :eTextIsVolume ?isVolume .
  }
WHERE
  { ( ?chunk ?score ?lit )
            text:query        ( :chunkContents "\"'od zer\""@bo-x-ewts "highlight:" ) .
    ?etext  :eTextHasChunk    ?chunk .
    ?chunk  :seqNum           ?seqNum ;
            :sliceStartChunk  ?startChunk ;
            :sliceEndChunk    ?endChunk .

    ?s ((:workHasItemEtext/:itemHasVolume)/:volumeHasEtext)/:eTextResource ?etext .

    ?etext  :eTextTitle  ?etextTitle
    OPTIONAL
      { ?etext  :eTextIsVolume  ?isVolume }
  }
ORDER BY DESC(?score)
LIMIT   500

But when I comment out the line that uses the property path above, it returns normally.

eroux commented 5 years ago

OK... I have to say it's very strange; this should be a straightforward thing... Maybe reversing the path would work better (having ?etext on the left and ?s on the right)? ?s could probably also be renamed to ?work... Also, what are the lines

    ?chunk tmp:workLabel ?workLabel .
    ?chunk :creatorMainAuthor ?author .
    ?chunk tmp:authorName ?author_name .
    ?chunk tmp:etextAbout ?about .
    ?chunk tmp:etextGenre ?genre .
    ?chunk :eTextVolumeIndex ?volIndex .

supposed to do?
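
For reference, the reversed path suggested above could be sketched roughly as follows. This is only an illustration, not a tested fix: the prefix IRIs are assumptions (the PREFIX declarations are not shown in this thread), ?s is renamed to ?work, and whether this ordering actually helps the optimizer is not established here.

```sparql
# Assumed prefix IRIs; the queries in this thread omit their PREFIX declarations.
PREFIX :     <http://purl.bdrc.io/ontology/core/>
PREFIX text: <http://jena.apache.org/text#>

SELECT ?chunk ?etext ?work
WHERE {
  # Full-text match on chunk contents, as in the simplified query above
  ( ?chunk ?score ?lit ) text:query ( :chunkContents "\"'od zer\""@bo-x-ewts "highlight:" ) .
  ?etext :eTextHasChunk ?chunk .

  # Same four-step path, inverted: start from the already-bound ?etext and walk back to the work
  ?etext ^:eTextResource/^:volumeHasEtext/^:itemHasVolume/^:workHasItemEtext ?work .
}
ORDER BY DESC(?score)
LIMIT 500
```

The inverted sequence matches exactly the same (?work, ?etext) pairs as the forward path; only the direction in which the pattern is written changes.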

eroux commented 5 years ago

Oh, I see, these lines were in the original, sorry.

MarcAgate commented 5 years ago

What bothers me is that

CONSTRUCT
  {   
    ?s tmp:prop ?etext .    
  }
WHERE
  { 
  ?s ((:workHasItemEtext/:itemHasVolume)/:volumeHasEtext)/:eTextResource ?etext .   
  }
LIMIT   500

returns just fine...

MarcAgate commented 5 years ago

Well, chunksFacetGraph does work against the Fuseki endpoint: it just takes forever to return, which is why we get the following in the browser:

HttpException: -1 Unexpected error making the query: java.net.SocketTimeoutException: Read timed out
    org.apache.jena.sparql.engine.http.HttpQuery.rewrap(HttpQuery.java:373)
    org.apache.jena.sparql.engine.http.HttpQuery.execPost(HttpQuery.java:358)

xristy commented 5 years ago

Try putting the ?I_LIM in the text:query:

(?chunk ?score ?lit) text:query ( :chunkContents ?L_NAME ?I_LIM "highlight:") .

This will cause Lucene to stop as soon as it finds ?I_LIM results. With ?I_LIM only in the outer SPARQL, the Lucene text:query works very hard to find 10,000 "ye shes" results (10,000 is the default limit) before yielding to the outer SPARQL, which does take a while: tens of seconds.
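
As a rough sketch with the template parameters filled in (using the values from the failing URL at the top of this issue), the change would look something like this. The prefix IRIs are assumptions, since the PREFIX declarations are not shown in this thread, and the query is trimmed to the text search itself.

```sparql
# Assumed prefix IRIs; the queries in this thread omit their PREFIX declarations.
PREFIX :     <http://purl.bdrc.io/ontology/core/>
PREFIX text: <http://jena.apache.org/text#>

SELECT ?chunk ?score ?lit
WHERE {
  # The limit (500) is passed to text:query itself, so Lucene returns at most 500 hits
  # instead of collecting up to its default limit before the outer LIMIT applies.
  ( ?chunk ?score ?lit ) text:query ( :chunkContents "\"'od zer\""@bo-x-ewts 500 "highlight:" ) .
}
ORDER BY DESC(?score)
LIMIT 500
```

The outer LIMIT is kept because later joins in the full template can multiply or drop rows; the inner limit only bounds how much work the text index does.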

MarcAgate commented 5 years ago

I tried and that speeds up the process but it is still very slow. The request is still timing out. I think we have to redesign that query or this functionality.

xristy commented 5 years ago

The performance was improved by removing the none.opt file from the databases/bdrc/ dataset. Where do we stand now with this query and the issues that were raised?

MarcAgate commented 5 years ago

As far as I can tell, the issue is solved and we can again browse Etexts in the library. @berger-n Would you agree with that?

berger-n commented 5 years ago

yes I do