dasch-swiss / dsp-api

DaSCH Service Platform API
http://admin.dasch.swiss
Apache License 2.0

Implement Standoff Search #630

Closed tobiasschweizer closed 6 years ago

tobiasschweizer commented 6 years ago

Provide standoff search possibilities: search for a text that is marked up in a certain way.

An example in Sparql:

Search for the word "Mesure" that is marked up as italic and happens to be inside a paragraph. The paragraph does not need to be the immediate parent.

PREFIX standoff: <http://www.knora.org/ontology/standoff#>
PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
PREFIX beol: <http://www.knora.org/ontology/beol#>
select ?textValue ?markedup ?string where { 

    BIND("Mesure" AS ?searchVal)

    # use index for query optimisation 
    ?string <http://www.ontotext.com/owlim/lucene#fullTextSearchIndex> ?searchVal .

    ?textValue a knora-base:TextValue .

    ?textValue knora-base:valueHasString ?string .

    ?textValue knora-base:valueHasStandoff ?standoffNode .

    ?standoffNode a standoff:StandoffItalicTag .

    ?standoffNode knora-base:standoffTagHasStart ?start .

    ?standoffNode knora-base:standoffTagHasEnd ?end .

    # https://www.w3.org/TR/xpath-functions/#func-substring 
    # The first character of a string is at position 1, not 0, whereas standoff offsets are 0-based, hence the +1
    BIND(SUBSTR(?string, ?start+1, ?end - ?start) AS ?markedup)

    FILTER REGEX(?markedup, ?searchVal, "i")

    ?standoffNode knora-base:standoffTagHasStartParent* ?standoffParentTag .
    ?standoffParentTag a standoff:StandoffParagraphTag .
} ORDER BY ?textValue ?start
LIMIT 100
tobiasschweizer commented 6 years ago

Basic idea: use the Lucene index to filter out all the text values that do not contain the search term (for optimization). Then select those text values that have an italic standoff tag containing the search term: first get the whole text span marked up as italic, then use a FILTER with a regex to check that it contains the search term. Finally, check that the italic standoff node has some ancestor of type paragraph, using property path syntax.

Performance: property path syntax is slow in our experience, so I expect queries that make use of it to be slow.

Lucene and regex: both have their own syntax. We have to think about which possibilities we would like to offer to the user: Boolean logic, wildcards etc.
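
For illustration, a hedged sketch of what exposing Lucene syntax directly would look like in the index lookup from the query above; the search string "Mesure AND para*" is a made-up example, and it assumes the GraphDB index accepts full Lucene query syntax (boolean operators, wildcards) in the object position:

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT ?string WHERE {
    # boolean logic and a trailing wildcard, written in Lucene query syntax
    ?string luc:fullTextSearchIndex "Mesure AND para*" .
}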

tobiasschweizer commented 6 years ago

@benjamingeer suggests:

On GraphDB, we could add our own inference rule for standoff tags, so we could use inference instead of property path syntax.

Try adding this to KnoraRules.pie, just before the section "Knora-specific consistency checks". Then you'll have to restart GraphDB, then recreate the repository.

Id: standoff_containment
     x  <knora-base:standoffTagHasStartParent>  y    [Constraint x != y, x != z, y != z]
     y  <knora-base:standoffTagHasStartParent>  z
    -------------------------------
     x  <knora-base:standoffTagHasStartParent>  z

Then in your query, instead of this:

?standoffNode knora-base:standoffTagHasStartParent* ?standoffParentTag .

you should be able to write this:

?standoffNode knora-base:standoffTagHasStartParent ?standoffParentTag .

Keep in mind that if you want the immediate parent, you will now have to use http://www.ontotext.com/explicit.
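
For reference, a sketch of how the immediate parent could still be retrieved once the rule is active, assuming GraphDB's pseudo-graph <http://www.ontotext.com/explicit> (which contains only asserted statements) can be addressed with GRAPH:

PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
SELECT ?standoffParentTag WHERE {
    # match only explicitly asserted parent statements, skipping the inferred transitive ones
    GRAPH <http://www.ontotext.com/explicit> {
        ?standoffNode knora-base:standoffTagHasStartParent ?standoffParentTag .
    }
}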

tobiasschweizer commented 6 years ago

We should have a look at how the XML database eXist-db handles searches involving markup and literal text:

benjamingeer commented 6 years ago
?book incunabula:description ?description .
?description standoff:hasStandoff ?para .
?para a standoff:StandoffParagraphTag .

To match part of a text value, it looks like we might be able to implement custom functions using the RDF4J SPARQL parser:

http://docs.rdf4j.org/custom-sparql-functions/

Then in KnarQL, we could write filters like this:

FILTER(?para knora-api:contains("Zeitglöcklein"))

Otherwise, we could use statements instead of filters:

?para knora-api:contains "Zeitglöcklein" .
tobiasschweizer commented 6 years ago

Maybe we have to provide a custom implementation of the Lucene indexer (for GraphDB: org.apache.lucene.analysis.Analyzer, com.ontotext.trree.plugin.lucene.AnalyzerFactory, http://graphdb.ontotext.com/documentation/standard/full-text-search.html#creating-an-index).

mattssp commented 6 years ago

Or rather a Tokenizer. Another problem is that, depending on the type of markup, the sequence of the plain text may not be the relevant one for tokenization. There may be parts of the text that are comments, or there may be constructs like deletions: Zeitglöck<del>chen</del><add>lein</add>. This should be tokenized as "Zeitglöckchen" and "Zeitglöcklein" (yes, I would want to find the deleted word, too), not "Zeitglöckchenlein". This can only be done if the Tokenizer/Analyzer understands the markup. The Analyzer would have to consider the ontology of the standoff tags to be able to do this.

benjamingeer commented 6 years ago

@mattssp But if the plain text then contains Zeitglöckchen Zeitglöcklein (both variants in sequence), you can't search for the phrase Zeitglöcklein des Lebens.

Another way would be to separate different variants into different resources.

The first resource could represent the diplomatic transcription (with Zeitglöckchenlein, and the markup showing the addition and deletion).

Then you could have different resources for different variants, e.g. one would contain Zeitglöckchen, and another would contain Zeitglöcklein.

That would make all the variants searchable, without the need to customise the full-text search engine (which we can perhaps do with GraphDB using Lucene, but perhaps not with other triplestores).

mattssp commented 6 years ago

@benjamingeer that would introduce a lot of redundancy. Still, I see your point. Perhaps there could be a way to mark up search terms for complex sequences via an additional standoff markup layer. These could be indexed easily.

mattssp commented 6 years ago

Some sort of preprocessor, parametrized by the mapping, could create this additional layer when the resource is created.

benjamingeer commented 6 years ago

If I understand your idea correctly, I think the problem is that, in general, it's not possible to predict which sequences of words people will want to search for.

I agree that it is best to avoid redundancy when possible. But on the other hand, I think that it's often not possible to find a single data representation that will meet all needs. For example, people who do quantitative analyses often need something like a spreadsheet that can be fed into statistical software like R. Here the only solution is to generate such a spreadsheet for the purpose of running the analysis. One of our goals in API v2 is to facilitate such scenarios.

Similarly, I doubt that there is a single representation of text with markup that will satisfy everyone. I suspect that in some cases, it will always be necessary to convert text from one form to another before analysing it, e.g. to extract an edited text from a transcription.

Also, we have to consider trade-offs between storage and performance. Eliminating redundancy reduces storage requirements. But storage is cheap, and often it's not easy to get acceptable performance in complex RDF searches. It can be worth using more storage to make searches perform better.

And given our limited resources, we have to consider the development effort that would be necessary to produce a more complex implementation. If we store the actual text that we want to search (e.g. the edited text), then we can use Lucene (and other similar products) to search it, without any additional development effort.

Therefore I'm inclined to think that it's worth storing edited text separately from transcriptions.

tobiasschweizer commented 6 years ago

I am getting back to standoff, finally :-)

Consider the following query (in contrast to the one above https://github.com/dhlab-basel/Knora/issues/630#issue-264502893):

PREFIX standoff: <http://www.knora.org/ontology/standoff#>
PREFIX knora-base: <http://www.knora.org/ontology/knora-base#>
PREFIX beol: <http://www.knora.org/ontology/beol#>
select DISTINCT ?resource ?textValue ?start ?end ?markedup ?markedup2 ?start2 ?end2 ?string where { 

    BIND("Numerum quemcunque esse summam tot quadratorum" AS ?searchVal)

    # use index for query optimisation 
    ?string <http://www.ontotext.com/owlim/lucene#fullTextSearchIndex> ?searchVal .

    ?textValue a knora-base:TextValue .

    ?resource knora-base:hasValue ?textValue .

    ?textValue knora-base:valueHasString ?string .

    ?textValue knora-base:valueHasStandoff ?standoffNode .

    ?standoffNode a standoff:StandoffUnderlineTag .

    ?standoffNode knora-base:standoffTagHasStart ?start .

    ?standoffNode knora-base:standoffTagHasEnd ?end .

    # https://www.w3.org/TR/xpath-functions/#func-substring 
    # The first character of a string is at position 1, not 0, whereas standoff offsets are 0-based, hence the +1
    BIND(SUBSTR(?string, ?start+1, ?end - ?start) AS ?markedup)

    FILTER REGEX(?markedup, ?searchVal, "i")

    ?textValue knora-base:valueHasStandoff ?standoffNode2 .

    ?standoffNode2 a standoff:StandoffParagraphTag .

    ?standoffNode2 knora-base:standoffTagHasStart ?start2 .

    ?standoffNode2 knora-base:standoffTagHasEnd ?end2 .

    # https://www.w3.org/TR/xpath-functions/#func-substring 
    # The first character of a string is at position 1, not 0, whereas standoff offsets are 0-based, hence the +1
    BIND(SUBSTR(?string, ?start2+1, ?end2 - ?start2) AS ?markedup2)

    FILTER REGEX(?markedup2, ?searchVal, "i")
} ORDER BY ?textValue ?start
LIMIT 100

The query searches for a string that is marked up both as underlined and as a paragraph, but it does not say that there is any relation between the underline and the paragraph (e.g., if you think about different standoff layers). The problem, however, is that the two matches might not be related at all if the string occurs several times in the same text value. I think we would have to check that the start and end indexes are related (they are identical, or one range is contained in the other).
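
A sketch of the missing range check, reusing the variable names from the query above (?start/?end belong to the underline tag, ?start2/?end2 to the paragraph tag); it covers both the identical and the contained case:

    # the underline tag's range must coincide with, or be contained in, the paragraph tag's range
    FILTER(?start2 <= ?start && ?end <= ?end2)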

tobiasschweizer commented 6 years ago

57a9b8569b8d7a7ed55147034fac4486b25c96e4 provides the functionality to restrict a full text search to a certain standoff class.

benjamingeer commented 6 years ago

We could make a property standoffTagHasStartAncestor, a base property of standoffTagHasStartParent. We could even make it an owl:TransitiveProperty. In GraphDB, we wouldn't need to use a complete set of OWL inference rules; we could just add the rule for owl:TransitiveProperty from builtin_owl2-rl.pie to KnoraRules.pie:

Id: prp_trp
  p <rdf:type> <owl:TransitiveProperty>
  x p y
  y p z
  -------------------------------
  x p z
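
With that property and rule in place, the property path in the earlier queries could presumably be replaced by a plain statement (a sketch, assuming the proposed property name):

# instead of: ?standoffNode knora-base:standoffTagHasStartParent* ?standoffParentTag .
?standoffNode knora-base:standoffTagHasStartAncestor ?standoffParentTag .
?standoffParentTag a standoff:StandoffParagraphTag .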
benjamingeer commented 6 years ago

@tobiasschweizer Could you write:

  1. A sample Gravsearch query that looks at standoff nodes using the complex schema.
  2. The SPARQL prequery that should result from (1).
tobiasschweizer commented 6 years ago

@benjamingeer Yes, I think I could do that. I think the prequery should contain what we already have for the fulltext search: https://github.com/dhlab-basel/Knora/blob/adeb458b5f0aa3a6f85a12a749b25e13d21bd2c2/webapi/src/main/twirl/queries/sparql/v2/searchFulltextGraphDB.scala.txt#L73-L95

Parts of this code block (the substring handling) would have to be generated automatically, since a Gravsearch query does not contain them.
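
For concreteness, a sketch of the kind of statements that would have to be generated automatically, mirroring the substring handling from the queries earlier in this thread:

# generated by Knora, not written by the Gravsearch user
?standoffNode knora-base:standoffTagHasStart ?start .
?standoffNode knora-base:standoffTagHasEnd ?end .
BIND(SUBSTR(?string, ?start + 1, ?end - ?start) AS ?markedup)
FILTER REGEX(?markedup, ?searchVal, "i")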

benjamingeer commented 6 years ago

To filter on a StandoffDateTag (or a subclass of it), I think we would need to be able to write something like this in Gravsearch:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX knora-api-c: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/simple/v2#>
PREFIX beol-c: <http://0.0.0.0:3333/ontology/0801/beol/v2#>

CONSTRUCT {
    ?letter knora-api:isMainResource true .
} WHERE {
    ?letter a beol:letter .
    ?letter beol-c:hasText ?text .
    ?text knora-api-c:hasStandoff ?date .
    ?date a knora-api-c:StandoffDateTag .
    FILTER(knora-api-c:date(?date) < "JULIAN:1492"^^knora-api:Date)
}

Something like the knora-api-c:date function above would be needed so the FILTER could compare a standoff date tag with a date literal.

benjamingeer commented 6 years ago

But now I realise that it’s actually not correct that a letter (simple schema) could have the property hasText (complex schema). So maybe it would make more sense to write the whole query in the complex schema, and use the simple schema only in FILTERs:

PREFIX knora-api-simple: <http://api.knora.org/ontology/knora-api/simple/v2#>
PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX beol: <http://0.0.0.0:3333/ontology/0801/beol/v2#>

CONSTRUCT {
    ?letter knora-api:isMainResource true .
} WHERE {
    ?letter a beol:letter .
    ?letter beol:hasText ?text .
    ?text knora-api:hasStandoff ?date .
    ?date a knora-api:StandoffDateTag .
    FILTER(knora-api:date(?date) < "JULIAN:1492"^^knora-api-simple:Date)
}
benjamingeer commented 6 years ago

The type checker could make sure that you don’t mix schemas, by checking that there is only one schema used in each statement.

benjamingeer commented 6 years ago

Perhaps the conversion from complex to internal wouldn’t be difficult. We could just forbid the use of dateValueHasYear, dateValueHasMonth, etc.

benjamingeer commented 6 years ago

So with the current design, the question is what should AbstractSparqlTransformer.handleQueryVar do in this case:

PREFIX knora-api: <http://api.knora.org/ontology/knora-api/v2#>
PREFIX anything: <http://www.knora.org/ontology/0001/anything#>

CONSTRUCT {
    ?thing knora-api:isMainResource true .
    ?thing anything:hasInteger ?intVal .
} WHERE {
    ?thing a anything:Thing .
    ?thing anything:hasInteger ?intVal .
    ?intVal knora-api:intValueAsInt ?int .
    FILTER(?int < 3)
}

Here the type of ?int is xsd:integer, so handleQueryVar would assume it refers to an IntValue and add extra statements to make the FILTER work. But in this case that wouldn't make sense; actually there's nothing for handleQueryVar to do.

I think the simplest way to handle this would be just to detect that the complex schema is being used in the query (the parser could set a flag for that), and if so, disable the automatic generation of additional statements in handleQueryVar. We would only need to generate them in the case of the date function I suggested above.
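
For comparison, a sketch of what the generated internal SPARQL for the FILTER above might look like, assuming knora-api:intValueAsInt corresponds to the internal property knora-base:valueHasInteger; since the complex schema already exposes the literal, no additional statements would have to be generated:

?thing anything:hasInteger ?intVal .
# ?int is already bound to a literal via the explicit value property
?intVal knora-base:valueHasInteger ?int .
FILTER(?int < 3)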

benjamingeer commented 6 years ago

After #899, I think what's left for this is: