INL / corpus-frontend

BlackLab Frontend, a feature-rich corpus search interface for BlackLab.

Weird concordances when [s] has subtags #122

Closed JessedeDoes closed 6 years ago

JessedeDoes commented 6 years ago

https://portal.clarin.inl.nl/atocorp/j.de.does@umail.leidenuniv.nl:EindhovenTest3/search/hits?number=20&first=0&patt=%5Bpos%3D%22SPEC.%2Adeel.%2A%22%5D
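
For readers who do not want to decode the URL by hand, the query it carries can be recovered with Python's standard library. This is only a sketch: the helper name `query_params` is made up for illustration, and the final check assumes the regular expression must match the whole `pos` value.

```python
import re
from urllib.parse import parse_qs, urlsplit

URL = (
    "https://portal.clarin.inl.nl/atocorp/"
    "j.de.does@umail.leidenuniv.nl:EindhovenTest3/search/hits"
    "?number=20&first=0&patt=%5Bpos%3D%22SPEC.%2Adeel.%2A%22%5D"
)

def query_params(url: str) -> dict[str, str]:
    """Decode the query string into a flat dict (first value per key)."""
    return {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

params = query_params(URL)
pattern = params["patt"]
print(pattern)  # [pos="SPEC.*deel.*"]

# Pull the regular expression out of the single-token pattern.
pos_regex = re.fullmatch(r'\[pos="(.*)"\]', pattern).group(1)

# Assumption: the regex is matched against the entire pos value.
print(re.fullmatch(pos_regex, "SPEC(deeleigen)") is not None)
print(re.fullmatch(pos_regex, "VG(onder)") is not None)
```

So the search asks for tokens whose `pos` matches `SPEC.*deel.*`, which is why the hits land on the `SPEC(deeleigen)` words inside `[name]`.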

XML:

[s xmlns="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:ivdnt="http://www.ivdnt.org/xslt/namespaces" xmlns:tei="http://www.tei-c.org/ns/1.0"] [w pos="VG(onder)" lemma="aangezien" type="710" xml:id="w.0"]Aangezien[/w] [w pos="LID(bep,stan)" lemma="de" type="370" xml:id="w.1"]de[/w] [name key="wet-administratieve-rechtspraak-overheidsbeschikkingen" resp="namenKlus"] [w pos="SPEC(deeleigen)" xml:id="w.2.part.0" type="010"]Wet[/w] [w pos="SPEC(deeleigen)" xml:id="w.2.part.1" type="010"]Administratieve[/w] [w pos="SPEC(deeleigen)" xml:id="w.2.part.2" type="010"]Rechtspraak[/w] [w pos="SPEC(deeleigen)" xml:id="w.2.part.3" type="010"]Overheidsbeschikkingen[/w] [/name] [w pos="VZ(init)" lemma="op" type="600" xml:id="w.3"]op[/w] [w pos="TW(hoofd,prenom,stan)" lemma="1" type="470" xml:id="w.4"]1[/w] [name key="juli" resp="namenKlus"] [w pos="N(eigen,ev,stan)" xml:id="w.5.part.0" type="010"]juli[/w] [/name] [w pos="ADJ(prenom,basis,met-e)" lemma="a.s." type="103" xml:id="w.6"]a.s.[/w] [w pos="VZ(init)" lemma="in" type="600" xml:id="w.7"]in[/w] [w pos="N(soort,ev,e-nom,stan,x-basis)" lemma="werking" type="000" xml:id="w.8"]werking[/w] [w pos="WW(pv,e-hulp-of-koppel,tgw,3,ev)" lemma="zullen" type="273" xml:id="w.9"]zal[/w] [w pos="WW(inf,e-intrans,vrij)" lemma="treden" type="200" xml:id="w.10"]treden[/w] [pc xml:id="w.11" pos="LET()"],[/pc] [w pos="WW(pv,e-hulp-of-koppel,tgw,3,ev)" lemma="kunnen" type="273" xml:id="w.12"]kan[/w] [w pos="VNW(aanw,det,stan,prenom)" lemma="dit" type="370" xml:id="w.13"]dit[/w] [w pos="N(soort,ev,e-nom,stan,x-basis)" lemma="artikel" type="000" xml:id="w.14"]artikel[/w] [w pos="WW(inf,e-intrans,vrij)" lemma="vervallen" type="200" xml:id="w.15"]vervallen[/w] [pc xml:id="w.16" pos="LET()"].[/pc] [/s]

JessedeDoes commented 6 years ago

Oops: this is because there is no whitespace between the tags in the source file.

JessedeDoes commented 6 years ago

Actually, the word order is confused here. The content of the [name] tag appears after the rest of the sentence. Maybe my indexing specification is to blame?


 default namespace)

# What element starts a new document?
# (the only absolute XPath; the rest is relative)
documentPath: //TEI|//TEI.2

# Annotated, CQL-searchable fields (also called "complex fields").
# We usually have just one, named "contents".
annotatedFields:

  contents:

    # How to display the field in the interface (optional)
    displayName: Contents

    # How to describe the field in the interface (optional)
    description: Contents of the documents.

    # What element (relative to document) contains this field's contents?
    # (if omitted, entire document is used)
    containerPath: .//body

    # What are our word tags? (relative to container)
    wordPath: .//w|.//pc     # (body does not apply to OpenSonar; shown for illustration)

    # Punctuation between word tags (relative to container)
    punctPath: .//text()[not(ancestor::w or ancestor::pc)]   # = "all text nodes (under containerPath) not inside a w or pc element"

    # What annotation can each word have? How do we index them?
    # (annotations are also called "(word) properties" in BlackLab)
    # (valuePaths relative to word path)
    # NOTE: forEachPath is NOT allowed for annotations, because we need to know all annotations before indexing,
    #       and with forEachPath you could run into an unknown new annotation midway through.
    annotations:
    - name: word
      valuePath: .
    - name: lemma
      valuePath: "@lemma"
    - name: pos
      valuePath: "@pos"
    - name: morfcode
      valuePath: "@type"

    # XML tags within the content we'd like to index
    # (relative to container)
    inlineTags:
    - path: .//s
      #call: openSonarSentence  # to call a plugin method for this tag
    - path: .//p
    - path: .//name

# FoLiA's native metadata
metadata:
  containerPath: //listBibl[@type='metadata']
  fields:
  - forEachPath: bibl/interpGrp/interp
    namePath: ../@type                    # interpGrp/@type
    valuePath: .                  # interp/@value
]]>

JessedeDoes commented 6 years ago

Sorry, the included YAML is a mess.

jan-niestadt commented 6 years ago

Strange. I suspect the difference in nesting level of word tags, combined with how vtd-xml returns matches, is to blame. We'll investigate.
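
The suspected effect can be illustrated with a small sketch. It is not a claim about what vtd-xml or BlackLab actually do: it only shows that if word matches were grouped by nesting level instead of being kept in document order, the words inside `[name]` would end up after the rest of the sentence, as reported. The shortened sentence, the dropped namespace, and the helper `words_with_depth` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free version of the sentence from the report.
SENTENCE = (
    "<s>"
    "<w>Aangezien</w> <w>de</w> "
    "<name><w>Wet</w> <w>Administratieve</w></name> "
    "<w>op</w> <pc>.</pc>"
    "</s>"
)
WORD_TAGS = {"w", "pc"}  # mirrors wordPath: .//w|.//pc

def words_with_depth(xml: str) -> list[tuple[int, str]]:
    """Return (nesting depth, text) for every word tag, in document order."""
    found = []

    def walk(element, depth):
        if element.tag in WORD_TAGS:
            found.append((depth, element.text))
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(xml), 0)
    return found

matches = words_with_depth(SENTENCE)

# What the index should see: words in document order.
in_document_order = [text for _, text in matches]

# Hypothetical failure mode: matches grouped by nesting level
# (a stable sort on depth keeps document order within each level).
grouped_by_depth = [text for _, text in sorted(matches, key=lambda m: m[0])]

print(in_document_order)
print(grouped_by_depth)
```

In the first list "Wet Administratieve" sits between "de" and "op"; in the second it trails the full stop, which matches the symptom described above.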

JessedeDoes commented 6 years ago

Looks OK now in "EindhovenTest6": https://portal.clarin.inl.nl/atocorp/j.de.does@umail.leidenuniv.nl:EindhovenTest6/search/hits?number=20&first=0&patt=%5Bpos%3D%22SPEC.%2Adeel.%2A%22%5D