INL / BlackLab

Linguistic search for large annotated text corpora, based on Apache Lucene
http://inl.github.io/BlackLab/
Apache License 2.0

How to deal with punctuation when indexing using sketch-wpl #529

Closed ShiroAkasegawa closed 1 month ago

ShiroAkasegawa commented 1 month ago

I have a question about indexing with sketch-wpl on BlackLab 3.0.1. When I index the following data, the search expression "teacher" </s> returns no hits, but "teacher" "." </s> does return a hit.

How can I make searches ignore punctuation? Please let me know.

<doc id="G001.001.001" level="beginner" sound_id="00001">
<s id="G001.001.001" pattern_id="G001" subpattern_id="G001.001" level="beginner">
The	DT	the
footstep	NN	footstep
was	VBD	be
that	DT	that
of	IN	of
a	DT	a
teacher	NN	teacher
<g/>
.	SENT	.
</s>
</doc>

jan-niestadt commented 1 month ago

Unfortunately, this is currently not easily possible with the sketch-wpl format, as it encodes punctuation as a separate token. You could of course use a query like "teacher" [pos="SENT"]? </s>, but I can understand that's a bit cumbersome.
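
If you run this kind of query often, a small helper can build the pattern for you. The following is only a sketch in Python, not part of BlackLab: the function name `optional_punct_before` is made up for illustration, the `SENT` tag comes from the example data, and `patt` is assumed to be the parameter name your search endpoint expects.

```python
from urllib.parse import urlencode


def optional_punct_before(word, boundary="</s>", punct_pos="SENT"):
    """Build a CQL pattern that matches `word` directly before `boundary`,
    allowing one optional punctuation token in between.

    Hypothetical helper: `punct_pos` must match the POS tag your tagger
    assigns to sentence-final punctuation (SENT in the example data).
    """
    return f'"{word}" [pos="{punct_pos}"]? {boundary}'


pattern = optional_punct_before("teacher")
# URL-encode the pattern so it can be sent as a query parameter
# (assumption: the endpoint takes the pattern in a parameter named "patt").
encoded = urlencode({"patt": pattern})
print(pattern)
print(encoded)
```

The `?` makes the punctuation token optional, so the same pattern matches sentences that end with or without a final period.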

An alternative is to convert your data to an XML format such as tei-p5, where you can choose not to have punctuation as a separate token. A default configuration is included with BlackLab, but I recommend making a copy you can adapt to your needs.
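
As a rough illustration of such a conversion, here is a Python sketch that turns the token lines of one sketch-wpl sentence into a TEI `<s>` element. It is not an official converter. It assumes tab-separated columns in the order word, POS, lemma (as in the example data), treats `<g/>` as "no space before the next token", and uses a hypothetical `PUNCT_POS` set that you would adjust to your tagset.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: POS tags that mark punctuation in your tagset; adjust as needed.
PUNCT_POS = {"SENT", ",", ":", "``", "''", "(", ")"}


def wpl_sentence_to_tei(lines, sent_id):
    """Convert the token lines of one sketch-wpl sentence to a TEI <s> element.

    Words become <w> elements; punctuation is written as plain text
    outside <w>, so it is not indexed as a separate token.
    """
    parts = []
    glue = False  # True right after <g/>: no space before the next token
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue = True
            continue
        if line.startswith("<"):
            continue  # other structural tags are left to the caller
        word, pos, lemma = line.split("\t")
        if pos in PUNCT_POS:
            text = escape(word)
        else:
            text = (f"<w lemma={quoteattr(lemma)} pos={quoteattr(pos)}>"
                    f"{escape(word)}</w>")
        if parts and not glue:
            parts.append(" ")
        parts.append(text)
        glue = False
    return f"<s xml:id={quoteattr(sent_id)}>" + "".join(parts) + "</s>"


sample = [
    "The\tDT\tthe", "footstep\tNN\tfootstep", "was\tVBD\tbe",
    "that\tDT\tthat", "of\tIN\tof", "a\tDT\ta",
    "teacher\tNN\tteacher", "<g/>", ".\tSENT\t.",
]
tei = wpl_sentence_to_tei(sample, "G001.001.001")
print(tei)
```

In the result, the final period follows the closing `</w>` of "teacher" directly, as plain text rather than as a token of its own. Wrapping the sentences in the `<TEI>`, `<text>` and `<body>` elements and carrying over the metadata is left out of this sketch.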

Your example might be encoded as:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="G001.001.001">
  <teiHeader>
    <fileDesc>
      <!-- ...file metadata... -->
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <s xml:id="G001.001.001">
        <w lemma="the" pos="DT">The</w>
        <w lemma="footstep" pos="NN">footstep</w>
        <w lemma="be" pos="VBD">was</w>
        <w lemma="that" pos="DT">that</w>
        <w lemma="of" pos="IN">of</w>
        <w lemma="a" pos="DT">a</w>
        <w lemma="teacher" pos="NN">teacher</w>.
      </s>
    </body>
  </text>
</TEI>

(Full TEI documentation)

As you can see, the . is outside the <w/> tags. It will be indexed in a separate annotation punct, with each token storing the punctuation preceding it (there's a 'dummy token' at the end of the document that stores any punctuation after the last word).
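
To make the "punctuation preceding each token" idea concrete, here is a small Python sketch that mimics it on a TEI sentence. It only illustrates the principle and is not BlackLab's actual indexing code. The function name is made up, and whitespace is simply stripped, which may differ from what BlackLab stores.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tokens_with_preceding_punct(sentence_xml):
    """Return (word, punctuation_before_word) pairs for one TEI sentence.

    A final (None, punct) pair stands in for the 'dummy token' that
    carries any punctuation after the last word.
    Simplification: surrounding whitespace is stripped.
    """
    s = ET.fromstring(sentence_xml)
    tokens = []
    pending = s.text or ""  # text before the first <w>
    for w in s.iter(TEI_NS + "w"):
        tokens.append((w.text, pending.strip()))
        pending = w.tail or ""  # text between this <w> and the next
    tokens.append((None, pending.strip()))
    return tokens


sentence = (
    '<s xmlns="http://www.tei-c.org/ns/1.0">'
    '<w lemma="a" pos="DT">a</w> '
    '<w lemma="teacher" pos="NN">teacher</w>.'
    '</s>'
)
for word, punct in tokens_with_preceding_punct(sentence):
    print(repr(word), repr(punct))
```

Here the period ends up on the trailing dummy entry, not on "teacher", so "teacher" is the last real token of the sentence.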

Hope this helps!

ShiroAkasegawa commented 1 month ago

Thank you for your prompt response.

I would like to use tei-p5 instead of sketch-wpl.

Thank you very much for your kind advice.