Closed ShiroAkasegawa closed 1 month ago
Unfortunately, this is currently not easily possible with the sketch-wpl format, as it encodes punctuation as a separate token. You could of course use a query like "teacher" [pos="SENT"]? </s>, but I can understand that's a bit cumbersome.
An alternative is to convert your data to an XML format such as tei-p5, where you can choose not to have punctuation as a separate token. A default configuration is included with BlackLab, but I recommend making a copy you can adapt to your needs.
Your example might be encoded as:
<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="G001.001.001">
<teiHeader>
<fileDesc>
<!-- ...file metadata... -->
</fileDesc>
</teiHeader>
<text>
<body>
<s xml:id="G001.001.001">
<w lemma="the" pos="DT">The</w>
<w lemma="footstep" pos="NN">footstep</w>
<w lemma="be" pos="VBD">was</w>
<w lemma="that" pos="DT">that</w>
<w lemma="of" pos="IN">of</w>
<w lemma="a" pos="DT">a</w>
<w lemma="teacher" pos="NN">teacher</w>.
</s>
</body>
</text>
</TEI>
As you can see, the . is outside the <w/> tags. It will be indexed in a separate annotation punct, with each token storing the punctuation preceding it (there's a 'dummy token' at the end of the document that stores any punctuation after the last word).
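If it helps, the conversion itself could be scripted. The following is a minimal Python sketch, not something shipped with BlackLab: the function name wpl_sentence_to_tei and the PUNCT_TAGS set are my own, and it assumes tab-separated word/pos/lemma columns, that SENT is the only punctuation tag in your data, and that a <g/> line means "no space before the next token". It only produces the content of one <s> element; wrapping it in the TEI header and body is left out.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: only the SENT tag marks punctuation; extend this set for your tagset.
PUNCT_TAGS = {"SENT"}


def wpl_sentence_to_tei(lines):
    """Turn the token lines of one sentence into the content of a TEI <s>.

    Each token line is expected to be word<TAB>pos<TAB>lemma.
    Punctuation tokens are written as plain text outside <w>.
    """
    parts = []
    glue = False  # True right after a <g/> line: no space before the next token
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line == "<g/>":
            glue = True
            continue
        if line.startswith("<"):
            continue  # skip structural tags such as <doc>, <s>, </s>
        word, pos, lemma = line.split("\t")
        if pos in PUNCT_TAGS:
            piece = escape(word)
        else:
            piece = f"<w lemma={quoteattr(lemma)} pos={quoteattr(pos)}>{escape(word)}</w>"
        if parts and not glue:
            parts.append(" ")
        parts.append(piece)
        glue = False
    return "".join(parts)


sample = ["a\tDT\ta", "teacher\tNN\tteacher", "<g/>", ".\tSENT\t."]
print(wpl_sentence_to_tei(sample))
# prints: <w lemma="a" pos="DT">a</w> <w lemma="teacher" pos="NN">teacher</w>.
```

The full stop ends up directly after the closing </w>, as in the hand-written example above.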
Hope this helps!
Thank you for your prompt response.
I would like to use tei-p5 instead of sketch-wpl.
Thank you very much for your kind advice.
I have a question about indexing with sketch-wpl on BlackLab 3.0.1. When I index the following data, the search expression "teacher" </s> returns no hits, but "teacher" "." </s> does.
How can I ignore punctuation? Please let me know.
<doc id="G001.001.001" level="beginner" sound_id="00001">
<s id="G001.001.001" pattern_id="G001" subpattern_id="G001.001" level="beginner">
The	DT	the
footstep	NN	footstep
was	VBD	be
that	DT	that
of	IN	of
a	DT	a
teacher	NN	teacher
<g/>
.	SENT	.
</s>
</doc>