Exmaralda-Org / exmaralda


Event-based segmentation #470

Open sarkipo opened 5 months ago

sarkipo commented 5 months ago

From May 2023 meeting minutes:

TS: Explore the possibility of event-based segmentation. That would eliminate the need for HIAT-based FSM segmentation and allow more flexible transcription conventions. @TS Note: current transcription conventions vary to some extent between INEL corpora, but the core is documented in https://doi.org/10.14232/wpcl.2020.5 (with a summary of symbols at the end).

sarkipo commented 5 months ago

(I thought there was already an issue on that but haven't found one)

berndmoos commented 5 months ago

First attempt:

This segmentation is done by an XSL stylesheet rather than by an FSM. The general approach is:

To do: take care of segmentation errors. There are two options:

berndmoos commented 5 months ago

It is anything but fast.

It can take up to two hours for gigantic (INEL) transcripts, so it needs to be implemented differently.

berndmoos commented 5 months ago

Please also add : (colon) ; (semicolon) “ (left double quotation mark) ” (right double quotation mark)

Pending decision on hyphens...

sarkipo commented 5 months ago

Also word-external:
« U+00AB (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK)
» U+00BB (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK)
‐ U+2010 (HYPHEN)
‑ U+2011 (NON-BREAKING HYPHEN)

By contrast, the usual - U+002D (HYPHEN-MINUS) should be word-internal. @git-debase-verbose has just seen that

<ts e="T223" id="tx.w222" n="INEL:w" s="T222">Dʼɨllara</ts>
<nts id="nts_tx.e222_2" n="INEL:ip">-</nts>
<ts e="T223" id="tx.w222" n="INEL:w" s="T222">kunnere</ts>

makes Tsakorpus unhappy.
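
To make the word-internal vs. word-external distinction concrete, here is a minimal sketch in Java. It is not the code in EXMARaLDA: the class `WordBoundarySketch` and its `split` method are hypothetical, the character set contains only the characters named in this thread, and every other character (including other punctuation) is simply kept inside the word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of EXMARaLDA.
final class WordBoundarySketch {

    // Word-external characters named in this thread (assumed to be non-exhaustive).
    private static final Set<Integer> WORD_EXTERNAL = Set.of(
            (int) ':', (int) ';',
            0x201C, 0x201D,   // left / right double quotation mark
            0x00AB, 0x00BB,   // left / right double angle quotation mark
            0x2010, 0x2011);  // hyphen, non-breaking hyphen

    // Splits text into words and word-external characters; whitespace only separates.
    static List<String> split(String text) {
        List<String> parts = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean external = WORD_EXTERNAL.contains(cp);
            if (external || Character.isWhitespace(cp)) {
                if (word.length() > 0) {          // close the word collected so far
                    parts.add(word.toString());
                    word.setLength(0);
                }
                if (external) {                   // keep the punctuation as its own unit
                    parts.add(new String(Character.toChars(cp)));
                }
            } else {
                word.appendCodePoint(cp);         // includes U+002D HYPHEN-MINUS
            }
        }
        if (word.length() > 0) {
            parts.add(word.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        // With HYPHEN-MINUS the example above stays a single word ...
        System.out.println(split("D\u02BC\u0268llara-kunnere").size());
        // ... while U+2010 HYPHEN yields word, punctuation, word.
        System.out.println(split("D\u02BC\u0268llara\u2010kunnere").size());
    }
}
```

Under these assumptions the hyphen-minus example would come out as one word instead of two `ts` elements with an `nts` in between.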

berndmoos commented 4 months ago

A different implementation (no XSL) now takes seconds instead of minutes. To do / decide: what will count as a segmentation error? Or should the algorithm accept everything?

berndmoos commented 4 months ago

Example: ((xxx)). Utterance terminators can follow the closing parentheses.

sarkipo commented 4 months ago

Not necessarily utterance terminators, since it can occur in the middle of an utterance. Rather, any punctuation can follow, e.g. ((…)), – or ((…)),. But terminators probably account for some 97% of all cases where something follows. (Also complex ones like ((…))?”).
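
One way to express "any punctuation may follow the closing parentheses, but word characters may not" is a regular expression. The following is only a sketch under my own assumptions: `ParenthesesSketch` and `punctuationAfterClosing` are hypothetical names, "punctuation" is approximated as any run of characters that are neither letters, digits nor whitespace, and only the first ((…)) in the text is inspected.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of EXMARaLDA.
final class ParenthesesSketch {

    // ((...)) followed by a run of non-letter, non-digit, non-space characters (group 1)
    // and, optionally, an offending letter or digit directly after that run (group 2).
    private static final Pattern AFTER_CLOSING = Pattern.compile(
            "\\(\\(.*?\\)\\)([^\\p{L}\\p{N}\\s]*)([\\p{L}\\p{N}])?");

    // Returns the punctuation that directly follows the first ((...)), or "" if there is none.
    // Throws if a letter or digit follows without intervening whitespace.
    static String punctuationAfterClosing(String text) {
        Matcher m = AFTER_CLOSING.matcher(text);
        if (!m.find()) {
            return "";                       // no ((...)) in this text
        }
        if (m.group(2) != null) {
            throw new IllegalArgumentException(
                    "Word character directly after closing parentheses in: " + text);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(punctuationAfterClosing("((cough))."));
        System.out.println(punctuationAfterClosing("((cough)), and then"));
    }
}
```

Whether the last case should be an error at all, or just be accepted, is the open question from above; the sketch merely shows where such a check would sit.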

git-debase-verbose commented 1 month ago

In https://github.com/Exmaralda-Org/exmaralda/blob/master/src/org/exmaralda/partitureditor/jexmaralda/segment/InelEventBasedSegmentation.java:

The exception at line 105 ("Word characters after double closing round parentheses...") is created with null instead of tierID, so when I later call FSMException.getTierID on it, the result is null as well. Could you take a look at it?
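
To illustrate what I mean, here is a small stand-in in Java. It deliberately does not use the real FSMException, whose constructor signature I am not quoting here; `TierAwareError` and `TierAwareErrorDemo` are hypothetical names. The point is only that the tier ID has to be handed over where the exception is created, and that rejecting null at that point would surface the problem immediately rather than later in the caller.

```java
import java.util.Objects;

// Hypothetical stand-in for an exception that carries a tier ID; not the real FSMException.
final class TierAwareError extends Exception {

    private final String tierID;

    TierAwareError(String message, String tierID) {
        super(message);
        // Fail at creation time instead of returning null from getTierID() later.
        this.tierID = Objects.requireNonNull(tierID, "tierID");
    }

    String getTierID() {
        return tierID;
    }
}

final class TierAwareErrorDemo {
    public static void main(String[] args) {
        try {
            // "TIE0" is an invented tier ID for this example.
            throw new TierAwareError("Word characters after closing parentheses", "TIE0");
        } catch (TierAwareError e) {
            // The caller can now report which tier the problem is in.
            System.out.println(e.getTierID() + ": " + e.getMessage());
        }
    }
}
```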