cropped token ids in vert format

matyaskopp commented 1 year ago

Is there any good reason for cropping IDs of tokens in vert format? https://github.com/clarin-eric/ParlaMint/blob/3d9af64d2c19d747397916346f5d643b7926f009/Samples/ParlaMint-UA/ParlaMint-UA_2015-12-08-m1.vert#L11-L13

Ideally, it should be encoded this way:

<s id="ParlaMint-UA_2015-12-08-m1.u1.p1.s1">
Шановні Шановні шановний    ADJ Case=Nom Number=Plur    ParlaMint-UA_2015-12-08-m1.u1.p1.s1.w1  amod    колега  NOUN    Animacy=Anim Case=Nom Gender=Masc Number=Plur   ParlaMint-UA_2015-12-08-m1.u1.p1.s1.w2
колеги  колеги  колега  NOUN    Animacy=Anim Case=Nom Gender=Masc Number=Plur   ParlaMint-UA_2015-12-08-m1.u1.p1.s1.w2  root    -   -   -   -

I came across this when discussing conversion to TEITOK format with @maartenpt.

TomazErjavec commented 1 year ago

Is there any good reason for cropping IDs of tokens in vert format?

Yes: with (very) large corpora, NoSkE dies while compiling a corpus if every word has a complete ID. The concordancer is not really meant to handle a different value for every token in the corpus. I am not sure whether ParlaMint-en (and -xx) are over the limit, but they could be, with over 1 billion tokens.

So far, the only reason we had token IDs at all was to enable some weird CQP queries for syntax that take advantage of them, and for those it is enough for the IDs to be unique within a sentence.

Is any compromise possible for TEITOK, e.g. to have them unique within a document? Or I could try leaving them alone at least for the individual corpora, and crop them only for -en and -xx?

Another way would be to use a slightly more complicated query to get the token: instead of e.g. [id="ParlaMint-UA_2023-02-07-m0.u287.p1.s1.w1"] you would have something like [id="w1"] within <s id="ParlaMint-UA_2023-02-07-m0.u287.p1.s1"/>
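To make that query rewrite concrete, here is a minimal sketch in Python. It assumes that a full token ID is always the sentence ID plus a dot plus a sentence-local part such as `w1`, with no dot inside the local part; the function names are hypothetical and not part of any ParlaMint script.

```python
def split_token_id(token_id: str) -> tuple[str, str]:
    """Split a full token ID into (sentence ID, sentence-local ID).

    Assumption: the local part is everything after the last dot.
    """
    sentence_id, sep, local_id = token_id.rpartition(".")
    if not sep or not sentence_id or not local_id:
        raise ValueError(f"not a full token ID: {token_id!r}")
    return sentence_id, local_id


def cql_for_token(token_id: str) -> str:
    """Build the query form suggested above from a full token ID."""
    sentence_id, local_id = split_token_id(token_id)
    return f'[id="{local_id}"] within <s id="{sentence_id}"/>'


if __name__ == "__main__":
    print(cql_for_token("ParlaMint-UA_2023-02-07-m0.u287.p1.s1.w1"))
```

An application that stores full IDs could apply this rewrite just before sending the query, so the cropped IDs in the vert file would stay invisible to its users.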

matyaskopp commented 1 year ago

Is any compromise possible for TEITOK, e.g. to have them unique within a document?

We discussed it, and for TEITOK purposes we still need to use both versions: vert (for metadata, so that it is encoded and named the same way as in NoSKE) and TEI.ana. So the IDs cause no trouble for us; I was just curious why this happened.

My idea was to store the whole IDs in order to allow linking between applications, but since the original IDs can be reconstructed, it does not need to be an issue.
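For reference, the reconstruction could look roughly like the sketch below. It assumes tab-separated vert token lines, structural tags on tab-free lines, and cropped IDs that only lack the sentence prefix; which columns hold IDs is passed in as a parameter, because the column layout is an assumption here and `restore_ids` is a hypothetical helper.

```python
import re

# Matches an opening sentence tag and captures its id attribute.
S_OPEN = re.compile(r'<s\s[^>]*\bid="([^"]+)"')


def restore_ids(vert_lines, id_columns):
    """Yield vert lines with sentence-local IDs expanded to full IDs.

    id_columns: zero-based indexes of the columns holding cropped IDs
    (an assumption; adjust to the actual vert layout).
    """
    sentence_id = None
    for line in vert_lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            # Structural line: track the sentence we are currently in.
            match = S_OPEN.match(line)
            if match:
                sentence_id = match.group(1)
            elif line.startswith("</s>"):
                sentence_id = None
            yield line
            continue
        cols = line.split("\t")
        if sentence_id is not None:
            for i in id_columns:
                # "-" marks an empty value (e.g. the head of the root token).
                if i < len(cols) and cols[i] not in ("", "-"):
                    cols[i] = f"{sentence_id}.{cols[i]}"
        yield "\t".join(cols)


if __name__ == "__main__":
    sample = [
        '<s id="ParlaMint-UA_2015-12-08-m1.u1.p1.s1">',
        "Шановні\tшановний\tADJ\tw1\tw2",
        "колеги\tколега\tNOUN\tw2\t-",
        "</s>",
    ]
    for out in restore_ids(sample, id_columns=(3, 4)):
        print(out)
```

The sample uses a shortened five-column layout purely for illustration; the real files have more columns.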

TomazErjavec commented 1 year ago

it does not need to be an issue.

OK. So, close?

clarin-eric / ParlaMint

cropped token ids in vert format #772