matyaskopp closed this 1 year ago
Is there any good reason for cropping IDs of tokens in vert format?
Yes, with (very) large corpora, NoSkE dies while compiling a corpus if you have complete IDs for every word. The concordancer is not really meant to have a different value for every token in the corpus. Not sure if ParlaMint-en (and -xx) are over the limit, but they could be, with over 1 billion tokens.
So far the only reason we had IDs for tokens at all was to enable some weird CQP queries for syntax which take advantage of them, but there it is enough to have them unique within sentences.
Is any compromise possible for TEITOK, e.g. to have them unique within a document? Or I could try leaving them alone at least for individual corpora, but crop them for -en and -xx?
One way would also be to have a bit more complicated query to get the token, so, instead of e.g. [id="ParlaMint-UA_2023-02-07-m0.u287.p1.s1.w1"]
you would have something like [id="w1"] within <s id="ParlaMint-UA_2023-02-07-m0.u287.p1.s1"/>
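A minimal sketch of that rewriting, assuming the cropped token ID is simply the last dot-separated component of the full ID and the sentence ID is everything before it (this is inferred from the example above, not confirmed for every corpus). The function names are hypothetical and not part of any ParlaMint or TEITOK tooling:

```python
def split_token_id(full_id):
    """Split a full token ID into (sentence_id, local_id).

    Assumption: the local part is whatever follows the last dot.
    """
    sentence_id, sep, local_id = full_id.rpartition(".")
    if not sep or not sentence_id or not local_id:
        raise ValueError("not a dotted token ID: %r" % full_id)
    return sentence_id, local_id


def join_token_id(sentence_id, local_id):
    """Reconstruct the full token ID from the sentence ID and the cropped ID."""
    return sentence_id + "." + local_id


def cql_for_token(full_id):
    """Build a query of the form shown above for a cropped-ID corpus."""
    sentence_id, local_id = split_token_id(full_id)
    return '[id="%s"] within <s id="%s"/>' % (local_id, sentence_id)


full = "ParlaMint-UA_2023-02-07-m0.u287.p1.s1.w1"
sentence_id, local_id = split_token_id(full)
query = cql_for_token(full)
```

Because the split is lossless under this assumption, `join_token_id` gives back the original ID, so full IDs can still be used for linking between applications.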
Is any compromise possible for TEITOK, e.g. to have them unique within a document?
We discussed it, and for TEITOK purposes we still need to use both versions: vert (for metadata, to have the same metadata as they are encoded/named in NoSkE) and TEI.ana.
So there is no trouble with IDs for us, I was just curious why this happened.
My idea is to store whole IDs in order to allow linking among applications, but as the original IDs are reconstructable, it does not have to be an issue.
it does not have to be an issue.
OK. So, close?
Is there any good reason for cropping IDs of tokens in vert format? https://github.com/clarin-eric/ParlaMint/blob/3d9af64d2c19d747397916346f5d643b7926f009/Samples/ParlaMint-UA/ParlaMint-UA_2015-12-08-m1.vert#L11-L13
should be ideally encoded this way:
I came to this when discussing conversion to TEITOK format with @maartenpt.