Closed TomazErjavec closed 10 months ago
I am in favour of keeping the separators the same, because it is starting to get quite complicated:
- , for separating tags
- / inside tags for searching

So I am suggesting three columns (word candidate):
- G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+
- G1.2/S2
- Politics|People (not sure about separator and column name)

In the meantime I did improve the code for converting USAS to vertical a bit, and this is the current snippet from the registry that gives the names of the attributes and the multi-value separators:
ATTRIBUTE usas_tags {
TYPE "MD_MGD"
LABEL "USAS tags"
MULTIVALUE yes
MULTISEP ","
}
ATTRIBUTE usas_cats {
TYPE "MD_MGD"
LABEL "USAS categories"
MULTIVALUE yes
MULTISEP " "
}
ATTRIBUTE usas_full {
TYPE "MD_MGD"
LABEL "USAS glosses"
MULTIVALUE yes
MULTISEP "|"
}
Maybe this should be changed (but I'd change it only once, because all the vertical files need to be recompiled), maybe like this:
- use / as multisep for usas_cats as you suggested
- rename usas_full to usas_glosses
Anyway, I am open to suggestions, we can still change this. Btw. to me "usas" seemed better than "sem", because it is more specific. And if people don't know usas is semantics, then they probably won't be able to use these tags anyway. Still, not sure here either, what do you think?
A test corpus with only 3 corpora is available for testing at https://www.clarin.si/ske-beta/#dashboard?corpname=parlamint40_xx_en
One thing that doesn't work, and it is a big shame, is keywords over usas_glosses. I made a covid subcorpus and computed keywords against the complete corpus over usas_cats, which works fine, and over usas_glosses, which returns no results. But the two attributes are isomorphic, i.e. 1 usas_cat corresponds to 1 usas_full. I have no idea why it doesn't work, maybe I need to write to Lexical Computing...
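To make the separator conventions concrete, here is a minimal Python sketch of how one token's three multi-value attributes could be split using the MULTISEP values from the registry snippet above. The function names and the sample values are assumptions for illustration (in particular, the space-separated `usas_cats` value follows the current registry setting, not the "/" proposal); this is not code from the repository. The second function also checks the one-to-one pairing between categories and glosses described above.

```python
# Separators copied from the registry snippet in this thread.
# usas_full is the attribute that is proposed to be renamed to usas_glosses.
MULTISEP = {
    "usas_tags": ",",
    "usas_cats": " ",
    "usas_full": "|",
}


def split_multivalue(attr: str, value: str) -> list[str]:
    """Split one multi-value attribute string on its registry separator."""
    return [v for v in value.split(MULTISEP[attr]) if v]


def pair_cats_with_glosses(cats: str, glosses: str) -> list[tuple[str, str]]:
    """Pair each category with its gloss; raise if the counts differ."""
    cat_list = split_multivalue("usas_cats", cats)
    gloss_list = split_multivalue("usas_full", glosses)
    if len(cat_list) != len(gloss_list):
        raise ValueError(
            f"{len(cat_list)} categories but {len(gloss_list)} glosses"
        )
    return list(zip(cat_list, gloss_list))


# Hypothetical attribute values for the token "candidate".
tags = split_multivalue("usas_tags", "G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+")
pairs = pair_cats_with_glosses("G1.2 S2", "Politics|People")
print(tags)
print(pairs)
```

A check like `pair_cats_with_glosses` could be run over a whole vertical file to confirm that the two attributes really are aligned before looking for the cause of the keyword problem elsewhere.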
- use / as multisep for usas_cats as you suggested
Now I have discovered, that using / inside values is not good, because it is default noSketch separator.
usas_tags/usas_cats/usas_glosses
- rename usas_full to usas_glosses
I like usas_glosses it is more understandable for me
Now I have discovered, that using / inside values is not good, because it is default noSketch separator.
Good point. Won't change it.
I like usas_glosses it is more understandable for me
OK, changed in ab73fda. And, sorry, was working directly on main branch, will merge it into devel and switch.
This is now finished. The "/" still conflicts with the noSkE delimiter, as it appears in USAS tags, but I think it could cause even more confusion if it were changed, as this is the conjunctive delimiter in USAS, and changing it would confuse people looking at the USAS specs.
Closing.
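For readers unfamiliar with the two roles of the delimiters discussed above, the following Python sketch splits a `usas_tags` value first on "," (alternative tags) and then on "/" (conjoined parts within one tag). The function names and the regular expression for the category prefix are assumptions made for this illustration; they are not taken from the USAS specification or from the conversion scripts.

```python
import re

# Assumed shape of a USAS category prefix: one capital letter followed by
# dot-separated numbers (e.g. "G1.2", "S2"). Trailing markers are ignored.
CATEGORY = re.compile(r"[A-Z]\d+(?:\.\d+)*")


def parse_usas_tags(value: str) -> list[list[str]]:
    """Split on ',' into candidate tags, then on '/' into conjoined parts."""
    return [tag.split("/") for tag in value.split(",") if tag]


def categories(parts: list[str]) -> list[str]:
    """Keep only the category prefix of each part, dropping markers such as 'mf' or '+'."""
    result = []
    for part in parts:
        match = CATEGORY.match(part)
        if match:
            result.append(match.group(0))
    return result


parsed = parse_usas_tags("G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+")
first_cats = categories(parsed[0])
print(parsed)
print(first_cats)
```

With the example value from this thread, the first tag yields the parts `G1.2` and `S2`, which matches the `G1.2/S2` column suggested earlier.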
The conversion of TEI to vertical files should also implement USAS semantics for tokens and MWEs. Ideally:
e.g.
Currently the code for converting TEI to vertical is just a stub, for <phr> elements: https://github.com/clarin-eric/ParlaMint/blob/301752bbc71521cb485611085a77694b5e561a52/Scripts/parlamint2xmlvert.xsl#L146-L162 and for positional attributes: https://github.com/clarin-eric/ParlaMint/blob/301752bbc71521cb485611085a77694b5e561a52/Scripts/parlamint-lib.xsl#L882-L895

If anybody, esp. @matyaskopp or @perayson, has any opinion on this, I'd be glad to hear it.