Open khoidt opened 6 years ago
The parser works for ETCSRI data, it should be adapted to process CDLI-CoNLL...
Let me know if you would need a correspondence map for CDLI-CoNLL <==> ETCSRI
Yes, thank you! Do you have it at hand?
I don't think so, let me check in the files generated from the work with Lucas first but I can prepare one today or tomorrow !
On .08.2018, at 13:36, khoidt notifications@github.com wrote:
Hi @chiarcos,
I found that the columns in the MTAAC Baseline Parser CoNLL data match neither CoNLL-U nor CDLI-CoNLL. The differences are the following:
CDLI-CoNLL: ID FORM SEGM XPOSTAG HEAD DEPREL MISC
CoNLL-U: ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC
MTAAC Baseline Parser: ID WORD BASE CF EPOS FORM GW LANG MORPH MORPH2 NORM POS SENSE
It would be really great if you could make the script accept one of our conventional CoNLL formats -- or instruct me how to do this.
For reading CDLI-CoNLL, create a copy of parse-demo.sh, and replace line 31 with MTAAC-style column labels for CDLI-CoNLL data.
I don't think the parser uses anything other than WORD, MORPH2 and POS
(to be confirmed, though), so, roughly, this should be
ID WORD MORPH2 POS IGNORE IGNORE IGNORE
(tbc.)
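A minimal sketch of the positional relabelling suggested above, assuming tab-separated CDLI-CoNLL rows and the tentative label order ID WORD MORPH2 POS IGNORE IGNORE IGNORE. The helper name `relabel_cdli_row` and the sample row are hypothetical and not part of the parser; the actual change would be made in the copy of parse-demo.sh.

```python
# Tentative MTAAC-style labels for the seven CDLI-CoNLL columns
# (ID FORM SEGM XPOSTAG HEAD DEPREL MISC), as suggested above (tbc.).
PARSER_LABELS = ["ID", "WORD", "MORPH2", "POS", "IGNORE", "IGNORE", "IGNORE"]


def relabel_cdli_row(line):
    """Map one tab-separated CDLI-CoNLL row to parser-style labels.

    Hypothetical helper: columns labelled IGNORE are dropped.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(PARSER_LABELS):
        raise ValueError(
            f"expected {len(PARSER_LABELS)} columns, got {len(values)}"
        )
    return {
        label: value
        for label, value in zip(PARSER_LABELS, values)
        if label != "IGNORE"
    }


# Placeholder row, not real corpus data.
row = relabel_cdli_row("1\tform\tsegm\txpos\t2\tdeprel\t_")
print(row)
```

Only the first four columns survive the relabelling here, which matches the assumption that the parser reads nothing but WORD, MORPH2 and POS.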
If MTAAC MORPH2 annotations and CDLI SEGM (or other annotations) do not
exactly correspond to each other, add an additional SPARQL script in
parse.sh, line 13 that does the transformation, e.g., using replacements
with regular expressions such as
DELETE { ?a ?prop ?oldx } INSERT { ?a ?prop ?newx } WHERE { ?a ?prop ?oldx. BIND(REPLACE(STR(?oldx), 'N1=', '-') AS ?newx) }
(Instead of ?prop, you should use the properties you need.)
If transformation is needed, make sure to give the input column/property a
name that is not already used by the MTAAC Baseline Parser to avoid
confusion between CDLI input and parser-internal properties.
ETCSRI WORD = MTAAC FORM
ETCSRI POS = MTAAC XPOSTAG In the XPOSTAG field, the POS or named entity tag replaces the stem. It might be easier to work from a list to extract it, but that can also be done with regex.
Named entities list: https://docs.google.com/spreadsheets/d/1Is7MGG0h8h0vfHj9C9mnWOD2utPeuvm1ZeYb1dsaejg/edit#gid=0 (starts at line 56)
ETCSRI MORPH2 = MTAAC XPOSTAG In MTAAC we lose positional data, e.g. N1, NV2, etc. You would have to check in the script to see whether they are used at all; we can probably add fake position tags to make the script run. I think you are in for a complex regex trip...
ETCSRI STEM = MTAAC POS tags
ETCSRI NAME = MTAAC named entity tags
E.g.
ETCSRI MORPH2: V5=MID.V6=3-SG-H.V8=COM.V11=3-SG-H-A.V12=STEM.V14=3-SG-P.V15=SUB
MTAAC XPOSTAG: MID.3-SG-H.COM.3-SG-H-A.V.3-SG-P.SUB
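The example above can be reproduced with a short sketch like the following. It assumes that positional slots always look like N1=, NV2= or V12=, and that the STEM gloss is replaced by the word's POS or named entity tag; the function name `etcsri_morph2_to_xpostag` is hypothetical, and the slight tag differences mentioned in this thread are not handled.

```python
import re

# Positional slot prefixes such as N1=, NV2= or V12= (assumed ETCSRI shape).
SLOT_PREFIX = re.compile(r"^(?:NV|N|V)\d+=")


def etcsri_morph2_to_xpostag(morph2, pos_tag):
    """Drop positional slots and put the POS tag where STEM stood.

    Hypothetical helper for illustration only.
    """
    parts = []
    for part in morph2.split("."):
        gloss = SLOT_PREFIX.sub("", part)  # "V5=MID" -> "MID"
        parts.append(pos_tag if gloss == "STEM" else gloss)
    return ".".join(parts)


morph2 = "V5=MID.V6=3-SG-H.V8=COM.V11=3-SG-H-A.V12=STEM.V14=3-SG-P.V15=SUB"
print(etcsri_morph2_to_xpostag(morph2, "V"))
# prints MID.3-SG-H.COM.3-SG-H-A.V.3-SG-P.SUB
```

Going the other way (MTAAC to ETCSRI) is harder, because the slot numbers cannot be recovered from the glosses alone without extra rules.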
There are very slight differences in the morphology tags; you can see them here: (differences are in Column A)
Bill will add a comment about the non-finite system we have implemented as a requirement since we are not using the positional data.
Okay, about the non-finite system then. Both in ETCSRI and in Zolyomi 2017, Prof. Zolyomi employs a distinct tagging system for non-finite verbal chains. It marks the verbal stem as functioning in a non-finite situation and specifies one of three tenses: a non-finite with a preterite tense specification would be tagged NV2.NV4 at ETCSRI, and STEM.PT in Zolyomi 2017. With the aim of emulating Zolyomi's practices, we developed three tags that serve the same function as the three non-finite tags used by Zolyomi. They are:
NF.V.ABS NF.V.PT NF.V.F
The first part, NF is non-finite. V is the verbal stem. The final component relates to one of three tenses.
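As a rough illustration of the three-part structure just described, the sketch below checks a tag against the three non-finite tags listed above and splits it into its components. The function name `split_nonfinite_tag` and the dictionary keys are hypothetical, chosen only for this example.

```python
# The three non-finite tags listed above.
NONFINITE_TAGS = {"NF.V.ABS", "NF.V.PT", "NF.V.F"}


def split_nonfinite_tag(tag):
    """Split a non-finite tag into marker, stem and tense components.

    Hypothetical helper: any tag outside the three listed above is rejected.
    """
    if tag not in NONFINITE_TAGS:
        raise ValueError(f"not a known non-finite tag: {tag}")
    marker, stem, tense = tag.split(".")
    return {"marker": marker, "stem": stem, "tense": tense}


print(split_nonfinite_tag("NF.V.PT"))
```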
NF.V.PF yes ?
Well.. I much preferred NF.V.PF which we had at first - until discovering that PF was already being used to mark stems with a special maru form hm.
Ha thanks !
@khoidt what is the status of this issue?
Hi @chiarcos,
I found that the columns in the MTAAC Baseline Parser CoNLL data match neither CoNLL-U nor CDLI-CoNLL. The differences are the following:
CDLI-CoNLL: ID FORM SEGM XPOSTAG HEAD DEPREL MISC
CoNLL-U: ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC
MTAAC Baseline Parser: ID WORD BASE CF EPOS FORM GW LANG MORPH MORPH2 NORM POS SENSE
It would be really great if you could make the script accept one of our conventional CoNLL formats -- or instruct me how to do this.