Syntax pre-annotator: non-conventional CoNLL format

khoidt commented 6 years ago

Hi @chiarcos,

I found that the columns in the MTAAC Baseline Parser CoNLL data match neither CoNLL-U nor CDLI-CoNLL. The differences are the following:

CDLI-CoNLL: ID FORM SEGM XPOSTAG HEAD DEPREL MISC
CoNLL-U: ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC
MTAAC Baseline Parser: ID WORD BASE CF EPOS FORM GW LANG MORPH MORPH2 NORM POS SENSE

It would be really great if you could make the script accept one of our conventional CoNLL formats -- or instruct me how to do this.

epageperron commented 6 years ago

The parser works for ETCSRI data; it should be adapted to process CDLI-CoNLL...

epageperron commented 6 years ago

Let me know if you would need a correspondence map for CDLI-CoNLL <==> ETCSRI

khoidt commented 6 years ago

Yes, thank you! Do you have it at hand?

epageperron commented 6 years ago

I don't think so; let me check the files generated from the work with Lucas first, but I can prepare one today or tomorrow!

chiarcos commented 6 years ago

On .08.2018, 13:36, khoidt notifications@github.com wrote:

> Hi @chiarcos,
>
> I found that the columns in the MTAAC Baseline Parser CoNLL data match
> neither CoNLL-U nor CDLI-CoNLL. The differences are the following:
>
> CDLI-CoNLL: ID FORM SEGM XPOSTAG HEAD DEPREL MISC
> CoNLL-U: ID FORM LEMMA UPOSTAG XPOSTAG FEATS HEAD DEPREL DEPS MISC
> MTAAC Baseline Parser: ID WORD BASE CF EPOS FORM GW LANG MORPH MORPH2 NORM POS SENSE
>
> It would be really great if you could make the script accept one of our
> conventional CoNLL formats -- or instruct me how to do this.

For reading CDLI-CoNLL, create a copy of parse-demo.sh and replace line
31 with MTAAC-style column labels for the CDLI-CoNLL data.

I don't think the parser uses anything other than WORD, MORPH2 and POS
(to be confirmed, though), so, roughly, this should be

ID WORD MORPH2 POS IGNORE IGNORE IGNORE

(tbc.)
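To make the effect of these labels concrete, here is a minimal Python sketch of how a CDLI-CoNLL row would be read under the suggested (and still unconfirmed) labelling. It is not part of parse-demo.sh; the function name `label_row`, the assumption that rows are tab-separated, and the placeholder cell values are all illustrative.

```python
# Hypothetical labels for the seven CDLI-CoNLL columns
# (ID FORM SEGM XPOSTAG HEAD DEPREL MISC), following the suggestion above.
CDLI_AS_MTAAC = ["ID", "WORD", "MORPH2", "POS", "IGNORE", "IGNORE", "IGNORE"]


def label_row(line, labels=CDLI_AS_MTAAC):
    """Split one tab-separated row and keep only the columns not marked IGNORE."""
    cells = line.rstrip("\n").split("\t")
    if len(cells) != len(labels):
        raise ValueError(f"expected {len(labels)} columns, got {len(cells)}")
    return {label: cell for label, cell in zip(labels, cells) if label != "IGNORE"}


# Placeholder values only; this is not real corpus data.
row = label_row("1\tform\tsegm\txpostag\t_\t_\t_\n")
print(row)
```

Under this labelling, FORM is read as WORD, SEGM as MORPH2 and XPOSTAG as POS, while HEAD, DEPREL and MISC are dropped.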

If MTAAC MORPH2 annotations and CDLI SEGM (or other annotations) do not
exactly correspond to each other, add an additional SPARQL script in
parse.sh, line 13, that does the transformation, for instance using
regular-expression replacements such as

DELETE { ?a ?prop ?oldx }
INSERT { ?a ?prop ?newx }
WHERE {
  ?a ?prop ?oldx .
  BIND(REPLACE(str(?oldx), 'N1=', '-') AS ?newx)
}

(Instead of ?prop, you should use the properties you need.)

If transformation is needed, make sure to give the input column/property a
name that is not already used by the MTAAC Baseline Parser, to avoid
confusion between CDLI input and parser-internal properties.

epageperron commented 6 years ago

ETCSRI WORD = MTAAC FORM

ETCSRI POS = MTAAC XPOSTAG
In the XPOSTAG field, the POS or named-entity tag replaces the stem. It might be easier to extract it by working from a list, but that can also be done with a regex.

Named entities list (starts at line 56): https://docs.google.com/spreadsheets/d/1Is7MGG0h8h0vfHj9C9mnWOD2utPeuvm1ZeYb1dsaejg/edit#gid=0

ETCSRI MORPH2 = MTAAC XPOSTAG
In MTAAC we lose the positional data, e.g. N1, NV2, etc. You would have to check in the script whether they are used at all; we can probably add fake position tags to make the script run. I think you are in for a complex regex trip...

ETCSRI STEM = MTAAC POS tags
ETCSRI NAME = MTAAC named-entity tags

E.g.
ETCSRI MORPH2: V5=MID.V6=3-SG-H.V8=COM.V11=3-SG-H-A.V12=STEM.V14=3-SG-P.V15=SUB
MTAAC XPOSTAG: MID.3-SG-H.COM.3-SG-H-A.V.3-SG-P.SUB
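The MORPH2 example above can be reproduced with a short Python sketch. It is only an illustration of the correspondence described here, not code from the parser: the function name `morph2_to_xpostag` is hypothetical, and it assumes that slots are separated by "." and that every positional prefix has the shape letters + digits + "=" (as in V5=, N1=, NV2=). It does not handle the slight differences in morphology tags or the named-entity case.

```python
import re

# Assumed shape of a positional prefix: uppercase letters, digits, then "=".
_POSITION_PREFIX = re.compile(r"^[A-Z]+\d+=")


def morph2_to_xpostag(morph2, pos_tag):
    """Drop positional prefixes from an ETCSRI MORPH2 string and put the
    POS (or named-entity) tag in place of the STEM slot."""
    parts = []
    for slot in morph2.split("."):
        value = _POSITION_PREFIX.sub("", slot)
        parts.append(pos_tag if value == "STEM" else value)
    return ".".join(parts)


etcsri = "V5=MID.V6=3-SG-H.V8=COM.V11=3-SG-H-A.V12=STEM.V14=3-SG-P.V15=SUB"
print(morph2_to_xpostag(etcsri, "V"))
# prints MID.3-SG-H.COM.3-SG-H-A.V.3-SG-P.SUB
```

The reverse direction is harder, because the slot numbers cannot be recovered from the MTAAC string alone; that is where fake position tags would come in if the script turns out to need them.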

There are very slight differences in the morphology tags; you can see them here (the differences are in column A):

https://docs.google.com/spreadsheets/d/1y0_y9HDQNwH0VqDCjjYuUpFsugw4GEybu6Pu01I_D9c/edit?usp=drive_web&ouid=106167595571896896527

Bill will add a comment about the non-finite system we have implemented as a requirement since we are not using the positional data.

lukurkurra commented 6 years ago

Okay, about the non-finite system then. Both in ETCSRI and in Zolyomi 2017, Prof. Zolyomi employs a distinct tagging system for non-finite verbal chains, which marks the verbal stem as functioning in a non-finite context and specifies one of three tenses: a non-finite form with a preterite tense specification would be tagged NV2.NV4 at ETCSRI and STEM.PT in Zolyomi 2017. With the aim of emulating Zolyomi's practice, we developed three tags that serve the same function as the three non-finite tags used by Zolyomi. They are:

NF.V.ABS NF.V.PT NF.V.F

The first part, NF is non-finite. V is the verbal stem. The final component relates to one of three tenses.
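As a small illustration of this three-part structure, the following Python sketch splits a non-finite tag into its components and rejects anything outside the three tags listed above. The function name `parse_nonfinite_tag` and the dictionary keys are hypothetical and chosen only for this example.

```python
# The three tense components from the tags listed above.
NONFINITE_TENSES = ("ABS", "PT", "F")


def parse_nonfinite_tag(tag):
    """Return the components of an NF.V.<tense> tag, or None if the tag
    is not one of the three non-finite tags."""
    parts = tag.split(".")
    if len(parts) != 3:
        return None
    marker, stem, tense = parts
    if marker != "NF" or stem != "V" or tense not in NONFINITE_TENSES:
        return None
    return {"finiteness": marker, "stem": stem, "tense": tense}


for tag in ("NF.V.ABS", "NF.V.PT", "NF.V.F"):
    print(tag, parse_nonfinite_tag(tag))
```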

epageperron commented 6 years ago

NF.V.PF, yes?

lukurkurra commented 6 years ago

Well... I much preferred NF.V.PF, which we had at first, until discovering that PF was already being used to mark stems with a special maru form, hm.

epageperron commented 6 years ago

Ha thanks !

epageperron commented 5 years ago

@khoidt what is the status of this issue?

cdli-gh / mtaac_work

Syntax pre-annotator: non-conventional CoNLL format #52