Annotating words with USAS

TomazErjavec commented 2 years ago

The USAS semantic tags will be encoded in a taxonomy (cf. #202), but there remains the question of how to encode these tags (or, rather, references to the IDs of the taxonomy categories) on word tokens. An important complication is that USAS can also tag multi-word expressions (MWEs).

One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.

An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).

In line with this, the encoding (suitably simplified) could be like:

<s xml:id="s1">
 <w xml:id="t1">I</w>
 <w xml:id="t2">therefore</w>
 <w xml:id="t3">very</w>
 <w xml:id="t4">much</w>
 <w xml:id="t5">welcome</w>
 <w xml:id="t6">the</w>
 <w xml:id="t7">Government's</w>
 <w xml:id="t8">intention</w>
 <linkGrp type="USAS-SEM">
   <ptr ana="usas:Z8" target="#t1"/>
   <ptr ana="usas:Z5" target="#t2"/>
   <ptr ana="usas:A13.3" target="#t3 #t4"/>
   <ptr ana="usas:Q2.2" target="#t5"/>
   <ptr ana="usas:Z5" target="#t6"/>
   <ptr ana="usas:G1.1" target="#t7"/>
   <ptr ana="usas:X7p" target="#t8"/>
 </linkGrp>
</s>
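To make the proposal concrete, here is a minimal sketch of how a consumer might read this linkGrp/ptr encoding. It assumes the simplified, namespace-free XML shown above (real ParlaMint files use the TEI namespace), and the function name `read_usas_pointers` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predefined xml: prefix to this namespace URI.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

SAMPLE = """<s xml:id="s1">
 <w xml:id="t3">very</w>
 <w xml:id="t4">much</w>
 <w xml:id="t5">welcome</w>
 <linkGrp type="USAS-SEM">
   <ptr ana="usas:A13.3" target="#t3 #t4"/>
   <ptr ana="usas:Q2.2" target="#t5"/>
 </linkGrp>
</s>"""


def read_usas_pointers(sentence_xml):
    """Return (tag, [word forms]) pairs, one per ptr element (hypothetical helper)."""
    s = ET.fromstring(sentence_xml)
    # Map each token ID to its surface form.
    words = {w.get(XML_NS + "id"): w.text for w in s.iter("w")}
    result = []
    for ptr in s.iter("ptr"):
        # @target holds one or more space-separated "#id" references.
        ids = [ref.lstrip("#") for ref in ptr.get("target").split()]
        result.append((ptr.get("ana"), [words[i] for i in ids]))
    return result


for tag, forms in read_usas_pointers(SAMPLE):
    print(tag, forms)
```

Under this reading, a ptr with several IDREFs is taken to describe one multi-word unit; whether ptr actually carries that meaning is exactly what the encoding has to pin down.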

@matyaskopp, do you see any problems with this suggestion?

matyaskopp commented 2 years ago

One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.

Agree

An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).

using ptr is a nice hack but I have to admit that I don't like it either. Two things annoy me:

ptr/@ana looks like you are annotating the pointer, but you want to annotate the word or MWE
assuming the previous item is OK, does this <ptr ana="usas:A13.3" target="#t3 #t4"/> express that t3 and t4 form one MWE?

I think the best solution for this situation is to introduce <standOff> annotations. I know that you did not want to introduce it before, but we will definitely introduce a kind of stand-off annotations with timelines and audio alignment, so it can probably solve this issue too.

<TEI>
<!-- ... -->
 <standOff type="USAS-SEM">
   <span ana="usas:Z8" target="#t1"/>
   <span ana="usas:Z5" target="#t2"/>
   <span ana="usas:A13.3" target="#t3 #t4"/>
   <span ana="usas:Q2.2" target="#t5"/>
   <span ana="usas:Z5" target="#t6"/>
   <span ana="usas:G1.1" target="#t7"/>
   <span ana="usas:X7p" target="#t8"/>
 </standOff>
</TEI>

TomazErjavec commented 2 years ago

using ptr is a nice hack but I have to admit that I don't like it either.

I would say it's a tweak, not a hack, and I didn't actually say I don't like it - in fact, I do!

ptr/@ana looks like you are annotating the pointer but you want to annotate word or mwe

But then you could say exactly the same for linkGrp/link/@ana, which we use already for syntax.

Does this express that t3 and t4 form one mwe?

Yes, exactly.

I think the best solution for this situation is to introduce standOff

Yikes! This would open a whole new can of worms:

standOff needs its own separate TEI document, with its own teiHeader, so we would introduce a way of encoding linguistic annotations that is completely different from all the rest
If we were to use it for semantic annotations, why not for syntax, and, in fact, all other linguistic annotations. Not saying this is a completely crazy idea, but it would mean redesigning the complete ling. annotation, I don't think we have the energy and time for that.

we will definitely introduce a kind of stand-off annotations with timelines and audio alignment

Why? What is wrong with the Parla-CLARIN recommendation? Except that the description is rather brief... This is the way I encoded speech alignment in GosVL, cf. http://hdl.handle.net/11356/1444 and thought to use the same system in ParlaMint. This is quite similar to what is proposed in the TEI-based ISO 24624:2016, although they do use annotationBlock to wrap elements.

matthewcoole commented 2 years ago

I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged A13.3.

Perhaps a separate <ptr> could be used for the MWE and we add something to the taxonomy for it e.g.

<ptr ana="usas:A13.3" target="#t3"/>
<ptr ana="usas:A13.3" target="#t4"/>
<ptr ana="usas:MWE" target="#t3 #t4"/>

matyaskopp commented 2 years ago

But then you could say exactly the same for linkGrp/link/@ana, which we use already for syntax.

I don't think so. We annotate the link between two entities (not child or parent node) in the annotation of syntactic relation. But in USAS case we want to annotate the node (word or mwe) not the ptr.

Does this express that t3 and t4 form one mwe?

Yes, exactly.

How I understand this <ptr ana="usas:A13.3" target="#t3 #t4"/>:

it represents two pointers (relations with the USAS taxonomy), both labeled with A13.3
it says nothing about relation between t3 and t4

Yikes! This would open a whole new can of worms

standOff needs its separate TEI document, with its teiHeader, so we would introduce a completely new way of encoding linguistic annotations from all the rest

I don't think that standOff needs a separate TEI document. It can be placed at the end of TEI document if it fulfils bold condition:

source: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASOstdf

As a member of model.resource, standOff may occur as a child of TEI (or teiCorpus). If the metadata that describes the standOff is largely the same as the metadata that describes the associated resource (e.g., the transcribed text in text), then the standOff and the encoded associated resource may appear as children of the same TEI element. The example below has a transcription with <placename> elements in the text linked to a list of place elements in the standOff section.

If we were to use it for semantic annotations, why not for syntax, and, in fact, all other linguistic annotations. Not saying this is a completely crazy idea, but it would mean redesigning the complete ling. annotation, I don't think we have the energy and time for that.

for syntax - we are annotating arrow (the relation), not starting and ending nodes
morphology can be implemented this way, but it is usually a single-word expression and if multiword one it does not break XML tree structure. And it is faster for processing.
named entities can break tree structure if there are other mwe (such as hypertext links - I get rid of them in ParlaMint-CZ data)

we will definitely introduce a kind of stand-off annotations with timelines and audio alignment

Why? What is wrong with the ...

sorry, this was a misleading comparison

I am not sure if we can agree now. I hope it will be brighter tomorrow.

matyaskopp commented 2 years ago

I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged A13.3.

@matthewcoole I used your online tagger http://ucrel-api.lancaster.ac.uk/usas/tagger.html, and it made me think that I finally understand the point of MWEs. You are annotating both:

mwe with one semantic tag
words that are part of mwe with different tags

I have tried this sentence: The Manx name of the Isle of Man is Ellan Vannin. And the vertical output is:

0000001 002  -----   -----                    
0000003 010  AT      The                      Z5 
0000003 020  JJ      Manx                     Z99 
0000003 030  NN1     name                     Q2.2 
0000003 040  IO      of                       Z5 
0000003 050  AT      the                      Z5 
0000003 060  NNL1    Isle                     Z2[i1.3.1 W3 
0000003 070  IO      of                       Z2[i1.3.2 Z5 
0000003 080  NP1     Man                      Z2[i1.3.3 
0000003 090  VBZ     is                       A3+ Z5 
0000003 100  NP1     Ellan                    Z1mf[i2.2.1 Z3c[i2.2.1 
0000003 110  NP1     Vannin                   Z1mf[i2.2.2 Z3c[i2.2.2 
0000003 111  .       .

We can decompose this issue into two:

how can mwe be implemented in ParlaMint in general (not only USAS case)? I mean mwe that can break tree structure (not named entities)
how to encode USAS tags

And then, we can annotate words in @ana attribute: <w ana="usas:A3+ usas:Z5">is</w> and mwes with <PSEUDO_MWE_ELEMENT ana="usas:..."/>

matthewcoole commented 2 years ago

I think that makes sense. I should've been clearer. MWEs should have the same tags I think, but the tokens themselves may have other tags, as in the Isle of Man example above. Correct me if I'm wrong @perayson .

TomazErjavec commented 2 years ago

OK, if I try to summarize, also to check if I understand:

single words will get one or more USAS annotations
a USAS annotation can be further decomposed into a hierarchically defined class and optional modifiers
MWEs can also be annotated as a whole, but their individual words will still get USAS annotations.

As for the encoding:

In-line annotation is problematic because MWEs need a wrapper element, and this might conflict with NE annotations
linkGrp is problematic as we would need to use ptr, but its semantics is wrong (in the bright light of a new day I agree with @matyaskopp)
standOff is problematic because, even though it can be in the same TEI as the text, it is not inside the sentence, as is the case for the otherwise similar syntactic annotation, but will appear at the end of the TEI; it also introduces a completely new encoding of linguistic annotations, which is not covered by the Parla-CLARIN recommendations, and not currently used in ParlaMint.

All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations.

matyaskopp commented 2 years ago

All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations.

Agree.

For the record, full sample of The Manx name of the Isle of Man is Ellan Vannin. sentence:

<s>
  <w ana="usas:Z5">The</w>
  <w ana="usas:Z99">Manx</w>
  <w ana="usas:Q2.2">name</w>
  <w ana="usas:Z5">of</w>
  <w ana="usas:Z5">the</w>
  <phr ana="usas:Z2">
    <w ana="usas:W3">Isle</w>
    <w ana="usas:Z5">of</w>
    <w>Man</w>
  </phr>
  <w ana="usas:A3+ usas:Z5">is</w> <!-- pointing to invalid ID, but it is not the point of this example -->
  <phr ana="usas:Z1mf usas:Z3c">
    <w>Ellan</w>
    <w>Vannin</w>
  </phr>
  <pc>.</pc>   
</s>
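For comparison with the stand-off variants, here is a minimal sketch of how the in-line w/phr encoding above could be read back. It assumes namespace-free XML and that phr elements are direct children of s and are not nested; `read_inline_usas` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<s>
  <w ana="usas:Z5">the</w>
  <phr ana="usas:Z2">
    <w ana="usas:W3">Isle</w>
    <w ana="usas:Z5">of</w>
    <w>Man</w>
  </phr>
  <w ana="usas:A3+ usas:Z5">is</w>
</s>"""


def read_inline_usas(sentence_xml):
    """Return (form, word_tags, mwe_tags) for every w element in the sentence."""
    s = ET.fromstring(sentence_xml)
    rows = []
    for child in s:
        if child.tag == "phr":
            # Tags on phr apply to the whole multi-word expression.
            mwe_tags = child.get("ana", "").split()
            for w in child.iter("w"):
                rows.append((w.text, w.get("ana", "").split(), mwe_tags))
        elif child.tag == "w":
            # A word outside any phr has only its own tags.
            rows.append((child.text, child.get("ana", "").split(), []))
    return rows


for row in read_inline_usas(SAMPLE):
    print(row)
```

Each word keeps its own tag list, while the MWE-level tags are inherited from the enclosing phr, which mirrors the two layers seen in the tagger output.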

perayson commented 2 years ago

I just answered something along similar lines in #202. The tagger itself will output all the possible semantic tags for each word and MWE, but following contextual disambiguation, the first tag in the list should be the most likely. So, we could simplify things for ParlaMint and just provide the first choice tag and remove the remainder. That will give us a certain level of accuracy, but reduce recall of course.

TomazErjavec commented 1 year ago

So, it is now settled that USAS annotation:

will not mark up discontinuous MWEs
can mark up continuous MWEs
is composed of a list of tags, ordered by likelihood that the tag is correct, i.e. the first tag is the most likely
the tags are composed of alphanumeric characters and full stop. An inspection of tag sequences also shows usage of +, -, /, %, and space.

I would split this encoding question into two parts, how to encode USAS tags in CoNLL-U (relevant for @perayson ), and how in TEI, which will be generated from CoNLL-U (for @matyaskopp and @TomazErjavec).

In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but, as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap). For this the standard way is to use the IOB encoding, which we already use for NER markup in CoNLL-U. This means that

each token should have a semantic attribute assigned, let's call it SEM
if the token (like punctuation, function words) does not have a semantic tag assigned, then it is Outside, i.e. it is marked up as SEM=O
if the token is the first (often only) one marked up with this tag sequence, it is marked up as Beginning, i.e. SEM=B-<tags>
if the token is the n-th (second, third, ...) one marked up with this tag sequence, it is marked up as Inside, i.e. SEM=I-<tags>
USAS uses space as the sequence delimiter; space is a bit dangerous, so I propose we change it to comma

So, we would have something like (with irrelevant columns skipped and randomly picked USAS tags):

# sent_id = ParlaMint-GB_2015-01-05-commons.seg1.2
# text = What progress her Department has made on implementing exit checks at borders.
1       What            NER=O|SEM=O
2       progress        NER=O|SEM=B-A10-,A12-
3       her             NER=O|SEM=O
4       Department      NER=O|SEM=B-A11.1+
5       has             NER=O|SEM=O
6       made            NER=O|SEM=B-A1.1.1,G2.2-,X9.2+,E3-,N5+,G2.1%,Z5
7       on              NER=O|SEM=O
8       implementing    NER=O|SEM=B-A1.1.1,H1%
9       exit            NER=O|SEM=B-A11.1+,H2
10      checks          NER=O|SEM=I-A11.1+,H2
11      at              NER=O|SEM=O
12      borders         NER=O|SpaceAfter=No|SEM=B-ZZ2
13      .               NER=O|SEM=O
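A minimal sketch of how this IOB-style SEM attribute could be decoded back into semantic spans. It assumes the MISC column uses the usual `|`-separated Key=Value pairs and that an I- token continues the preceding span only if its tag sequence is identical; the helper names are made up for illustration.

```python
def parse_misc(misc):
    """Split a MISC value such as 'NER=O|SEM=B-A1.1.1,H1%' into a dict."""
    if misc == "_":
        return {}
    return dict(item.split("=", 1) for item in misc.split("|"))


def sem_spans(rows):
    """rows: list of (form, misc). Returns a list of (tags, [forms]) spans."""
    spans = []
    open_span = None
    for form, misc in rows:
        sem = parse_misc(misc).get("SEM", "O")
        if sem == "O":
            # Outside: closes any span that was open.
            open_span = None
            continue
        # Split off only the IOB prefix; tags themselves may contain '-'.
        prefix, tag_list = sem.split("-", 1)
        tags = tag_list.split(",")
        if prefix == "I" and open_span is not None and open_span[0] == tags:
            open_span[1].append(form)
        else:
            open_span = (tags, [form])
            spans.append(open_span)
    return spans


ROWS = [
    ("implementing", "NER=O|SEM=B-A1.1.1,H1%"),
    ("exit", "NER=O|SEM=B-A11.1+,H2"),
    ("checks", "NER=O|SEM=I-A11.1+,H2"),
    ("at", "NER=O|SEM=O"),
    ("borders", "NER=O|SpaceAfter=No|SEM=B-ZZ2"),
]
for tags, forms in sem_spans(ROWS):
    print(forms, tags)
```

Note the `split("-", 1)`: since USAS tags can end in `-` (e.g. A10-), only the first hyphen may be treated as the IOB separator.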

@matyaskopp, do you agree?

As for the TEI encoding, I would postpone this discussion until the CoNLL-U format is finalised.

matyaskopp commented 1 year ago

the tags are composed of alphanumeric characters and full stop. An inspection of tag sequences also shows usage of +, -, /, %, and space.

Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,

10     checks                   X2.4/A5.3 A15 Q1.2 T2- M5

can be encoded this way:

10     checks                   SEM=B-X2.4/A5.3,A15,Q1.2,T2-,M5

In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap).

True, if "never overlap" means that the tags in the second and later positions have the same "semantic segmentation" as the first one, i.e. the second and later tags describe the same span (MWE or single token) as the first one.

Otherwise, we have to add B- and I- prefixes to all tags:

9       exit            NER=O|SEM=B-A11.1+,B-H1
10      checks          NER=O|SEM=I-A11.1+,B-H2

matyaskopp commented 1 year ago

Now I see my previous comment and example, it seems that there are nested semantic spans:

0000001 002  -----   -----                    
0000003 010  AT      The                      Z5 
0000003 020  JJ      Manx                     Z99 
0000003 030  NN1     name                     Q2.2 
0000003 040  IO      of                       Z5 
0000003 050  AT      the                      Z5 
0000003 060  NNL1    Isle                     Z2[i1.3.1 W3 
0000003 070  IO      of                       Z2[i1.3.2 Z5 
0000003 080  NP1     Man                      Z2[i1.3.3 
0000003 090  VBZ     is                       A3+ Z5 
0000003 100  NP1     Ellan                    Z1mf[i2.2.1 Z3c[i2.2.1 
0000003 110  NP1     Vannin                   Z1mf[i2.2.2 Z3c[i2.2.2 
0000003 111  .       .

TomazErjavec commented 1 year ago

@matyaskopp it looks like you either haven't seen or read what I wrote, because:

Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,

Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.

Otherwise, we have to add B- and I- prefixes to all tags:

I already proposed adding B and I prefixes to all the tags; we need them to be able to properly encode MWEs (even without nesting). We cannot count on two neighbouring words to always have different tags; also, we already encode NEs with IOB, so it is only sensible to encode semantic annotations in the same way.

it seems that there are nested semantic spans

If this is true (and maybe @perayson can confirm), then we do have a problem. One option would of course be to ignore the inner tags, the same way we do for NER (well, except CZ). If this is for some reason not possible, we have to think again...

matyaskopp commented 1 year ago

@matyaskopp it looks like you either haven't seen or read what I wrote, because:

Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,

Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.

Sorry, I hadn't understood it properly. I imagined a vertical sequence of tags that correspond to multiple tokens. Now I see what you meant: a "tag sequence" is a list of "tag alternatives".

perayson commented 1 year ago

The format with the 10 digits at the start of each line comes from the C version of USAS, and the addition of sequences like [i1.3.2 was intended to show the MWE sequences: [i is the separator, 1 = ID unique within the current file, 3 = this MWE is three words long, and 2 = this is the second element of the MWE sequence. PyMUSAS doesn't produce this format, see https://ucrel.github.io/pymusas/usage/how_to/tag_text instead for output examples. I think the BIO format is a good idea as long as we say that it represents information about the most likely (i.e. first) tag in the list, i.e. if the word is part of an MWE, then we use B for its first word and I for the internal MWE parts, otherwise O. The remainder of the tag list might include other (non-preferred) MWE tags or single word tags (but we ignore the MWE information about the non-preferred tags).

Note that punctuation also gets a Z9 tag in wmatrix6, PyMUSAS itself (at the moment) outputs PUNCT (which I should change). Function words will also get other Z tags. SEM=O-X3.4 will indicate that X3.4 is a single word semantic tag.
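The `[i1.3.2` marker described above can be unpacked mechanically. The following is only a sketch based on that description (ID, length, position); the regular expression and the function name `split_mwe_marker` are assumptions, not part of any USAS tool.

```python
import re

# tag, then "[i", then <MWE id>.<MWE length>.<position within MWE>
MWE_MARK = re.compile(
    r"^(?P<tag>[^\[]+)\[i(?P<mwe_id>\d+)\.(?P<length>\d+)\.(?P<position>\d+)$"
)


def split_mwe_marker(raw):
    """Split e.g. 'Z2[i1.3.2' into the bare tag and its MWE information."""
    m = MWE_MARK.match(raw)
    if m is None:
        # Plain single-word tag such as 'A3+': no MWE information.
        return {"tag": raw, "mwe_id": None, "length": None, "position": None}
    return {
        "tag": m["tag"],
        "mwe_id": int(m["mwe_id"]),
        "length": int(m["length"]),
        "position": int(m["position"]),
    }


print(split_mwe_marker("Z2[i1.3.2"))
print(split_mwe_marker("A3+"))
```

A token at position 1 would then map to B and any later position to I when converting to the BIO scheme.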

TomazErjavec commented 1 year ago

PyMUSAS doesn't produce this format

OK, this is a relief!

see https://ucrel.github.io/pymusas/usage/how_to/tag_text instead for output examples

Is there maybe a document explaining the format? Namely, the only output example I understand there is for English, and that one is rather short. In particular I'm still not completely clear on whether all words inside a MWE have the same tags, or whether they can differ, so that the complete MWE has one tag, but there can be further word-specific tags in the tag list, so that in effect tags can be nested. By what you write below, my guess is that the second is the case, so I continue with that understanding.

I think we then have two options:

1. use IOB for all sem tags, and always keep only the first tag. The fact that something is a MWE or not is distinguished by the I tag on the second, third etc. MWE token. As you write that all tokens get a sem tag, we would not in fact have any O tags
2. use two attributes, say (1) SEMMWE for MWEs, which uses IOB format, it then contains the first (MWE) tag only, if it is inside a MWE (B for first, I for internal), and O otherwise, and (2) a SEM tag for single word tag lists, which does not need to be IOB, but just gives a list of tags for each token (without the MWE tag)

So, let's say we have a sentence "a b c", with b and c being a MWE. a has sem tags 1,2, b has 3, c has 4, while the MWE tag is M, so that b has the list "M,3" and c has "M,4".

So, the first option would give:

a   SEM=B-1
b   SEM=B-M
c   SEM=I-M

and the second would be:

a   SEMMWE=O SEM=1,2
b   SEMMWE=B-M SEM=3
c   SEMMWE=I-M SEM=4

I hope it is more or less clear what I mean....

I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?
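The two options can be spelled out on the toy "a b c" sentence. This is only an illustrative sketch: the token representation (form, word tags, optional MWE tag with 1-based position) is invented here, and the attributes are joined with `|` as in a real MISC column rather than with a space.

```python
# (form, word-level tags, MWE info or None); MWE info = (tag, 1-based position)
TOKENS = [
    ("a", ["1", "2"], None),
    ("b", ["3"], ("M", 1)),
    ("c", ["4"], ("M", 2)),
]


def option1(tokens):
    """One SEM attribute in IOB form, keeping only the first (preferred) tag."""
    lines = []
    for form, tags, mwe in tokens:
        if mwe is None:
            lines.append(f"{form}\tSEM=B-{tags[0]}")
        else:
            tag, pos = mwe
            prefix = "B" if pos == 1 else "I"
            lines.append(f"{form}\tSEM={prefix}-{tag}")
    return lines


def option2(tokens):
    """SEMMWE in IOB form for the MWE tag, plus a plain SEM list per token."""
    lines = []
    for form, tags, mwe in tokens:
        if mwe is None:
            semmwe = "O"
        else:
            tag, pos = mwe
            semmwe = ("B" if pos == 1 else "I") + "-" + tag
        lines.append(f"{form}\tSEMMWE={semmwe}|SEM={','.join(tags)}")
    return lines


print("\n".join(option1(TOKENS)))
print("\n".join(option2(TOKENS)))
```

The first function discards the word-level tags 2, 3 and 4, which is the information loss mentioned above; the second keeps everything at the cost of a second attribute.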

Note that punctuation also gets a Z9 tag in wmatrix6, PyMUSAS itself (at the moment) outputs PUNCT (which I should change).

OK, so I guess this means that all tokens get a sem tag, nice.

matyaskopp commented 1 year ago

I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?

I see another option: 3. use IOB for all sem tags (O will never be used if all tokens are tagged). Add an I- or B- prefix to all tags and separate the tags with commas:

a   SEM=B-1,B-2
b   SEM=B-M,B-3
c   SEM=I-M,B-4

It is a simple encoding, and an MWE will never overlap with a different MWE. You can always get the first option from it: s/,.*$//
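The substitution `s/,.*$//` translates directly into Python. A small sketch, assuming it is applied to the SEM value alone (on a full MISC column it would also cut off anything following the first comma); the function name is hypothetical.

```python
import re


def option3_to_option1(sem_value):
    """Keep only the first prefixed tag, i.e. drop everything from the first comma."""
    return re.sub(r",.*$", "", sem_value)


for value in ["B-1,B-2", "B-M,B-3", "I-M,B-4"]:
    print(value, "->", option3_to_option1(value))
```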

OK, so I guess this means that all tokens get a sem tag, nice.

yes because we don't have syntactic analysis for the translated version (so no syntactic tokens).

TomazErjavec commented 1 year ago

I see another option: use IOB for all sem tags (O will never be used if all tokens are tagged). Add an I- or B- prefix to all tags and separate the tags with commas

Wow, good one! It is a slight perversion over the usual IOB rules but I think it solves all the problems, so I also vote for it!

perayson commented 12 months ago

I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled O (not B). To answer an earlier question, for semantic MWEs, each word has the same semantic field tag. So, for those, I would use B for the first part, and I for the rest. My preferred option is to keep the tags and MWE markers separate (as per the current pymusas output), so here is option 4:

a   SEMMWE=O SEM=1,2
b   SEMMWE=B SEM=M,3
c   SEMMWE=I SEM=M,4

TomazErjavec commented 11 months ago

I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled O (not B).

The idea is that there is no formal distinction between single and multi word expressions, the only difference is that tags for single words will always be B, while for MWEs the second, third etc. word will be I. And, as every token gets a semantic tag, there are no tokens outside some semantic tag, so nothing is marked with O.

As for your suggestion, the use of IOB here is different from what it is otherwise, e.g. in our NER annotation. If you want to split MWE annotations from single word annotations, a more common way would be:

a   SEMMWE=O SEM=1,2
b   SEMMWE=B-M SEM=3
c   SEMMWE=I-M SEM=4

Still, not to overcomplicate: why don't you do it in the way that you feel most comfortable with, and then me and @matyaskopp can, if necessary, modify it to be the most in line with our NER annotation?

perayson commented 11 months ago

Thanks, so I've tweaked the tagging script to produce this format:

# sent_id =     ParlaMint-ES-CT_2022-01-25-2201.1.0.8.1
# source =      Sí; senyor Garriga ?
# text =        Yes, Mr. Garriga?
1       Yes     yes     INTJ    UH      _       0       _       _       ForwardAlignment=1|BackwardAlignment=1|NER=O|SpaceAfter=No|SEMMWE=O|SEM=Z4
2       ,       ,       PUNCT   ,       _       1       _       _       ForwardAlignment=2|BackwardAlignment=2|NER=O|SEMMWE=O|SEM=Z9
3       Mr.     Mr.     PROPN   NNP     Number=Sing     2       _       _       ForwardAlignment=3|BackwardAlignment=3|NER=O|SEMMWE=B|SEM=Z1mf,Z3c
4       Garriga Garriga PROPN   NNP     Number=Sing     3       _       _       ForwardAlignment=4|BackwardAlignment=4|NER=B-PER|SpaceAfter=No|SEMMWE=I|SEM=Z1mf,Z3c
5       ?       ?       PUNCT   .       _       4       _       _       ForwardAlignment=5|BackwardAlignment=5|NER=O|SEMMWE=O|SEM=Z9
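A minimal sketch of how a downstream script might regroup tokens using this SEMMWE/SEM format. It assumes SEMMWE takes only the bare values B, I and O as in the sample, and that every token carries a SEM attribute; the function names are made up for illustration.

```python
def misc_to_dict(misc):
    """Turn 'NER=O|SEMMWE=B|SEM=Z1mf,Z3c' into a dict of attribute values."""
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def mwe_groups(rows):
    """rows: list of (form, misc). Returns a list of ([forms], [tags]) groups."""
    groups = []
    for form, misc in rows:
        feats = misc_to_dict(misc)
        tags = feats["SEM"].split(",")
        if feats.get("SEMMWE") == "I" and groups:
            # Continuation of the MWE opened by the preceding B token.
            groups[-1][0].append(form)
        else:
            # B opens an MWE; O is a single-word unit.
            groups.append(([form], tags))
    return groups


ROWS = [
    ("Yes", "NER=O|SpaceAfter=No|SEMMWE=O|SEM=Z4"),
    (",", "NER=O|SEMMWE=O|SEM=Z9"),
    ("Mr.", "NER=O|SEMMWE=B|SEM=Z1mf,Z3c"),
    ("Garriga", "NER=B-PER|SpaceAfter=No|SEMMWE=I|SEM=Z1mf,Z3c"),
    ("?", "NER=O|SEMMWE=O|SEM=Z9"),
]
for forms, tags in mwe_groups(ROWS):
    print(" ".join(forms), tags)
```

The group takes its tags from the B token; per the discussion above, the words of a semantic MWE are expected to share the same tag list.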

If you approve, then I can start to set up the tagging jobs. By the way, please can you confirm where I should download the final MT CONLLU format data from?

matyaskopp commented 11 months ago

If you approve, then I can start to set up the tagging jobs.

Agree with this format

By the way, please can you confirm where I should download the final MT CONLLU format data from?

You can download it directly from the repository (@TomazErjavec please confirm):

Kuzman, Taja; et al., 2023, Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1810.

Note that there will be changes/updates and newly added corpora. We will send them once they are ready (partners will deliver data and @TajaKuzman will translate it):

[ ] AT (updated)
[ ] CZ (updated)
[ ] ES (newly added)
[ ] HU (updated)
[ ] TR (updated)
[ ] UA (updated)

So please start with the stable corpora.

perayson commented 11 months ago

Thanks, I will make a start on the first one next week, and then I'll have a better estimate of running times

TomazErjavec commented 11 months ago

Yes, http://hdl.handle.net/11356/1810 has the official files for 3.0. But you also have them at https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ Maybe better as a rule take them from there, in case any changed (AT has been already) or are added. Once you have even a few files, could you send a sample, so we can try implementing the conll2tei2vertical pipeline? For sending the files, best would be a http server with tar.gz files, or anything else that wget understands. Also, what about ParlaMint-GB? That one obviously doesn't have a translation, so I guess we take the originals. Which are of course slightly different but nowhere critical I think.

perayson commented 11 months ago

Thanks, I've tagged LV first, and the resulting tgz file is here: https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ I'll keep going with countries not in @matyaskopp's list of six, and taking files from your nl.ijs.si server. If you can check the format, and confirm you're happy for it as input to your conll2tei2vertical pipeline, then I'll parallelise further to speed up. Last night, it looks like approximately 3M words/hour. Re ParlaMint-GB, I could take those, but do you want to harmonise to the same CONLLU format first?

TomazErjavec commented 11 months ago

Thanks, I've downloaded the file and had a peek. At first glance it looks ok, except for a few details:

You have inserted a tab in the metadata fields, e.g.

# sent_id =     ParlaMint-LV_2021-01-07-PT13-2173-U1-P1.1
# source =      Labrīt, godātie deputāti!
# text =        Good morning, honourable Members!

The format should be as in the originals, i.e. exactly one space on each side of the equals sign.
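A minimal sketch of a normaliser for such comment lines, assuming they always have the shape `# key = value` with a single-word key; the regular expression and the name `normalise_comment` are illustrative only.

```python
import re

# "#", a single-word key, "=", and the value, with any whitespace (including tabs) around them
META = re.compile(r"^#\s*(\w+)\s*=\s*(.*)$")


def normalise_comment(line):
    """Rewrite '# key =<tab>value' as '# key = value'; leave other lines untouched."""
    m = META.match(line)
    if m is None:
        return line
    key, value = m.groups()
    return f"# {key} = {value}"


print(normalise_comment("# sent_id =\tParlaMint-LV_2021-01-07-PT13-2173-U1-P1.1"))
print(normalise_comment("# text =\tGood morning, honourable Members!"))
```

Token lines never start with `#`, so they pass through unchanged.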

If possible, I'd prefer the same filenames as in the originals, i.e. without the -pymusas, as the pipeline is set up for the original names.

More later, if anything else turns out, but I think it is all ok, except if @matyaskopp disagrees.

For speed, I hope it works out, the complete corpus is 1.3 billion words. As for GB, no, we won't be harmonising the CoNLL-U, as the way GB has it is the way all original corpora have it. But all the relevant bits for you are the same in both cases, so I hope that won't be a problem.

matyaskopp commented 11 months ago

I have one note: USAS files contain different pos and xtag, not sure if we agreed on that

TomazErjavec commented 11 months ago

USAS files contain different pos and xtag

Well spotted! Indeed, we haven't agreed on that, and at least UPOS tag should be as in the original. Maybe XPOS too.

TomazErjavec commented 11 months ago

Noticed one more thing, extra quotes which should be removed: MT input: # text = What, please, is "for"? USAS output: # text = "What, please, is ""for""?"
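The doubled quotes look like CSV-style quoting, where the value is wrapped in quotes and every inner quote is doubled. Under that assumption (and assuming every value containing a quote was wrapped this way), the original text can be restored as sketched below; `unquote_text_comment` is a hypothetical helper, not part of any existing script.

```python
def unquote_text_comment(line):
    """Undo CSV-style quoting on a '# text = ...' line; other lines are returned as-is."""
    prefix = "# text = "
    if not line.startswith(prefix):
        return line
    value = line[len(prefix):]
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        # Drop the wrapping quotes, then collapse each doubled quote.
        value = value[1:-1].replace('""', '"')
    return prefix + value


print(unquote_text_comment('# text = "What, please, is ""for""?"'))
```

If some sentences that start and end with a quote were written without CSV quoting, this rule would damage them, so it is only safe when the writer quoted consistently.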

matyaskopp commented 11 months ago

Noticed one more thing, extra quotes which should be removed: MT input: # text = What, please, is "for"? USAS output: # text = "What, please, is ""for""?"

Confirming illustration:

perayson commented 11 months ago

Many thanks both for spotting all these, the issues hadn't emerged from the smaller tests I'd done so far. I should be able to fix the tabs and the filenames.

For the UPOS and XPOS tags, these are emerging from spacy's pipeline and influence the semtag decisions, whereas I think you said the originals were coming from Stanza? I'd argue to keep the spacy versions since the stanza ones will already be in the released data, and this is the semtag pipeline. If you insist on the stanza ones, I should be able to swap them back in.

I'll check on the extra quotes, I think that's due to the csv writer I'm using. I'll wait for the slightly larger PT set to run today, do you want to look at those too? And then I'll update the script tomorrow and share revised versions.

matyaskopp commented 11 months ago

For the UPOS and XPOS tags, these are emerging from spacy's pipeline and influence the semtag decisions, whereas I think you said the originals were coming from Stanza? I'd argue to keep the spacy versions since the stanza ones will already be in the released data, and this is the semtag pipeline. If you insist on the stanza ones, I should be able to swap them back in.

Things that should be considered:

The input is already influenced by stanza by tokenization (not sentence segmentation which came from various tools that are used in the original ParlaMint-XX corpus)
SpaCy is already used in ParlaMint-GB - so we will have one tool used for all ParlaMint corpora if we choose SpaCy
Stanza has significantly better performance on '2018 UD Shared Task' data for English https://aclanthology.org/2020.acl-demos.14.pdf
Do Stanza and SpaCy use the same upos and xpos tag sets?

I am slightly in favour of annotating everything with Stanza if it is okay for USAS. (and ask @nljubesi to annotate ParlaMint-GB with Stanza).

it would be cleaner to have everything annotated with one tool.

TomazErjavec commented 11 months ago

Thanks, @matyaskopp for the analysis. I agree, even though I doubt that @nljubesi would have the time to annotate GB.

I'd argue to keep the spacy versions since the stanza ones will already be in the released data, and this is the semtag pipeline. If you insist on the stanza ones, I should be able to swap them back in.

My idea was that the stanza ones with added USAS will be in the released data, as otherwise we cannot include USAS in the release CoNLL-U files, which would be a shame. So, yes, if you could swap them back in that would be great.

In the meantime I enhanced cp-conllu.pl to be able to deal with the extraneous quotes, so you don't actually have to bother with that.

perayson commented 11 months ago

Hi both, I've now fixed the tabs, kept the original filenames, retained the original UPOS and XPOS columns, removed the quoted quotes, and rerun the LV data. The result is here again: https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ for you to check and confirm. Thanks!

TomazErjavec commented 11 months ago

Thanks, checked, looks good to me! Pls. feel free to start with the other corpora.

matyaskopp commented 11 months ago

@perayson, I have spotted one more issue: you have changed the lemma.

Examples of lemma differences (I transformed your data back to the input format with s/|SEMMWE.*$// ):

< ENGLISH
> USAS

< 8 DID did VERB    VBD Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin   7   _   _   ForwardAlignment=9|BackwardAlignment=9|NER=O
> 8 DID do  VERB    VBD Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin   7   _   _   ForwardAlignment=9|BackwardAlignment=9|NER=O
< 15    's  's  PART    POS _   14  _   _   NER=O
> 15    's  be  PART    POS _   14  _   _   NER=O
< 2 'd  would   AUX MD  VerbForm=Fin    1   _   _   ForwardAlignment=2|BackwardAlignment=2|NER=O
> 2 'd  have    AUX MD  VerbForm=Fin    1   _   _   ForwardAlignment=2|BackwardAlignment=2|NER=O
< 30    I   i   NUM CD  NumForm=Roman|NumType=Card  29  _   _   ForwardAlignment=30|BackwardAlignment=30|NER=O
> 30    I   I   NUM CD  NumForm=Roman|NumType=Card  29  _   _   ForwardAlignment=30|BackwardAlignment=30|NER=O
< 19    being   being   NOUN    NN  Number=Sing 18  _   _   ForwardAlignment=10|NER=O|SpaceAfter=No
> 19    being   be  NOUN    NN  Number=Sing 18  _   _   ForwardAlignment=10|NER=O|SpaceAfter=No
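
A comparison like the one above could also be scripted. The following is only a minimal Python sketch, not the procedure actually used in this thread: it assumes both versions are tab-separated CoNLL-U with identical tokenization and token order, and the function name lemma_differences is hypothetical.

```python
def lemma_differences(original_conllu: str, tagged_conllu: str):
    """Return (id, form, original_lemma, tagged_lemma) for tokens whose lemma differs.

    Assumes both inputs are tab-separated CoNLL-U text with the same tokens
    in the same order (an assumption, not something checked here).
    """
    def token_rows(text):
        for line in text.splitlines():
            # Skip blank sentence separators and comment lines.
            if not line.strip() or line.startswith("#"):
                continue
            yield line.split("\t")

    diffs = []
    for orig, tagged in zip(token_rows(original_conllu), token_rows(tagged_conllu)):
        # Column 3 (index 2) is LEMMA in CoNLL-U.
        if orig[2] != tagged[2]:
            diffs.append((orig[0], orig[1], orig[2], tagged[2]))
    return diffs
```

Calling lemma_differences on the text of an input file and of its tagged counterpart returns an empty list when the lemma column was left untouched.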

perayson commented 11 months ago

Thanks for spotting these. It will be spacy producing different lemmas than stanza in these instances. I expect you'd prefer the output lemma to be the same as the input lemma? As with the POS columns, I feel that this renders the output inconsistent in terms of the decisions made for the semantic tagger, but I can understand why you want to keep things consistent with the non semantically tagged version, so I will change it if you prefer.

perayson commented 11 months ago

I've now rerun the LV data and shared it for you to check again. In parallel, I'll start tagging some more files ...

TomazErjavec commented 11 months ago

I expect you'd prefer the output lemma to be the same as the input lemma?

Yes, that would be nice.

As with the POS columns, I feel that this renders the output inconsistent in terms of the decisions made for the semantic tagger.

If you would like to keep the Spacy PoS + lemma, you could put them in the MISC column (where you put the semantic annotations), e.g. SpacyPoS=xxx|SpacyLemma=yyy. However, neither xxx nor yyy should contain a | for obvious reasons, which could be a problem.

I've now rerun the LV data and shared it for you to check again.

Thanks; as before, they look ok to me, but @matyaskopp has the eagle eye!

matyaskopp commented 11 months ago

If you would like to keep the Spacy PoS + lemma, you could put them in the MISC column (where you put the semantic annotations), e.g. SpacyPoS=xxx|SpacyLemma=yyy. However, neither xxx nor yyy should contain a | for obvious reasons, which could be a problem.

I like the idea of adding the Spacy tagging to the MISC column, if it is not a lot of work. You can also add the XPOS, so the fields can be: SpacyUPoS=xxx|SpacyXPoS=yyy|SpacyLemma=zzz. Note that this information will only be in the conllu files; we are not able to propagate it to TEI without adding new TEI features.
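
To illustrate the proposed MISC fields, here is a minimal Python sketch. It is not the tagging script used for the data; the function name add_spacy_to_misc is hypothetical, and rejecting values that contain | is just one possible way of handling the problem mentioned above (escaping or replacing the character would be another).

```python
def add_spacy_to_misc(misc: str, upos: str, xpos: str, lemma: str) -> str:
    """Append SpacyUPoS/SpacyXPoS/SpacyLemma to a CoNLL-U MISC value."""
    for value in (upos, xpos, lemma):
        # '|' separates MISC attributes, so it cannot appear inside a value.
        if "|" in value:
            raise ValueError(f"value contains '|': {value!r}")
    extra = [f"SpacyUPoS={upos}", f"SpacyXPoS={xpos}", f"SpacyLemma={lemma}"]
    # An empty MISC column is written as '_' in CoNLL-U.
    existing = [] if misc in ("", "_") else misc.split("|")
    return "|".join(existing + extra)
```

For example, add_spacy_to_misc("NER=O", "VERB", "VBD", "do") gives NER=O|SpacyUPoS=VERB|SpacyXPoS=VBD|SpacyLemma=do.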

Thanks; as before, they look ok to me, but @matyaskopp has the eagle eye!

The sample seems to be ok to me, thanks!

perayson commented 11 months ago

Thanks, that's a great idea to add these values into MISC. And including this in the conllu files only rather than extending TEI is fine by me of course. I've tweaked the script and rerun it over the LV data one more time and this is now available to check.

matyaskopp commented 11 months ago

@perayson, thanks for the changes. It is ok.

perayson commented 11 months ago

Great, I'm making progress through the files now. Just to check: in one folder of the data you've shared, ParlaMint-NL-en.conllu/2015, there are some zero-length files. Is this a bug?

matyaskopp commented 11 months ago

Great, I'm making progress through the files now. Just to check: in one folder of the data you've shared, ParlaMint-NL-en.conllu/2015, there are some zero-length files. Is this a bug?

No, these files contain only a comment section and no discussion, e.g. this one:

<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" ana="#parla.sitting #reference" xml:lang="nl" xml:id="ParlaMint-NL_2015-09-22-tweedekamer-8">
<!-- SKIPPING HEADER -->
   <text ana="#reference" xml:lang="nl">
      <body>
         <div type="commentSection">
            <head>Beëdiging mevrouw J.G.P. Vermue</head>
            <note type="comment">De voorzitter:</note>
            <note type="comment">Ik geef het woord aan mevrouw Neppérus tot het uitbrengen van verslag namens de commissie voor het onderzoek van de Geloofsbrieven.</note>
            <note type="comment">Mevrouw, voorzitter der commissie:</note>
            <note type="comment">Voorzitter. De commissie voor het onderzoek van de Geloofsbrieven heeft de stukken onderzocht die betrekking hebben op mevrouw J.G.P. Vermue te Schoondijke.</note>
            <note type="comment">De commissie is eenstemmig tot de conclusie gekomen dat mevrouw J.G.P. Vermue te Schoondijke terecht benoemd is verklaard tot lid van de Tweede Kamer der Staten-Generaal.</note>
            <note type="comment">De commissie stelt u daarom voor om haar toe te laten als lid van de Kamer. Daartoe dient zij wel eerst de verklaringen en beloften, zoals die zijn voorgeschreven bij de Wet beëdiging ministers en leden Staten-Generaal van 27 februari 1992, Staatsblad nr. 120, af te leggen.</note>
            <note type="comment">De commissie verzoekt u tot slot om de Kamer voor te stellen, het volledige rapport in de Handelingen op te nemen.</note>
            <note type="comment">De voorzitter:</note>
            <note type="comment">Ik dank namens de Kamer de commissie voor haar verslag en stel voor, dienovereenkomstig te besluiten.</note>
            <note type="comment">Daartoe wordt besloten.</note>
            <note type="comment">Het rapport is opgenomen aan het eind van deze editie.</note>
            <note type="comment">De voorzitter:</note>
            <note type="comment">Ik verzoek de leden en de overige aanwezigen in de zaal en op de publieke tribune, voor zover dat mogelijk is, te gaan staan.</note>
            <note type="comment">Mevrouw Vermue is in het gebouw der Kamer aanwezig om de voorgeschreven verklaringen en beloften af te leggen.</note>
            <note type="comment">Ik verzoek de griffier, haar binnen te leiden.</note>
            <note type="comment">Nadat mevrouw Vermue door de griffier is binnengeleid, legt zij in handen van de voorzitter de bij de wet voorgeschreven verklaringen en beloften af.</note>
            <note type="comment">De voorzitter:</note>
            <note type="comment">Mevrouw Vermue, ik wens u van harte geluk met het lidmaatschap van de Kamer. Ik geef u nu de hand.</note>
            <note type="comment">Omdat we de presentielijst niet meer met de hand invullen, kan ik aan de Kamer melden dat uw naam elektronisch op de presentielijst zal worden verwerkt, want zo doen we het tegenwoordig! Ik verzoek u om in ons midden plaats te nemen. Na de stemmingen — want u kunt gelijk aan het werk met stemmen — is er gelegenheid tot feliciteren van onze nieuwe collega.</note>
            <note type="comment">Applaus</note>
         </div>
      </body>
   </text>
</TEI>
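
Whether a TEI file will yield an empty CoNLL-U file could be checked up front. The following is a minimal Python sketch under the assumption that transcribed speech is encoded in TEI u elements, so that a file without any u has nothing to annotate; the function name has_speech is hypothetical.

```python
import xml.etree.ElementTree as ET

# TEI namespace, as declared on the root element of the file above.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def has_speech(tei_xml) -> bool:
    """Return True if the TEI document contains at least one <u> element.

    Assumption: speeches are encoded as <u>; files holding only
    <note type="comment"> elements then produce empty CoNLL-U output.
    """
    root = ET.fromstring(tei_xml)
    return root.find(f".//{TEI_NS}u") is not None
```

For a file like the one above, which holds only note elements inside a commentSection, has_speech returns False.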

perayson commented 11 months ago

OK, we'll return a zero-length file for these as well, then.

perayson commented 11 months ago

I've been running further files and putting the results in https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ as I go. There was an issue in one of the BA 2006 files which I'll need to investigate further. Earlier in the thread, @matyaskopp @TomazErjavec, you mentioned that some files in https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ might still be updated. Can you let me know if/when that might be? Also, let me introduce @JohnVidler, who will be shepherding the remaining data through the semantic tagger pipeline next month ...

TomazErjavec commented 11 months ago

Great that the annotation is proceeding smoothly! As for the files that will still change: AT, CZ, UA, HU, TR. I think AT is finished, CZ should be soon, and UA probably also, within the next week. HU and TR too, but if not, I will hassle them. I would not like to receive changes after the 15th at the latest. We might also receive 2 new corpora, FI and LT, also by the 15th.

perayson commented 11 months ago

Thanks, that's useful to know, so we'll avoid processing those five for now, and watch for updated timestamps on your server ...

JohnVidler commented 11 months ago

Hello 😀

I'm not quite on contract for this yet, but @perayson has added me here a little early so I can have a poke around.

TomazErjavec commented 11 months ago

Hello @JohnVidler! Good that you will join us. I just sent you an invite to ParlaMint@GitHub, so you will be properly a member of the team.

clarin-eric / ParlaMint

Annotating words with USAS #204