Closed TomazErjavec closed 8 months ago
One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.
Agree
An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).
Using ptr is a nice hack, but I have to admit that I don't like it either. These things annoy me:
- ptr/@ana looks like you are annotating the pointer, but you want to annotate the word or MWE
- Does <ptr ana="usas:A13.3" target="#t3 #t4"/> express that t3 and t4 form one MWE?
I think the best solution for this situation is to introduce <standOff> annotations. I know that you did not want to introduce it before, but we will definitely introduce a kind of stand-off annotation with timelines and audio alignment, so it can probably solve this issue too.
<TEI>
<!-- ... -->
<standOff type="USAS-SEM">
<span ana="usas:Z8" target="#t1"/>
<span ana="usas:Z5" target="#t2"/>
<span ana="usas:A13.3" target="#t3 #t4"/>
<span ana="usas:Q2.2" target="#t5"/>
<span ana="usas:Z5" target="#t6"/>
<span ana="usas:G1.1" target="#t7"/>
<span ana="usas:X7p" target="#t8"/>
</standOff>
</TEI>
using ptr is a nice hack but I have to admit that I don't like it either.
I would say it's a tweak, not a hack, and I didn't actually say I don't like it - in fact, I do!
ptr/@ana looks like you are annotating the pointer but you want to annotate word or mwe
But then you could say exactly the same for linkGrp/link/@ana, which we use already for syntax.
Does this express that t3 and t4 form one mwe?
Yes, exactly.
I think the best solution for this situation is to introduce standOff
Yikes! This would open a whole new can of worms:
- standOff needs its separate TEI document, with its teiHeader, so we would introduce a completely new way of encoding linguistic annotations from all the rest
we will definitely introduce a kind of stand-off annotations with timelines and audio alignment
Why? What is wrong with the Parla-CLARIN recommendation? Except that the description is rather brief... This is the way I encoded speech alignment in GosVL, cf. http://hdl.handle.net/11356/1444, and I thought to use the same system in ParlaMint. This is quite similar to what is proposed in the TEI-based ISO 24624:2016, although they do use annotationBlock to wrap elements.
I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged A13.3.
Perhaps a separate <ptr> could be used for the MWE and we add something to the taxonomy for it e.g.
<ptr ana="usas:A13.3" target="#t3"/>
<ptr ana="usas:A13.3" target="#t4"/>
<ptr ana="usas:MWE" target="#t3 #t4"/>
But then you could say exactly the same for linkGrp/link/@ana, which we use already for syntax.
I don't think so. We annotate the link between two entities (not child or parent node) in the annotation of syntactic relation. But in USAS case we want to annotate the node (word or mwe) not the ptr.
Does this express that t3 and t4 form one mwe?
Yes, exactly.
How I understand this <ptr ana="usas:A13.3" target="#t3 #t4"/>:
A13.3
t3 and t4
Yikes! This would open a whole new can of worms
- standOff needs its separate TEI document, with its teiHeader, so we would introduce a completely new way of encoding linguistic annotations from all the rest
I don't think that standOff needs a separate TEI document. It can be placed at the end of the TEI document if it fulfils the following condition:
source: https://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SASOstdf
As a member of model.resource, standOff may occur as a child of TEI (or teiCorpus). If the metadata that describes the standOff is largely the same as the metadata that describes the associated resource (e.g., the transcribed text in text), then the standOff and the encoded associated resource may appear as children of the same TEI element. The example below has a transcription with <placename> elements in the text linked to a list of place elements in the standOff section.
- If we were to use it for semantic annotations, why not for syntax, and, in fact, all other linguistic annotations. Not saying this is a completely crazy idea, but it would mean redesigning the complete ling. annotation, I don't think we have the energy and time for that.
we will definitely introduce a kind of stand-off annotations with timelines and audio alignment
Why? What is wrong with the ...
sorry, this was a misleading comparison
I am not sure if we can agree now. I hope it will be brighter tomorrow.
I think I should point out that the example I sent to @TomazErjavec may have caused confusion. MWEs may have different semantic tags for each token within them, the example above just coincidentally happened to have 2 tokens tagged A13.3.
@matthewcoole I used your online tagger http://ucrel-api.lancaster.ac.uk/usas/tagger.html, and it made me think that I finally understand the point of MWEs. You are annotating both:
I have tried this sentence: The Manx name of the Isle of Man is Ellan Vannin. And the vertical output is:
0000001 002 ----- -----
0000003 010 AT The Z5
0000003 020 JJ Manx Z99
0000003 030 NN1 name Q2.2
0000003 040 IO of Z5
0000003 050 AT the Z5
0000003 060 NNL1 Isle Z2[i1.3.1 W3
0000003 070 IO of Z2[i1.3.2 Z5
0000003 080 NP1 Man Z2[i1.3.3
0000003 090 VBZ is A3+ Z5
0000003 100 NP1 Ellan Z1mf[i2.2.1 Z3c[i2.2.1
0000003 110 NP1 Vannin Z1mf[i2.2.2 Z3c[i2.2.2
0000003 111 . .
We can decompose this issue into two:
And then, we can annotate words in the @ana attribute: <w ana="usas:A3+ usas:Z5">is</w> and MWEs with <PSEUDO_MWE_ELEMENT ana="usas:..."/>
I think that makes sense. I should've been clearer. MWEs should have the same tags I think, but the tokens themselves may have other tags, as in the Isle of Man example above. Correct me if I'm wrong @perayson.
OK, if I try to summarize, also to check if I understand:
As for the encoding:
All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations.
All in all I somewhat prefer 1. above - it is the simplest in terms of encoding, and implementing a script that in one way or another fixes conflicts between name and phr should not be that hard - because any such conflicts mean that one annotation was in error, we do not spoil the annotations.
Agree.
For the record, full sample of The Manx name of the Isle of Man is Ellan Vannin. sentence:
<s>
<w ana="usas:Z5">The</w>
<w ana="usas:Z99">Manx</w>
<w ana="usas:Q2.2">name</w>
<w ana="usas:Z5">of</w>
<w ana="usas:Z5">the</w>
<phr ana="usas:Z2">
<w ana="usas:W3">Isle</w>
<w ana="usas:Z5">of</w>
<w>Man</w>
</phr>
<w ana="usas:A3+ usas:Z5">is</w> <!-- pointing to invalid ID, but it is not the point of this example -->
<phr ana="usas:Z1mf usas:Z3c">
<w>Ellan</w>
<w>Vannin</w>
</phr>
<pc>.</pc>
</s>
I just answered something along similar lines in #202. The tagger itself will output all the possible semantic tags for each word and MWE, but following contextual disambiguation, the first tag in the list should be the most likely. So, we could simplify things for ParlaMint and just provide the first choice tag and remove the remainder. That will give us a certain level of accuracy, but reduce recall of course.
So, it is now settled that USAS annotation:
I would split this encoding question into two parts, how to encode USAS tags in CoNLL-U (relevant for @perayson ), and how in TEI, which will be generated from CoNLL-U (for @matyaskopp and @TomazErjavec).
In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but, as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap). For this the standard way is to use the IOB encoding, which we already use for NER markup in CoNLL-U. This means that
SEM
SEM=O
SEM=B-<tags>
SEM=I-<tags>
So, we would have something like (with irrelevant columns skipped and randomly picked USAS tags):
# sent_id = ParlaMint-GB_2015-01-05-commons.seg1.2
# text = What progress her Department has made on implementing exit checks at borders.
1 What NER=O|SEM=O
2 progress NER=O|SEM=B-A10-,A12-
3 her NER=O|SEM=O
4 Department NER=O|SEM=B-A11.1+
5 has NER=O|SEM=O
6 made NER=O|SEM=B-A1.1.1,G2.2-,X9.2+,E3-,N5+,G2.1%,Z5
7 on NER=O|SEM=O
8 implementing NER=O|SEM=B-A1.1.1,H1%
9 exit NER=O|SEM=B-A11.1+,H2
10 checks NER=O|SEM=I-A11.1+,H2
11 at NER=O|SEM=O
12 borders NER=O|SpaceAfter=No|SEM=B-ZZ2
13 . NER=O|SEM=O
@matyaskopp, do you agree?
As for the TEI encoding, I would postpone this discussion until the CoNLL-U format is finalised.
- the tags are composed of alphanumeric characters and full stop. An inspection of tag sequences also shows usage of +, -, /, %, and space.
Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,
10 checks X2.4/A5.3 A15 Q1.2 T2- M5
can be encoded this way:
10 checks SEM=B-X2.4/A5.3,A15,Q1.2,T2-,M5
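The substitution above (spaces between tag alternatives replaced by commas) can be sketched in a few lines of Python. This is only an illustration: the function name `encode_sem` and the choice to reject tags containing `,` or `|` are assumptions, not part of any agreed script.

```python
def encode_sem(tag_sequence: str, prefix: str = "B") -> str:
    """Turn a space-separated USAS tag sequence into a MISC-style SEM value.

    Hypothetical helper: the IOB prefix is attached once, in front of the
    whole comma-separated list of tag alternatives.
    """
    tags = tag_sequence.split()
    if not tags:
        raise ValueError("empty tag sequence")
    for tag in tags:
        # ',' is the new separator and '|' separates MISC items,
        # so neither may occur inside a tag.
        if "," in tag or "|" in tag:
            raise ValueError(f"tag {tag!r} contains a reserved character")
    return f"SEM={prefix}-" + ",".join(tags)


print(encode_sem("X2.4/A5.3 A15 Q1.2 T2- M5"))
```

For the `checks` token above this prints `SEM=B-X2.4/A5.3,A15,Q1.2,T2-,M5`.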
In CoNLL-U, it is agreed that the USAS tags go into the MISC column. The simple way is to support only mark-up of individual words, but as long as we have MWEs, we could mark this up as well (under the assumption that two MWEs never overlap).
True if "overlap" means that the tags in the second, ... positions have the same "semantic segmentation", i.e. the second and later tags describe the same span (MWE or single token) as the first one.
Otherwise, we have to add B- and I- prefixes to all tags:
9 exit NER=O|SEM=B-A11.1+,B-H1
10 checks NER=O|SEM=I-A11.1+,B-H2
Now I see my previous comment and example, it seems that there are nested semantic spans:
0000001 002 ----- -----
0000003 010 AT The Z5
0000003 020 JJ Manx Z99
0000003 030 NN1 name Q2.2
0000003 040 IO of Z5
0000003 050 AT the Z5
0000003 060 NNL1 Isle Z2[i1.3.1 W3
0000003 070 IO of Z2[i1.3.2 Z5
0000003 080 NP1 Man Z2[i1.3.3
0000003 090 VBZ is A3+ Z5
0000003 100 NP1 Ellan Z1mf[i2.2.1 Z3c[i2.2.1
0000003 110 NP1 Vannin Z1mf[i2.2.2 Z3c[i2.2.2
0000003 111 . .
@matyaskopp it looks like you either haven't seen or read what I wrote, because:
Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,
Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.
Otherwise, we have to add B- and I- prefixes to all tags:
I already proposed adding B and I prefixes to all the tags, we need them to be able to properly encode MWEs (even without nesting). We cannot count on two neighbouring words to always have different tags, also, we already encode NEs with IOB, so it is only sensible to encode semantic annotations in the same way.
it seems that there are nested semantic spans
If this is true (and maybe @perayson can confirm), then we do have a problem. One option would of course be to ignore the inner tags, the same way we do for NER (well, except CZ). If this is for some reason not possible, we have to think again...
@matyaskopp it looks like you either haven't seen or read what I wrote, because:
Tags do not contain spaces, spaces are tag separators, so they can be replaced with character that is not present in tag, e.g. ,
Yes, that is why I wrote "tag sequences", and also proposed substituting spaces with commas.
Sorry, I hadn't understood it properly: I imagined a vertical sequence of tags that correspond to multiple tokens. Now I see what you meant: a "tag sequence" is a list of "tag alternatives".
The format with the 10 digits at the start of each line comes from the C version of USAS, and the addition of sequences like [i1.3.2 was intended to show the MWE sequences: [i is the separator, 1 = ID unique within the current file, 3 = this MWE is three words long, and 2 = this is the second element of the MWE sequence. PyMUSAS doesn't produce this format, see https://ucrel.github.io/pymusas/usage/how_to/tag_text instead for output examples. I think the BIO format is a good idea as long as we say that it represents information about the most likely (i.e. first) tag in the list, i.e. if the word is part of the MWE, then we use B and I for internal MWE parts, otherwise O. The remainder of the tag list might include other (non-preferred) MWE tags or single word tags (but we ignore the MWE information about the non-preferred tags).
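To make the `[i` marker concrete, here is a rough Python sketch that splits a tag such as `Z2[i1.3.2` into the tag, the MWE ID, the MWE length and the position, and derives a B/I/O marker from the position. The names `MweTag`, `parse_tag` and `iob_prefix` are made up for this illustration, and the regular expression only covers the marker shape seen in the examples in this thread.

```python
import re
from typing import NamedTuple, Optional


class MweTag(NamedTuple):
    tag: str                 # the semantic tag itself, e.g. "Z2"
    mwe_id: Optional[int]    # ID unique within the current file
    length: Optional[int]    # number of words in the MWE
    position: Optional[int]  # 1-based position of this word in the MWE


# tag, then "[i", then id.length.position
_MARKER = re.compile(r"^(?P<tag>[^\[]+)\[i(?P<id>\d+)\.(?P<len>\d+)\.(?P<pos>\d+)$")


def parse_tag(raw: str) -> MweTag:
    """Split a C-version USAS tag into the tag and its optional MWE marker."""
    m = _MARKER.match(raw)
    if m is None:
        # No marker: a plain single-word tag.
        return MweTag(raw, None, None, None)
    return MweTag(m["tag"], int(m["id"]), int(m["len"]), int(m["pos"]))


def iob_prefix(raw: str) -> str:
    """B for the first MWE word, I for later MWE words, O outside an MWE."""
    position = parse_tag(raw).position
    if position is None:
        return "O"
    return "B" if position == 1 else "I"


for raw in ["Z2[i1.3.1", "Z2[i1.3.2", "Z2[i1.3.3", "A3+"]:
    print(raw, parse_tag(raw), iob_prefix(raw))
```

Applied only to the first tag of each token, this would give the B/I/O reading described above; the marker information on non-preferred tags would simply be ignored.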
Note that punctuation also gets a Z9 tag in wmatrix6, PyMUSAS itself (at the moment) outputs PUNCT (which I should change). Function words will also get other Z tags. SEM=O-X3.4 will indicate that X3.4 is a single word semantic tag.
PyMUSAS doesn't produce this format
OK, this is a relief!
see https://ucrel.github.io/pymusas/usage/how_to/tag_text instead for output examples
Is there maybe a document explaining the format? Namely, the only output example I understand there is for English, and that one is rather short. In particular I'm still not completely clear on whether all words inside a MWE have the same tags, or can they differ, so that the complete MWE has one tag, but there can be further word-specific tags in the tag list, so that in effect tags can be nested. By what you write below, my guess is that the second is the case, so I continue with that understanding.
I think we then have two options:
1 use IOB for all sem tags, and always keep only the first tag. The fact that something is a MWE or not is distinguished by the I tag on the second, third etc. MWE token. As you write that all tokens get a sem tag, we would not in fact have any O tags
2 use two attributes, say (1) SEMMWE for MWEs, which uses IOB format, it then contains the first (MWE) tag only, if it is inside a MWE (B for first, I for internal), and O otherwise, and (2) a SEM tag for single word tag lists, which does not need to be IOB, but just gives a list of tags for each token (without the MWE tag)
So, let's say we have a sentence "a b c", with b and c being a MWE. a has sem tags 1,2, b has 3, c has 4, while the MWE tag is M, so that b has the list "M,3" and c has "M,4".
So, the first option would give:
a SEM=B-1
b SEM=B-M
c SEM=I-M
and the second would be:
a SEMMWE=O SEM=1,2
b SEMMWE=B-M SEM=3
c SEMMWE=I-M SEM=4
I hope it is more or less clear what I mean....
I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?
Note that punctuation also gets a Z9 tag in wmatrix6, PyMUSAS itself (at the moment) outputs PUNCT (which I should change).
OK, so I guess this means that all tokens get a sem tag, nice.
I am slightly in favour of the first, as it is simpler, but the second does not lose any information. Thoughts?
I see another option:
3 use IOB for all sem tags (O will never be used if all tokens are tagged). Add I- or B- prefix to all tags and separate the tags with a comma:
a SEM=B-1,B-2
b SEM=B-M,B-3
c SEM=I-M,B-4
It is a simple encoding, and an MWE will never overlap with a different MWE. You can always get the first option from it: s/,.*$//
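A small Python sketch of option 3 for the "a b c" sentence, together with the reduction to option 1 via the substitution above. The function names and the way the MWE information is passed in are assumptions made for this illustration only.

```python
import re


def encode_option3(mwe, word_tags):
    """Build an option-3 SEM value.

    mwe is None for a word outside any MWE, else (mwe_tag, position) with a
    1-based position inside the MWE. word_tags are the single-word tags.
    """
    parts = []
    if mwe is not None:
        tag, position = mwe
        # The MWE tag comes first: B- on the first word, I- on the others.
        parts.append(("B-" if position == 1 else "I-") + tag)
    # Single-word tags always start (and end) on this token, hence B-.
    parts.extend("B-" + t for t in word_tags)
    return "SEM=" + ",".join(parts)


def reduce_to_option1(sem):
    """Python equivalent of s/,.*$// : keep only the first tag."""
    return re.sub(r",.*$", "", sem)


tokens = [
    ("a", None, ["1", "2"]),
    ("b", ("M", 1), ["3"]),
    ("c", ("M", 2), ["4"]),
]
for form, mwe, word_tags in tokens:
    sem = encode_option3(mwe, word_tags)
    print(form, sem, reduce_to_option1(sem))
```

The second column reproduces the option 3 lines above and the third column the option 1 lines.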
OK, so I guess this means that all tokens get a sem tag, nice.
yes because we don't have syntactic analysis for the translated version (so no syntactic tokens).
I see another option: use IOB for all sem tags (O will never be used if all tokens are tagged). Add I- or B- prefix to all tags and tags separate with a comma
Wow, good one! It is a slight perversion over the usual IOB rules but I think it solves all the problems, so I also vote for it!
I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled O (not B). To answer an earlier question, for semantic MWEs, each word has the same semantic field tag. So, for those, I would use B for the first part, and I for the rest. My preferred option is to keep the tags and MWE markers separate (as per the current pymusas output), so here is option 4:
a SEMMWE=O SEM=1,2
b SEMMWE=B SEM=M,3
c SEMMWE=I SEM=M,4
I'm not sure that I understand option 3 because it makes sense to me that a word which is not part of a MWE should be labelled O (not B).
The idea is that there is no formal distinction between single and multi word expressions, the only difference is that tags for single words will always be B, while for MWEs the second, third etc. word will be I. And, as every token gets a semantic tag, there are no tokens outside some semantic tag, so nothing is marked with O.
As for your suggestion, the use of IOB here is different from what it is otherwise, e.g. in our NER annotation. If you want to split MWE annotations from single word annotations, a more common way would be:
a SEMMWE=O SEM=1,2
b SEMMWE=B-M SEM=3
c SEMMWE=I-M SEM=4
Still, not to overcomplicate: why don't you do it in the way that you feel most comfortable with, and then me and @matyaskopp can, if necessary, modify it to be the most in line with our NER annotation?
Thanks, so I've tweaked the tagging script to produce this format:
# sent_id = ParlaMint-ES-CT_2022-01-25-2201.1.0.8.1
# source = Sí; senyor Garriga ?
# text = Yes, Mr. Garriga?
1 Yes yes INTJ UH _ 0 _ _ ForwardAlignment=1|BackwardAlignment=1|NER=O|SpaceAfter=No|SEMMWE=O|SEM=Z4
2 , , PUNCT , _ 1 _ _ ForwardAlignment=2|BackwardAlignment=2|NER=O|SEMMWE=O|SEM=Z9
3 Mr. Mr. PROPN NNP Number=Sing 2 _ _ ForwardAlignment=3|BackwardAlignment=3|NER=O|SEMMWE=B|SEM=Z1mf,Z3c
4 Garriga Garriga PROPN NNP Number=Sing 3 _ _ ForwardAlignment=4|BackwardAlignment=4|NER=B-PER|SpaceAfter=No|SEMMWE=I|SEM=Z1mf,Z3c
5 ? ? PUNCT . _ 4 _ _ ForwardAlignment=5|BackwardAlignment=5|NER=O|SEMMWE=O|SEM=Z9
If you approve, then I can start to set up the tagging jobs. By the way, please can you confirm where I should download the final MT CONLLU format data from?
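For anyone consuming this format, the SEMMWE/SEM pair in the sample above could be read back roughly as follows. This is a sketch only: `parse_misc` and `semantic_units` are hypothetical names, and a stray I without a preceding B is simply treated as a single-word unit.

```python
def parse_misc(misc):
    """Split a CoNLL-U MISC value into a dict; '_' means empty."""
    if misc == "_":
        return {}
    fields = {}
    for item in misc.split("|"):
        key, _, value = item.partition("=")
        fields[key] = value
    return fields


def semantic_units(rows):
    """Group tokens into semantic units.

    rows is a list of (token_id, misc) pairs; the result is a list of
    (token_ids, tags) pairs, one per single word or MWE.
    """
    units = []
    in_mwe = False
    for token_id, misc in rows:
        fields = parse_misc(misc)
        marker = fields.get("SEMMWE", "O")
        tags = fields["SEM"].split(",") if "SEM" in fields else []
        if marker == "I" and in_mwe:
            # Continuation of the MWE opened by the last B.
            units[-1][0].append(token_id)
        else:
            units.append(([token_id], tags))
            in_mwe = marker == "B"
    return units


rows = [
    (1, "ForwardAlignment=1|BackwardAlignment=1|NER=O|SpaceAfter=No|SEMMWE=O|SEM=Z4"),
    (2, "ForwardAlignment=2|BackwardAlignment=2|NER=O|SEMMWE=O|SEM=Z9"),
    (3, "ForwardAlignment=3|BackwardAlignment=3|NER=O|SEMMWE=B|SEM=Z1mf,Z3c"),
    (4, "ForwardAlignment=4|BackwardAlignment=4|NER=B-PER|SpaceAfter=No|SEMMWE=I|SEM=Z1mf,Z3c"),
    (5, "ForwardAlignment=5|BackwardAlignment=5|NER=O|SEMMWE=O|SEM=Z9"),
]
for token_ids, tags in semantic_units(rows):
    print(token_ids, tags)
```

Tokens 3 and 4 ("Mr. Garriga") come out as one unit with the tags Z1mf and Z3c; all other tokens are single-word units.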
If you approve, then I can start to set up the tagging jobs.
Agree with this format
By the way, please can you confirm where I should download the final MT CONLLU format data from?
You can download it directly from the repository (@TomazErjavec please confirm):
Kuzman, Taja; et al., 2023, Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0, Slovenian language resource repository CLARIN.SI, ISSN 2820-4042, http://hdl.handle.net/11356/1810.
Note that there will be changes/updates and newly added corpora. We will send them once they are ready (partners will deliver data and @TajaKuzman will translate it):
So please start with the stable corpora.
Thanks, I will make a start on the first one next week, and then I'll have a better estimate of running times
Yes, http://hdl.handle.net/11356/1810 has the official files for 3.0. But you also have them at https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ Maybe better as a rule take them from there, in case any changed (AT has been already) or are added. Once you have even a few files, if you could send a sample, so we can try implementing the conll2tei2vertical pipeline. For sending the files, best would be a http server with tar.gz files, or anything else that wget understands. Also, what about ParlaMint-GB? That one obviously doesn't have a translation, so I guess we take the originals. Which are of course slightly different but nowhere critical I think.
Thanks, I've tagged LV first, and the resulting tgz file is here: https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ I'll keep going with countries not in @matyaskopp's list of six, and taking files from your nl.ijs.si server. If you can check the format, and confirm you're happy for it as input to your conll2tei2vertical pipeline, then I'll parallelise further to speed up. Last night, it looks like approximately 3M words/hour. Re ParlaMint-GB, I could take those, but do you want to harmonise to the same CONLLU format first?
Thanks, I've downloaded the file and had a peek. At first glance it looks ok, except for a few details:
# sent_id = ParlaMint-LV_2021-01-07-PT13-2173-U1-P1.1
# source = Labrīt, godātie deputāti!
# text = Good morning, honourable Members!
The format should be as in the originals, i.e. exactly one space on each side of the equals sign.
-pymusas, as the pipeline is set up for the original names.
More later, if anything else turns out, but I think it is all ok, except if @matyaskopp disagrees.
For speed, I hope it works out, the complete corpus is 1,3 billion words. As for GB, no, we won't be harmonising the CoNLL-U, as the way GB has it is the way all original corpora have it. But all the relevant bits for you are the same in both cases, so I hope that won't be a problem.
I have one note: USAS files contain different pos and xtag, not sure if we agreed on that
USAS files contain different pos and xtag
Well spotted! Indeed, we haven't agreed on that, and at least UPOS tag should be as in the original. Maybe XPOS too.
Noticed one more thing, extra quotes which should be removed:
MT input: # text = What, please, is "for"?
USAS output: # text = "What, please, is ""for""?"
Noticed one more thing, extra quotes which should be removed: MT input:
# text = What, please, is "for"?
USAS output: # text = "What, please, is ""for""?"
Confirming illustration:
Many thanks both for spotting all these, the issues hadn't emerged from the smaller tests I'd done so far. I should be able to fix the tabs and the filenames.
For the UPOS and XPOS tags, these are emerging from spacy's pipeline and influence the semtag decisions, whereas I think you said the originals were coming from Stanza? I'd argue to keep the spacy versions since the stanza ones will already be in the released data, and this is the semtag pipeline. If you insist on the stanza ones, I should be able to swap them back in.
I'll check on the extra quotes, I think that's due to the csv writer I'm using. I'll wait for the slightly larger PT set to run today, do you want to look at those too? And then I'll update the script tomorrow and share revised versions.
For the UPOS and XPOS tags, these are emerging from spacy's pipeline and influence the semtag decisions, whereas I think you said the originals were coming from Stanza? I'd argue to keep the spacy versions since the stanza ones will already be in the released data, and this is the semtag pipeline. If you insist on the stanza ones, I should be able to swap them back in.
Things that should be considered:
upos and xpos set?
I am slightly in favour of annotating everything with Stanza if it is okay for USAS (and asking @nljubesi to annotate ParlaMint-GB with Stanza).
Thanks, @matyaskopp for the analysis. I agree, even though I doubt that @nljubesi would have the time to annotate GB.
I'd argue to keep the spacy versions since the stanza ones will already be in the released data, and this is the semtag pipeline. If you insist on the stanza ones, I should be able to swap them back in.
My idea was that the stanza ones with added USAS will be in the released data, as otherwise we cannot include USAS in the release CoNLL-U files, which would be a shame. So, yes, if you could swap them back in that would be great.
In the meantime I enhanced cp-conllu.pl to be able to deal with the extraneous quotes, so you don't actually have to bother with that.
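For illustration only, undoing this kind of CSV-style quoting on the `# text = ` comment lines could look like the Python sketch below. It is not the actual code in cp-conllu.pl; the function name is hypothetical, and it assumes that a wrapped value whose inner quotes are all doubled really did come from a CSV writer.

```python
def unquote_text_comment(line):
    """Remove CSV-style quoting from a '# text = ' comment line.

    Lines that are not '# text = ' comments, or whose value does not look
    CSV-quoted, are returned unchanged.
    """
    prefix = "# text = "
    if not line.startswith(prefix):
        return line
    value = line[len(prefix):]
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        inner = value[1:-1]
        # In CSV quoting every literal quote inside the field is doubled,
        # so no single quote may remain once the pairs are removed.
        if '"' not in inner.replace('""', ''):
            return prefix + inner.replace('""', '"')
    return line


print(unquote_text_comment('# text = "What, please, is ""for""?"'))
```

On the example reported above this prints `# text = What, please, is "for"?`, and a line that was never quoted passes through untouched.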
Hi both, I've now fixed the tabs, kept the original filenames, retained the original UPOS and XPOS columns, removed the quoted quotes, and rerun the LV data. The result is here again: https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ for you to check and confirm. Thanks!
Thanks, checked, looks good to me! Pls. feel free to start with the other corpora.
@perayson, I have spotted one more issue, you have changed the lemma:
Other examples of lemma differences: (I transformed your data back to input data - s/\|SEMMWE.*$//)
< ENGLISH
> USAS
< 8 DID did VERB VBD Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin 7 _ _ ForwardAlignment=9|BackwardAlignment=9|NER=O
> 8 DID do VERB VBD Mood=Ind|Number=Sing|Person=2|Tense=Past|VerbForm=Fin 7 _ _ ForwardAlignment=9|BackwardAlignment=9|NER=O
< 15 's 's PART POS _ 14 _ _ NER=O
> 15 's be PART POS _ 14 _ _ NER=O
< 2 'd would AUX MD VerbForm=Fin 1 _ _ ForwardAlignment=2|BackwardAlignment=2|NER=O
> 2 'd have AUX MD VerbForm=Fin 1 _ _ ForwardAlignment=2|BackwardAlignment=2|NER=O
< 30 I i NUM CD NumForm=Roman|NumType=Card 29 _ _ ForwardAlignment=30|BackwardAlignment=30|NER=O
> 30 I I NUM CD NumForm=Roman|NumType=Card 29 _ _ ForwardAlignment=30|BackwardAlignment=30|NER=O
< 19 being being NOUN NN Number=Sing 18 _ _ ForwardAlignment=10|NER=O|SpaceAfter=No
> 19 being be NOUN NN Number=Sing 18 _ _ ForwardAlignment=10|NER=O|SpaceAfter=No
Thanks for spotting these. It will be spacy producing different lemmas than stanza in these instances. I expect you'd prefer the output lemma to be the same as the input lemma? As with the POS columns, I feel that this renders the output inconsistent in terms of the decisions made for the semantic tagger, but I can understand why you want to keep things consistent with the non semantically tagged version, so I will change it if you prefer.
I've now rerun the LV data and shared it for you to check again. In parallel, I'll start tagging some more files ...
I expect you'd prefer the output lemma to be the same as the input lemma?
Yes, that would be nice.
As with the POS columns, I feel that this renders the output inconsistent in terms of the decisions made for the semantic tagger.
If you would like to keep the Spacy PoS + lemma, you could put them in the MISC column (where you put the semantic annotations), like e.g. SpacyPoS=xxx|SpacyLemma=yyy. However, neither xxx nor yyy should contain a | for obvious reasons, which could be a problem.
I've now rerun the LV data and shared it for you to check again.
Thanks; as before, they look ok to me, but @matyaskopp has the eagle eye!
If you would like to keep the Spacy PoS + lemma, you could put them in the MISC column (where you put the semantic annotations), like e.g. SpacyPoS=xxx|SpacyLemma=yyy. However, neither xxx nor yyy should contain a | for obvious reasons, which could be a problem.
I like the idea of adding Spacy tagging to the MISC column if it is not a lot of work. You can also add xPos, so the fields can be: SpacyUPoS=xxx|SpacyXPoS=yyy|SpacyLemma=zzz.
Note that this information will be only in conllu files, we are not able to propagate it to TEI without adding new TEI features.
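As a rough sketch of how these fields might be appended to MISC while guarding against the `|` problem mentioned earlier, consider the Python below. The function name is hypothetical, and raising an error on `|` is just one possible choice (escaping or replacing the character would be another).

```python
def add_spacy_fields(misc, upos, xpos, lemma):
    """Append SpacyUPoS, SpacyXPoS and SpacyLemma to a MISC value."""
    extra = {"SpacyUPoS": upos, "SpacyXPoS": xpos, "SpacyLemma": lemma}
    for key, value in extra.items():
        # '|' separates MISC items, so it must not occur inside a value.
        if "|" in value:
            raise ValueError(f"{key} value {value!r} contains '|'")
    # '_' is the CoNLL-U placeholder for an empty MISC column.
    items = [] if misc == "_" else misc.split("|")
    items.extend(f"{key}={value}" for key, value in extra.items())
    return "|".join(items)


print(add_spacy_fields("NER=O|SEMMWE=O|SEM=Z5", "VERB", "VBD", "do"))
```

This prints `NER=O|SEMMWE=O|SEM=Z5|SpacyUPoS=VERB|SpacyXPoS=VBD|SpacyLemma=do`.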
Thanks; as before, they look ok to me, but @matyaskopp has the eagle eye!
The sample seems to be ok to me, thanks!
Thanks, that's a great idea to add these values into MISC. And including this in the conllu files only rather than extending TEI is fine by me of course. I've tweaked the script and rerun it over the LV data one more time and this is now available to check.
@perayson, thanks for the changes. It is ok.
Great, I'm making progress through the files now. Just to check, in a folder of a file you've shared, ParlaMint-NL-en.conllu/2015, there are some zero length files. Is this a bug?
Great, I'm making progress through the files now. Just to check, in a folder of a file you've shared, ParlaMint-NL-en.conllu/2015, there are some zero length files. Is this a bug?
No, these files contain only a comment section and no discussion, e.g. this one:
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0" ana="#parla.sitting #reference" xml:lang="nl" xml:id="ParlaMint-NL_2015-09-22-tweedekamer-8">
<!-- SKIPPING HEADER -->
<text ana="#reference" xml:lang="nl">
<body>
<div type="commentSection">
<head>Beëdiging mevrouw J.G.P. Vermue</head>
<note type="comment">De voorzitter:</note>
<note type="comment">Ik geef het woord aan mevrouw Neppérus tot het uitbrengen van verslag namens de commissie voor het onderzoek van de Geloofsbrieven.</note>
<note type="comment">Mevrouw, voorzitter der commissie:</note>
<note type="comment">Voorzitter. De commissie voor het onderzoek van de Geloofsbrieven heeft de stukken onderzocht die betrekking hebben op mevrouw J.G.P. Vermue te Schoondijke.</note>
<note type="comment">De commissie is eenstemmig tot de conclusie gekomen dat mevrouw J.G.P. Vermue te Schoondijke terecht benoemd is verklaard tot lid van de Tweede Kamer der Staten-Generaal.</note>
<note type="comment">De commissie stelt u daarom voor om haar toe te laten als lid van de Kamer. Daartoe dient zij wel eerst de verklaringen en beloften, zoals die zijn voorgeschreven bij de Wet beëdiging ministers en leden Staten-Generaal van 27 februari 1992, Staatsblad nr. 120, af te leggen.</note>
<note type="comment">De commissie verzoekt u tot slot om de Kamer voor te stellen, het volledige rapport in de Handelingen op te nemen.</note>
<note type="comment">De voorzitter:</note>
<note type="comment">Ik dank namens de Kamer de commissie voor haar verslag en stel voor, dienovereenkomstig te besluiten.</note>
<note type="comment">Daartoe wordt besloten.</note>
<note type="comment">Het rapport is opgenomen aan het eind van deze editie.</note>
<note type="comment">De voorzitter:</note>
<note type="comment">Ik verzoek de leden en de overige aanwezigen in de zaal en op de publieke tribune, voor zover dat mogelijk is, te gaan staan.</note>
<note type="comment">Mevrouw Vermue is in het gebouw der Kamer aanwezig om de voorgeschreven verklaringen en beloften af te leggen.</note>
<note type="comment">Ik verzoek de griffier, haar binnen te leiden.</note>
<note type="comment">Nadat mevrouw Vermue door de griffier is binnengeleid, legt zij in handen van de voorzitter de bij de wet voorgeschreven verklaringen en beloften af.</note>
<note type="comment">De voorzitter:</note>
<note type="comment">Mevrouw Vermue, ik wens u van harte geluk met het lidmaatschap van de Kamer. Ik geef u nu de hand.</note>
<note type="comment">Omdat we de presentielijst niet meer met de hand invullen, kan ik aan de Kamer melden dat uw naam elektronisch op de presentielijst zal worden verwerkt, want zo doen we het tegenwoordig! Ik verzoek u om in ons midden plaats te nemen. Na de stemmingen — want u kunt gelijk aan het werk met stemmen — is er gelegenheid tot feliciteren van onze nieuwe collega.</note>
<note type="comment">Applaus</note>
</div>
</body>
</text>
</TEI>
Ok, we'll return a zero length file as well then for these
I've been running further files and putting the results in https://ucrel.lancs.ac.uk/paul/parlamint/PyMUSASTagged/ as I go. There was an issue in one of the BA 2006 files which I'll need to investigate further. Earlier in the thread, @matyaskopp @TomazErjavec you mentioned that some files in https://nl.ijs.si/et/tmp/ParlaMint/MT/CoNLL-U-en/ might still be updated. Can you let me know if/when that might be? Also, hereby to introduce @JohnVidler who will be shepherding the remaining data through the semantic tagger pipeline next month ...
Great that the annotation is proceeding smoothly! As for the files that will still change: AT, CZ, UA, HU, TR. I think AT is finished, CZ should be soon, UA probably also, like in the next week. HU, TR too but, if not, I will hassle them. I would not like to receive changes after the 15th at the latest. We might also receive 2 new corpora: FI, LT, also by the 15th.
Thanks, that's useful to know, so we'll avoid processing those five for now, and watch for updated timestamps on your server ...
Hello 😀
I'm not quite on contract for this yet, but @perayson has added me here a little early so I can have a poke around.
Hello @JohnVidler! Good that you will join us. I just sent you an invite to ParlaMint@GitHub, so you will be properly a member of the team.
The USAS semantic tags will be encoded in a taxonomy (cf. #202), but there remains the question of how to encode these tags (or, rather, references to the IDs of the taxonomy categories) on word tokens. An important complication is that USAS can also tag multi-word expressions (MWEs).
One option would be to directly mark the USAS tag in w/@ana, and, for MWEs, introduce a new element (probably phr) and mark phr/@ana. However, there is a real danger that phr will at times conflict with name, leading to non-well formed XML or difficult fixes.
An alternative which does not have these problems is to use linkGrp, similarly to how we use it for syntax. Here the problem is that the link elements that we used so far inside linkGrp require at least two IDREFs as the value of their @target, but with USAS we will typically (except for MWEs) have only 1 IDREF. But this can be accommodated by using ptr instead of link (note that ptr/@target can also have several IDREFs).
In line with this, the encoding (suitably simplified) could be like:
@matyaskopp, do you see any problems with this suggestion?