PL: many transcriber comments not annotated

ajdapretnar commented 3 years ago

For the Polish data set, the top keywords include dzwonek (bell) and oklaski (applause). These should not be among the top keywords, because they are audio notations, not an actual part of the speech. Dzwonek also often gets tagged as a named entity.

matyaskopp commented 3 years ago

I probably don't understand the point of this issue...

PL corpus is correctly encoded: https://github.com/clarin-eric/ParlaMint/blob/3c8ad8aeab6d854cdd5e9113115b944e37d7e6d9/ParlaMint-PL/ParlaMint-PL_2018-09-27-senat-65-2.ana.xml#L396-L403

In this case, kinesic is within the u (speech element) because it happens during the speech. Everything that has been said by a speaker is in the seg element.

So if you want the top keywords, you can just look at the //u/seg elements. If you need the exact words, look at the w elements in the annotated version of the corpus (or at the //w/@lemma attribute for fusional languages).
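As a rough illustration of the suggestion above, here is a minimal Python sketch that counts lemmas only inside //u/seg, so that the text of a kinesic element is never counted. The sample XML is a hypothetical miniature, not copied from the corpus; the speaker id, the kinesic type value and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical miniature of an annotated file; real files are much richer.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#SpeakerA">
      <seg><s><w lemma="pan">Panie</w><w lemma="senator">senatorze</w><pc>.</pc></s></seg>
      <kinesic type="ringing"><desc>Dzwonek</desc></kinesic>
      <seg><s><w lemma="pan">Panie</w><pc>.</pc></s></seg>
    </u>
  </div></body></text>
</TEI>"""


def spoken_lemmas(root):
    """Count lemmas of w elements found inside //u/seg only."""
    counts = Counter()
    for seg in root.findall(f".//{TEI}u/{TEI}seg"):
        for w in seg.iter(TEI + "w"):
            # Fall back to the word form if a w has no lemma attribute.
            counts[w.get("lemma", w.text)] += 1
    return counts


counts = spoken_lemmas(ET.fromstring(SAMPLE))
print(counts.most_common())
```

Because the kinesic element is a sibling of seg rather than a descendant, "Dzwonek" never reaches the counter in this sample.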

Do you have an example of incorrect encoding?

ajdapretnar commented 3 years ago

Perhaps NoSketch is not properly recognizing the tags then?

These are the outstanding words for the COVID period for regular MPs.

Then, looking at the concordances of "dzwonek": I am not sure whether in this case "Dzwonek" really refers to people and places, but it looks incorrect (a Polish student confirmed this).

matyaskopp commented 3 years ago

OK, so it is not a problem with the data, but a problem with the representation in NoSketch. I believe that it will be solved with #83. @TomazErjavec, am I right?

ajdapretnar commented 3 years ago

How about the named entities? The "PER:" and "LOC:" tags for "dzwonek"? Do they make sense in the original? Or might this be an issue?

matyaskopp commented 3 years ago

Oh, I had not studied the screenshot carefully. It is really weird; it should not happen!

Every named entity should contain at least one token.
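One way to check this invariant is to look for name elements that contain no token at all. The sketch below assumes that tokens are w or pc elements and that the entity class sits in the type attribute; the sample fragment and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"
TOKEN_TAGS = (TEI + "w", TEI + "pc")

# Hypothetical fragment: one well-formed entity and one empty entity.
SAMPLE = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <name type="PER"><w lemma="Jan">Jan</w><w lemma="Kowalski">Kowalski</w></name>
  <name type="LOC"></name>
</s>"""


def empty_names(root):
    """Return the type of every name element that has no token descendant."""
    empty = []
    for name in root.iter(TEI + "name"):
        if not any(el.tag in TOKEN_TAGS for el in name.iter()):
            empty.append(name.get("type"))
    return empty


print(empty_names(ET.fromstring(SAMPLE)))
```

Any type reported by such a check would point at an entity that violates the rule stated above.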

TomazErjavec commented 3 years ago

> OK, so it is not a problem with the data, but a problem with the representation in NoSketch. I believe that it will be solved with #83. @TomazErjavec, am I right?

Not quite: currently, the contents of incidents (represented in vert/noSkE as <note>) are indeed encoded in vert/noSkE as 1 token; however, these tokens a) are bracketed (so e.g. "[Dzwonek]") and b) are without annotations, i.e. they do not get lemma, pos, etc., and neither should they be included in <name>/NER tags. So, all the examples of "Dzwonek" that @ajdapretnar has shown above in the concordances are part of regular text. At the same time, some (well, most) do appear inside incidents, i.e. are correctly encoded.

These are the stats:

$ cat ParlaMint-PL.vert/*.vert | fgrep -c Dzwonek
17904
$ cat ParlaMint-PL.vert/*.vert | fgrep -c '[Dzwonek]'
15603

So, most are ok, but by no means all.

TomazErjavec commented 3 years ago

OK, the summary is that PL has 2301 cases of "Dzwonek" as part of the text when, presumably (?), they should be encoded as incidents. This is about 13% of all occurrences of "Dzwonek", so not a negligible amount. Not sure why @matyaskopp removed the bug label, as this presumably is a bug. I guess the issue should remain open (even though its name is not really the best) in the hope that somebody fixes this in the fullness of time.
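For anyone who wants to re-check these figures, here is a small Python sketch of the same arithmetic. It mirrors what `fgrep -c` does (counting lines that contain the string, not individual matches); the function names and the sample lines are made up for illustration.

```python
def count_lines(lines, word="Dzwonek"):
    """Count lines containing `word`, and those containing it in brackets."""
    total = bracketed = 0
    for line in lines:
        if word in line:
            total += 1
            if f"[{word}]" in line:
                bracketed += 1
    return total, bracketed


def plain_share(total, bracketed):
    """Return the number of unbracketed hits and their share of all hits."""
    plain = total - bracketed
    return plain, plain / total


# Hypothetical vert-like lines: one incident, two hits in regular text.
sample = ["[Dzwonek]", "Panie", "(", "Dzwonek", ")", "Dzwonek"]
print(count_lines(sample))  # (3, 1)

# The totals reported by fgrep above.
plain, share = plain_share(17904, 15603)
print(plain, f"{share:.1%}")  # 2301 12.9%
```

So the "about 13%" above is 2301 out of 17904, rounded.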

matyaskopp commented 3 years ago

I assumed that a "Dzwonek" is part of the speech (not just a note), but if it should be encoded as an incident, then it is a bug (placing the label back).

I was not able to trace it back to the original source data to see how it is encoded there, because the PL data do not precisely reference the source. (This should be an issue for the next releases: keeping back-references to the source.)

TomazErjavec commented 1 year ago

@mrudolf, don't forget to address this issue pls. And close or tell us to close when fixed.

mrudolf commented 1 year ago

Alas, pandemic sessions are apparently badly annotated by the Parliament, with many speakers missing.

We are now proofreading all the sessions, but this will only be finished in February. I hope it will be possible to update our corpora to the corrected version then.

TomazErjavec commented 1 year ago

This issue still persists in the (draft) 3.0: e.g. for the query "(, Dzwonek, )" there are 6,785 hits. However, to be fair, in all the ones I looked at, "(Dzwonek)" appears in the middle of a sentence, so it is a bit complicated to do the correct annotations.
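A rough offline equivalent of that query, as a Python sketch. It reads the query as three consecutive tokens "(", "Dzwonek", ")" and assumes the usual vertical layout: one token per line with the word form in the first tab-separated column, and structural lines starting with "<". Both the reading of the query and the sample lines are assumptions, not taken from the actual PL files.

```python
def tokens(vert_lines):
    """Yield word forms, skipping empty lines and structural (tag) lines."""
    for line in vert_lines:
        line = line.rstrip("\n")
        # Caveat: a token that itself starts with "<" would be skipped too.
        if not line or line.startswith("<"):
            continue
        yield line.split("\t")[0]


def count_sequence(vert_lines, pattern=("(", "Dzwonek", ")")):
    """Count positions where `pattern` occurs as consecutive tokens."""
    toks = list(tokens(vert_lines))
    n = len(pattern)
    return sum(
        1 for i in range(len(toks) - n + 1) if tuple(toks[i:i + n]) == pattern
    )


# Hypothetical sample: one mid-sentence "(Dzwonek)" and one proper incident.
sample = [
    "<s>", "Panie\tpan", "(\t(", "<g/>", "Dzwonek\tdzwonek", "<g/>", ")\t)",
    "senatorze\tsenator", "</s>", "[Dzwonek]",
]
print(count_sequence(sample))  # 1
```

The bracketed "[Dzwonek]" token is a single token and therefore does not match the three-token pattern.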

Moving this to milestone 3.1 in the hope that @mrudolf might fix this then. And that we can then get PL re-MTed...

TomazErjavec commented 9 months ago

@mrudolf has not fixed this for 3.1, so moving to "future" milestone.

mrudolf commented 9 months ago

Alas, our proofreaders did not finish correcting that so I haven't rerun the annotation yet. Will there be 3.2?

TomazErjavec commented 9 months ago

> Will there be 3.2?

Who knows... The project is ending now, so, unless there is somehow another, maybe not.

PL: many transcriber comments not annotated #84