calzada / PARLAMINT-ES-MC


better transcriber notes encoding #23

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

Currently, all notes are encoded in the same way: https://github.com/calzada/PARLAMINT-ES-MC/blob/7d8412564b9686b396376d33f5cd9befb009f3c4/ParlaMint.sample/ParlaMint-ES_2015-01-20-CD150120.xml#L120

This can be improved by "annotating" them - using the proper elements and attribute values:

<kinesic type="applause">
  <desc>Aplausos</desc>
</kinesic>

see: https://github.com/clarin-eric/ParlaMint/issues/696#issue-1765368729 and the documentation: https://clarin-eric.github.io/ParlaMint/#sec-comments

This implementation can be placed in cd2parmamint.xsl or in a separate script (then Makefile modification is needed)

calzada commented 1 year ago

Dear @matyaskopp , What I have done is a list with all the comments. I can try to classify them and then we could replace notes with more specific notes. So what I will do is I will send you the classified notes we find in PARLAMINT-ES and then you tell me how to proceed. What do you think? There are some notes that I do not know where to place. For instance:

1) There are incidents that combine people leaving the Chamber with people applauding: How would you classify this? That is when the incident combines different kinesic and possibly non kinesic actions.
2) There are notes that refer to documentation that MPs are discussing. How would you do this?
3) There are notes that quantify what is being said. That is they specify the number of what is being discussed.
4) There are notes that are simple noise.
5) There are notes for abbreviations or organisation referencing.

Let me know what you think while I keep classifying. Best mc

matyaskopp commented 1 year ago

What I have done is a list with all the comments. I can try to classify them and then we could replace notes with more specific notes. So what I will do is I will send you the classified notes we find in PARLAMINT-ES and then you tell me how to proceed. What do you think?

The best is probably to create an ordered list of regex, with proper classification, eg:

/.*aplausos.*/i  kinesic  applause
/.*pausa.*/i  incident pause
/.*risas.*/  kinesic kinesic
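
A minimal sketch of how such an ordered rule list could be applied, written in Python rather than in the project's XSLT. The names `RULES` and `classify` are hypothetical, and only the first two rules above are included; the first matching pattern wins, and unmatched notes are left alone.

```python
import re

# Ordered rules: the first pattern that matches wins.
# RULES and classify are illustrative names, not part of the repository.
RULES = [
    (re.compile(r"aplausos", re.IGNORECASE), "kinesic", "applause"),
    (re.compile(r"pausa", re.IGNORECASE), "incident", "pause"),
]


def classify(note_text):
    """Return (element, type) for the first matching rule, or None."""
    for pattern, element, type_value in RULES:
        if pattern.search(note_text):
            return element, type_value
    return None  # no rule matched: leave the note as it is
```

Because the list is ordered, a note such as "Aplausos y pausa" falls under the applause rule, which is one way of enforcing a priority between actions.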

@matyaskopp, @charlicruz or @rdelibanoc can then integrate it into cd2parmamint.xsl script this way:

  • there are incidents that combine people leaving the Chamber with people applauding: How would you classify this? That is when the incident combines different kinesic and possibly non kinesic actions.

Classify it by the single most frequent action.

  • There are notes that refer to documentation that MPs are discussing. How would you do this?

Good point, you can use <note type="comment">...</note> or we can invent a new one.

  • There are notes that quantify what is being said. That is they specify the number of what is being discussed.

I think the default <note type="comment">...</note> fits

  • There are notes that are simple noise

it depends whether it is a vocal or other noise:

<vocal type="noise"><desc>vocal noise (eg Rumores)</desc></vocal>
<kinesic type="noise"><desc>non-vocal noise (eg golpe de la puerta)</desc></kinesic> <!-- note: I invented this example; it is not in the transcription -->

  • There are notes for abbreviations or organisation referencing.

I think the default <note type="comment">...</note> fits

rdelibanoc commented 1 year ago

Dear Matyas, we are about to finish a basic shell script with Perl oneliners such as this one:

perl -pi -e 's/<note>((.omienza|.mpieza|.inaliza|.ontinúa).+?en.+(euskera|catalán|gallego|bable|valenciano|hebreo).*?)<\/note>/<note type="language-clarification">$1<\/note> /g' *.xml

Shall we send you the whole script as soon as we have it, or do you just need regex on their own?

When working on this we realised that we have a couple of questions:

  1. We have notes that mix kinesic and vocal cases. What do we do here? For instance: Aplausos, Rumores y risas (applause, murmuring and laughter)
  2. We're going to use for many different situations. These will need further refinement. But we cannot tackle this now. We will document this, so you're aware of that.

This is it for the time being. best @matyaskopp

calzada commented 1 year ago

@matyaskopp, for notes like this:

<note>risas, rumores y aplausos</note> (laughter, murmuring and applause), can we do something like this:

<kinesic type="mixed">
 <desc>risas, rumores y aplausos</desc>
</kinesic>

Best, mc

matyaskopp commented 1 year ago

No new value! The closed list of legal values is here: https://clarin-eric.github.io/ParlaMint/#TEI.kinesic

there are two solutions:

  1. choose one "action", that classifies whole note:
    <kinesic type="applause">
    <desc>risas, rumores y aplausos</desc>
    </kinesic>
  2. skip the type attribute, when mixed content
    <kinesic>
    <desc>risas, rumores y aplausos</desc>
    </kinesic>
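
Both options can be produced by the same small helper. The following is only a sketch in Python using the standard `xml.etree.ElementTree` module; `make_kinesic` is a hypothetical name, and TEI namespace handling is left out for brevity.

```python
import xml.etree.ElementTree as ET


def make_kinesic(text, type_value=None):
    """Build <kinesic><desc>text</desc></kinesic>; omit @type when type_value is None."""
    kinesic = ET.Element("kinesic")
    if type_value is not None:
        kinesic.set("type", type_value)
    desc = ET.SubElement(kinesic, "desc")
    desc.text = text
    return kinesic


# Option 2: mixed content, no type attribute
print(ET.tostring(make_kinesic("risas, rumores y aplausos"), encoding="unicode"))
# Option 1: one action chosen for the whole note
print(ET.tostring(make_kinesic("risas, rumores y aplausos", "applause"), encoding="unicode"))
```
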
matyaskopp commented 1 year ago

And please, keep it simple, we want to solve this soon. I suggest classifying the notes that contain the most frequent words, leaving the rest as they are, and handling those in the next ParlaMint project...

calzada commented 1 year ago

@matyaskopp, DO NOT WORRY. We will finish it by tomorrow at the latest. But please explain the next step: we run our script on ParlaMint and then you change cd2parmamint? What do you need from us to help you add the information to cd2parmamint? Best, mc

matyaskopp commented 1 year ago

@calzada once you have a working script, please add it to the repository and I will integrate it into the Makefile

calzada commented 1 year ago

Excellent. Matyas, have I thanked you for your help? You are really great. Best for now, mc

On Fri, 28 Jul 2023 at 15:12, Matyáš Kopp wrote:

@calzada if you will have a working script, please add it to the repository and I will integrate it to a makefile


calzada commented 1 year ago

@matyaskopp, @rdelibanoc, @TomazErjavec, @MonicaAlbini, @charlicr

The PARLAMINT-02 directory contains the scripts and all the ParlaMint files, now with refined notes. @matyaskopp, could you check that this is all right? If you add the scripts to the Makefile, they should run in this order: note-fixing-script-01.sh, note-fixing-script-02.sh, note-fixing-script-03.sh, note-fixing-script-04.sh and note-fixing-script-05.sh.

matyaskopp commented 1 year ago

The implementation with the inline regex script is left-handed, but it works for now. It can be easily broken with different XML indentations.

calzada commented 1 year ago

What do you mean exactly by left-handed? Is there anything that needs doing? Best, mc

On Tue, 1 Aug 2023 at 23:27, Matyáš Kopp wrote:

The implementation with the inline regex script is left-handed, but it works for now. It can be easily broken with different XML indentations.


matyaskopp commented 1 year ago

Not now. It is working. But it is easily breakable because XML files are not parsed when you process them. You have to be aware of that.

You treat XML files like any other text file, which is not safe. You produced an invalid XML file and did not notice, because you are not using an XML parser, so I had to fix your scripts (https://github.com/calzada/PARLAMINT-ES-MC/commit/1bb027709d08a7674184f4029fa6417d88f0f0f8). See, for example, this line, where you forgot to add <: https://github.com/calzada/PARLAMINT-ES-MC/commit/1bb027709d08a7674184f4029fa6417d88f0f0f8#diff-d0f22ad3a0f0ba6daf129c5ee75af12e0652dcab24697c06662f0579d72ffdd2L31
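
A cheap safeguard is to parse every output file after the scripts have run, so that a missing < is caught immediately. This is only a sketch, assuming Python and its standard `xml.etree.ElementTree` module; `malformed_files` is a hypothetical helper. It checks well-formedness only, not validity against the ParlaMint schema.

```python
import sys
import xml.etree.ElementTree as ET


def malformed_files(paths):
    """Return (path, error message) for every file that is not well-formed XML."""
    problems = []
    for path in paths:
        try:
            ET.parse(path)
        except ET.ParseError as err:
            problems.append((path, str(err)))
    return problems


if __name__ == "__main__":
    # Usage (hypothetical script name): python check_xml.py *.xml
    for path, message in malformed_files(sys.argv[1:]):
        print(f"{path}: {message}")
```
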

You process the XML files line by line, so if someone adds a line break inside a note, your script will not work.

The best solution is to open XML using some XML library (eg https://metacpan.org/pod/XML::LibXML).
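
To illustrate the parser-based approach, here is a minimal sketch. It uses Python's standard `xml.etree.ElementTree` instead of XML::LibXML, handles only the applause case, and assumes the notes live in the TEI namespace; `annotate_applause` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialise TEI as the default namespace

APPLAUSE = re.compile(r"aplausos", re.IGNORECASE)


def annotate_applause(root):
    """Turn <note> elements mentioning applause into <kinesic type="applause">, in place."""
    changed = 0
    for note in list(root.iter(f"{{{TEI_NS}}}note")):
        if len(note) or not APPLAUSE.search(note.text or ""):
            continue  # skip notes with child elements or without a match
        text = " ".join(note.text.split())  # collapse line breaks and indentation
        note.tag = f"{{{TEI_NS}}}kinesic"
        note.set("type", "applause")
        note.text = None
        desc = ET.SubElement(note, f"{{{TEI_NS}}}desc")
        desc.text = text
        changed += 1
    return changed


# A note broken across lines, which a line-by-line regex would miss
sample = f'<u xmlns="{TEI_NS}"><note>\n   (Aplausos)\n</note></u>'
root = ET.fromstring(sample)
annotate_applause(root)
print(ET.tostring(root, encoding="unicode"))
```

Because the parser works on elements rather than lines, indentation and line breaks inside the note no longer matter.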

TomazErjavec commented 1 year ago

I don't want to muddy the waters at this late stage, but I am surprised that Perl one-liners are used for these fixes rather than upgrading cd2parmamint.xsl. The stylesheet even contains a note to this effect, and it would not be difficult to add more fine-grained distinctions for notes there, just by checking the contents of each note: https://github.com/calzada/PARLAMINT-ES-MC/blob/131d5d0581f2f164a1e9b5f3030e106160cdb16f/bin/cd2parmamint.xsl#L397-L400

Using XSLT for processing XML is much safer than, as @matyaskopp says, just treating XML as text, where any new line or extra whitespace will cause problems.

matyaskopp commented 1 year ago

I have just spent more than an hour fixing this bug https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d

It is a nice example that a random change in an XML file is a recipe for disaster: https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d#diff-b7186781a58846130d0b0bd0701368fef19dbd9e23d968460cd4ce2922ccf8baL19

calzada commented 1 year ago

Oh. Sorry Matyas. You are ever so right!!! Anything I can do?

On Fri, 4 Aug 2023, 20:59, Matyáš Kopp wrote:

I have just spent more than an hour fixing this bug https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d

It is a nice example that a random change in an XML file is a recipe for disaster: https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d#diff-b7186781a58846130d0b0bd0701368fef19dbd9e23d968460cd4ce2922ccf8baL19
