Closed. matyaskopp closed this issue 1 year ago.
Dear @matyaskopp, I have compiled a list of all the comments. I can try to classify them, and then we could replace the generic notes with more specific ones. I will send you the classified notes we find in PARLAMINT-ES and then you tell me how to proceed. What do you think? There are some notes that I do not know where to place. For instance:
1) There are incidents that combine people leaving the Chamber with people applauding. How would you classify this? That is, when the incident combines different kinesic and possibly non-kinesic actions.
2) There are notes that refer to documentation that MPs are discussing. How would you do this?
3) There are notes that quantify what is being said. That is, they specify the number of what is being discussed.
4) There are notes that are simple noise.
5) There are notes for abbreviations or organisation referencing.
Let me know what you think while I keep classifying. Best, mc
What I have done is a list with all the comments. I can try to classify them and then we could replace notes with more specific notes. So what I will do is I will send you the classified notes we find in PARLAMINT-ES and then you tell me how to proceed. What do you think?
The best approach is probably to create an ordered list of regexes, each with its classification, e.g.:
/.*aplausos.*/i kinesic applause
/.*pausa.*/i incident pause
/.*risas.*/ kinesic kinesic
@matyaskopp, @charlicruz or @rdelibanoc can then integrate it into the cd2parmamint.xsl script this way:
<note type="comment">...</note>
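To make the "ordered list of regex" idea concrete, here is a minimal sketch in Python. It only assumes what the list above says: the first matching pattern wins, and anything unmatched stays a plain comment note. The names `RULES`, `DEFAULT` and `classify_note` are hypothetical and not part of any script in the repository.

```python
import re

# Ordered rules: the first pattern that matches decides the classification.
# Patterns and labels mirror the three example rules above; note that the
# "risas" rule has no /i flag in the original list, so it stays case-sensitive.
RULES = [
    (re.compile(r"aplausos", re.IGNORECASE), ("kinesic", "applause")),
    (re.compile(r"pausa", re.IGNORECASE), ("incident", "pause")),
    (re.compile(r"risas"), ("kinesic", "kinesic")),
]

# Fallback (assumption): unmatched notes stay <note type="comment">.
DEFAULT = ("note", "comment")


def classify_note(text):
    """Return (element name, type value) for the text of one note."""
    for pattern, label in RULES:
        if pattern.search(text):
            return label
    return DEFAULT


if __name__ == "__main__":
    for sample in ["Aplausos", "Pausa.", "risas y aplausos", "Rumores"]:
        print(sample, "->", classify_note(sample))
```

Because the list is ordered, a mixed note such as "risas y aplausos" is classified by whichever rule comes first, so the order of the rules is itself a classification decision.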
- there are incidents that combine people leaving the Chamber with people applauding: How would you classify this? That is when the incident combines different kinesic and possibly non kinesic actions.
Classify it by the single most frequent action.
- There are notes that refer to documentation that MPs are discussing. How would you do this?
Good point, you can use <note type="comment">...</note>
or we can invent a new one.
- There are notes that quantify what is being said. That is they specify the number of what is being discussed.
I think the default <note type="comment">...</note>
fits
- There are notes that are simple noise
It depends on whether it is a vocal or other noise:
<vocal type="noise"><desc>vocal noise (e.g. Rumores)</desc></vocal>
<kinesic type="noise"><desc>non-vocal noise (e.g. golpe de la puerta <!-- note: I invented this one, it is not in the transcription -->)</desc></kinesic>
- There are notes for abbreviations or organisation referencing.
I think the default <note type="comment">...</note>
fits
Dear Matyas, we are about to finish a basic shell script with Perl oneliners such as this one:
perl -pi -e 's/<note>((.omienza|.mpieza|.inaliza|.ontinúa).+?en.+(euskera|catalán|gallego|bable|valenciano|hebreo).*?)<\/note>/<note type="language-clarification">$1<\/note> /g' *.xml
Shall we send you the whole script as soon as we have it, or do you just need the regexes on their own?
When working on this we realised that we have a couple of questions:
This is it for the time being. best @matyaskopp
@matyaskopp, for notes like this:
<note>risas, rumores y aplausos</note> (laughter, murmuring and applause), can we do something like this:
<kinesic type="mixed">
<desc>risas, rumores y aplausos</desc>
</kinesic>
Best, mc
No new value! The closed list of legal values is here: https://clarin-eric.github.io/ParlaMint/#TEI.kinesic
There are two solutions:
<kinesic type="applause">
<desc>risas, rumores y aplausos</desc>
</kinesic>
or omit the type attribute when the content is mixed:
<kinesic>
<desc>risas, rumores y aplausos</desc>
</kinesic>
And please keep it simple; we want to solve this soon. I suggest classifying the notes that contain the most frequent words and ignoring the rest (leave them as they are) until the next ParlaMint project...
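The two options for mixed notes can be sketched as follows. This is only an illustration: the keyword table is hypothetical and uses just the two values mentioned earlier in this thread, and any real table would have to be checked against the closed list of legal values linked above. The function sets `type` when exactly one known type is found and omits it otherwise.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword table (assumption): only values already used in this
# thread. Check every value against the closed ParlaMint list before use.
KINESIC_KEYWORDS = {"aplausos": "applause", "risas": "kinesic"}


def kinesic_for(text):
    """Build a <kinesic> element for the text of one note."""
    lowered = text.lower()
    types = {t for word, t in KINESIC_KEYWORDS.items() if word in lowered}
    elem = ET.Element("kinesic")
    # Exactly one known type: set it. Mixed or unknown content: omit @type.
    if len(types) == 1:
        elem.set("type", types.pop())
    desc = ET.SubElement(elem, "desc")
    desc.text = text
    return elem


if __name__ == "__main__":
    for sample in ["Aplausos", "risas, rumores y aplausos"]:
        print(ET.tostring(kinesic_for(sample), encoding="unicode"))
```

Picking one type for a mixed note (the first option) would instead need a priority order among the types, which is a decision for the corpus maintainers rather than something this sketch can settle.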
@matyaskopp, DO NOT WORRY. We will finish it by tomorrow at the latest. But please explain the next step. We run our script on ParlaMint and then you change cd2parmamint? What do you need from us to help you add the information to cd2parmamint? Best, mc
@calzada once you have a working script, please add it to the repository and I will integrate it into a makefile
Excellent. Matyas, have I thanked you for your help? You are really great. Best for now, mc
On Fri, 28 Jul 2023 at 15:12, Matyáš Kopp wrote:
@calzada once you have a working script, please add it to the repository and I will integrate it into a makefile
@matyaskopp, @rdelibanoc, @TomazErjavec, @MonicaAlbini, @charlicr
The PARLAMINT-02 directory contains the scripts and all the ParlaMint files, now with refined notes. @matyaskopp, could you check that this is all right? If you add the scripts to the makefile, they should run in this order: note-fixing-script-01.sh, note-fixing-script-02.sh, note-fixing-script-03.sh, note-fixing-script-04.sh and note-fixing-script-05.sh.
The implementation with the inline regex script is left-handed, but it works for now. It can be easily broken with different XML indentations.
What exactly do you mean by left-handed? Is there anything that needs doing? Best, mc
On Tue, 1 Aug 2023 at 23:27, Matyáš Kopp wrote:
The implementation with the inline regex script is left-handed, but it works for now. It can be easily broken with different XML indentations.
Not now. It is working. But it is easily breakable because XML files are not parsed when you process them. You have to be aware of that.
You treat XML files like any other text file, which is not safe. You produced an invalid XML file and did not notice because you are not using an XML parser, so I had to fix your scripts (https://github.com/calzada/PARLAMINT-ES-MC/commit/1bb027709d08a7674184f4029fa6417d88f0f0f8).
eg this line:
https://github.com/calzada/PARLAMINT-ES-MC/commit/1bb027709d08a7674184f4029fa6417d88f0f0f8#diff-d0f22ad3a0f0ba6daf129c5ee75af12e0652dcab24697c06662f0579d72ffdd2L31
where you forgot to add <
You process the XML files line by line, so if someone adds a line break inside a note, your script will not work.
The best solution is to open the XML with an XML library (e.g. https://metacpan.org/pod/XML::LibXML).
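To illustrate why parsing helps, here is a minimal sketch that retags an applause note even when the note spans several lines. It uses Python's standard `xml.etree.ElementTree` as a stand-in for XML::LibXML or XSLT; the sample fragment and the function name `retag_applause` are made up for the example, and ElementTree's default parser discards XML comments, so a real pipeline would likely prefer the tools named in this thread.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace

# Made-up fragment: the note is broken across lines, which would defeat a
# line-by-line regex but makes no difference to a parser.
SAMPLE = """<u xmlns="http://www.tei-c.org/ns/1.0"><seg>Gracias.</seg><note>
    Aplausos
</note></u>"""


def retag_applause(root):
    """Turn text-only applause notes into <kinesic type="applause">."""
    changed = 0
    for note in list(root.iter(f"{{{TEI_NS}}}note")):
        text = " ".join((note.text or "").split())  # normalize whitespace
        if len(note) == 0 and re.search(r"aplausos", text, re.IGNORECASE):
            note.tag = f"{{{TEI_NS}}}kinesic"
            note.set("type", "applause")
            note.text = None
            desc = ET.SubElement(note, f"{{{TEI_NS}}}desc")
            desc.text = text
            changed += 1
    return changed


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    print(retag_applause(root), "note(s) retagged")
    print(ET.tostring(root, encoding="unicode"))
```

Since the serializer writes the tree back out, the result is always well-formed, which a text substitution cannot guarantee.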
I don't want to muddy the waters at this late stage, but I am surprised that Perl one-liners are used for these fixes rather than upgrading cd2parmamint.xsl. There is even a note to this effect there, and it would not be difficult to add more fine-grained distinctions for notes just by checking the contents of the note: https://github.com/calzada/PARLAMINT-ES-MC/blob/131d5d0581f2f164a1e9b5f3030e106160cdb16f/bin/cd2parmamint.xsl#L397-L400
Using XSLT for processing XML is much safer than, as @matyaskopp says, just treating XML as text, where any new line or extra whitespace will cause problems.
I have just spent more than an hour fixing this bug https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d
It is a nice example that a random change in an XML file is a recipe for disaster: https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d#diff-b7186781a58846130d0b0bd0701368fef19dbd9e23d968460cd4ce2922ccf8baL19
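One cheap safeguard against this kind of breakage is to check that every file still parses after a text-based script has touched it. The sketch below is only an assumption about how such a check could look, with a hypothetical helper name; it tests well-formedness only, not validity against the ParlaMint schema.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xml_text):
    """Return None if xml_text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


if __name__ == "__main__":
    samples = [
        '<note type="comment">ok</note>',
        "note>missing opening bracket</note>",  # a forgotten "<"
        "<note type=\u201ccomment\">curly opening quote</note>",
    ]
    for sample in samples:
        print(repr(sample), "->", well_formedness_error(sample))
```

Running a check like this right after each note-fixing script would flag a broken file at the step that produced it, instead of an hour of debugging later.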
Oh. Sorry Matyas. You are ever so right!!! Anything I can do?
On Fri, 4 Aug 2023, 20:59, Matyáš Kopp wrote:
I have just spent more than an hour fixing this bug https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d
It is a nice example that a random change in an XML file is a recipe for disaster: https://github.com/calzada/PARLAMINT-ES-MC/commit/06ff23428f2c096c8e37d10b275aa975c910d23d#diff-b7186781a58846130d0b0bd0701368fef19dbd9e23d968460cd4ce2922ccf8baL19
Currently, all notes are encoded in the same way: https://github.com/calzada/PARLAMINT-ES-MC/blob/7d8412564b9686b396376d33f5cd9befb009f3c4/ParlaMint.sample/ParlaMint-ES_2015-01-20-CD150120.xml#L120
This can be improved by "annotating" them - using the proper elements and attribute values:
see: https://github.com/clarin-eric/ParlaMint/issues/696#issue-1765368729 and the documentation: https://clarin-eric.github.io/ParlaMint/#sec-comments
This implementation can be placed in cd2parmamint.xsl or in a separate script (then Makefile modification is needed)