clarin-eric / ParlaMint

ParlaMint: Comparable Parliamentary Corpora
https://clarin-eric.github.io/ParlaMint/

New div types? #472

Closed matyaskopp closed 1 year ago

matyaskopp commented 1 year ago

This should be more of a hotfix for partners that have structured proceedings and don't want to remove any data. (I will extend this later; I am creating the issue now so that I can refer to it.)

TomazErjavec commented 1 year ago

I am against new div types, esp. ones like "table-of-content", because ParlaMint is an instantiation of the more general Parla-CLARIN, which states that "While the [body] will contain the transcription proper (i.e. the speeches), the [front] contains preamble text, and the [back] various appendices or texts that are related to the speeches." Furthermore, the description of the div element there already makes recommendations on the suggested values of the type attribute, but all of them under the assumption that the div contains transcriptions.

So, for the case of #437 and under the assumption that the ToC should not be removed, as I wrote there, we have the following two options:

  1. Have a bunch of ToC related notes (I imagine with @type="toc") at the start of the div (ugly but simple)
  2. Introduce the front element into ParlaMint (beautiful, but requires changing the schema, ODD + guidelines, and maybe (auto) fixes to existing corpora, although I think we can live without the last)

I can do 2. if the feeling about this is strong enough.

ninpnin commented 1 year ago

The ParlaClarin guideline states this:

If used, the values of the type and subtype attributes will depend on the parliamentary rules of the particular country, on the need to distinguish the types of divisions, as well as on the ability to automatically recognise them or the available effort to manually add them.

<body>
  <div> ...
    <div type="representation">
      <head>Representation of members of the Federal Government</head>
      ...
    </div>
    <div type="topical">
      <head>Hour of topical interest</head>
      ...
    </div>
    <div type="request">
      <head>Announcement of an urgent request</head>
      ...
    </div>
  </div>
</body>

In practice, @type=debateSection seems to be used for most ParlaMint corpora. Our data has pre-debate announcements and other generic notes, debates, post-debate announcements, and sometimes announcements and other generic notes between several debate sections. For us it would be easiest just to label those sections as they are, and try to use @type=debateSection wherever there are people actually talking.
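A minimal sketch of that labelling rule, for illustration only: a div that contains at least one u is labelled debateSection, and any other div keeps the label it already has in the source. The function name and the type values preDebateAnnouncements and postDebateAnnouncements are hypothetical; they are not taken from ParlaMint or Parla-CLARIN.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def q(tag):
    """Qualify a local element name with the TEI namespace."""
    return "{%s}%s" % (TEI_NS, tag)


def label_divs(body_xml):
    """Propose a @type for each top-level div of a TEI body.

    A div with at least one <u> anywhere inside it becomes
    "debateSection"; any other div keeps its existing @type
    (or "unlabelled" if it has none).
    """
    body = ET.fromstring(body_xml)
    labels = []
    for div in body.findall(q("div")):
        has_speech = div.find(".//" + q("u")) is not None
        if has_speech:
            labels.append("debateSection")
        else:
            labels.append(div.get("type", "unlabelled"))
    return labels


# Hypothetical sample; the non-debate type values are made up.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="preDebateAnnouncements"><note>...</note></div>
  <div><u who="#A"><seg>...</seg></u></div>
  <div type="postDebateAnnouncements"><note>...</note></div>
</body>"""

print(label_divs(SAMPLE))
```

Keeping the source label for non-speech sections is the point of the proposal; whether such labels would pass the ParlaMint schema is exactly what is being asked.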

Is there a reason not to grant the same lenience here as ParlaClarin grants?

matyaskopp commented 1 year ago

I agree that the ToC shouldn't be in the data, because it can be reconstructed from the data (it carries no additional information).

I was thinking more about different types of div. I have just reviewed ParlaMint-NO (@tungland) (please do not comment on note/@type; it is already reported in #473), and it seems that new div types are needed because there are sections(div)

TomazErjavec commented 1 year ago

OK, what about then having a new type of div, <div type="notes"> which should contain only notes, possibly preceded by <head>?

tungland commented 1 year ago

@matyaskopp What is happening in your second example is that a debate section is recorded as having been held, but no utterances were recorded. Possibly no arguments were made, and they did not bother with the formalities from the speaker.

The context here is that this is based on an official transcript from Norway's "pseudo" lower house, Odelstinget. Norway abolished its pseudo-bicameral system in 2009, but for many years before that, meetings in these bodies were becoming increasingly ceremonial, just moving through empty formalities, with actual debate happening during joint sessions.

matyaskopp commented 1 year ago

OK, what about then having a new type of div, <div type="notes"> which should contain only notes, possibly preceded by <head>?

ok, it is the best we can think of now. @TomazErjavec, probably this is a more consistent type value:

<div type="noteSection">...</div>

TomazErjavec commented 1 year ago

OK, did it. Good idea about "noteSection", but I then chose "commentSection", as we don't really have a guarantee that there will only be <note>s inside: some heuristic might change them to <incident> and similar, so I allow those too. The schema has not been tested much; I hope it works. For the record, the Guidelines have been changed here and here, while the schema for div is now https://github.com/clarin-eric/ParlaMint/blob/ad0a3a78ce8bda4cd6d5bd91bf60dcedf15e690a/Schema/ParlaMint-TEI.rng#L416-L454 Hm, I just noticed I forgot to remove tabs from the schema, sorry!
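To make the intended content model concrete, here is a rough Python sketch of the check, not the RelaxNG rule itself: a commentSection may start with a head and otherwise holds only comment-like elements. The set COMMENT_TAGS is an assumption standing in for "note, incident and similar"; the elements actually allowed are the ones in the linked schema lines. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: stand-in for "<note>, <incident> and similar";
# the authoritative list is whatever the linked schema allows.
COMMENT_TAGS = {"note", "incident", "kinesic", "vocal"}


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def comment_section_problems(div_xml):
    """List content-model problems for one commentSection div."""
    div = ET.fromstring(div_xml)
    problems = []
    for position, child in enumerate(div):
        name = local_name(child.tag)
        if name == "head":
            # A head is only expected before the comments.
            if position != 0:
                problems.append("<head> is not the first child")
        elif name not in COMMENT_TAGS:
            problems.append("unexpected <%s> in commentSection" % name)
    return problems


GOOD = (
    '<div xmlns="http://www.tei-c.org/ns/1.0" type="commentSection">'
    "<head>Announcements</head><note>...</note>"
    "<incident><desc>...</desc></incident></div>"
)
BAD = (
    '<div xmlns="http://www.tei-c.org/ns/1.0" type="commentSection">'
    '<note>...</note><u who="#A"><seg>...</seg></u></div>'
)

print(comment_section_problems(GOOD))
print(comment_section_problems(BAD))
```

A check like this is only a convenience for spotting stray utterances early; validation against the RelaxNG schema remains the reference.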

TomazErjavec commented 1 year ago

Hm, I just noticed I forgot to remove tabs from the schema, sorry!

Removed all tabs in 0f1e1bb (also in Scripts).

matyaskopp commented 1 year ago

@TomazErjavec, I did want to close this issue because I think there is no need (at least for now) to add new div types.

Currently, we support:

but I noticed that schema does not enforce u in debateSection so I am leaving it open.

TomazErjavec commented 1 year ago

I noticed that schema does not enforce u in debateSection so I am leaving it open.

This should be solved in 8103e9e, docu branch.
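For anyone who wants to check their own files independently of the schema, a small Python sketch of the same constraint follows. It assumes that "enforce u" means at least one u as a direct child of each debateSection div; how the commit actually expresses the rule is not shown here, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def q(tag):
    """Qualify a local element name with the TEI namespace."""
    return "{%s}%s" % (TEI_NS, tag)


def debate_sections_without_u(tei_xml):
    """Return document-order indexes of debateSection divs lacking <u>.

    ASSUMPTION: the <u> must be a direct child of the div.
    Indexes count every <div> in the input, not only debate sections.
    """
    root = ET.fromstring(tei_xml)
    missing = []
    for index, div in enumerate(root.iter(q("div"))):
        if div.get("type") == "debateSection" and div.find(q("u")) is None:
            missing.append(index)
    return missing


DOC = """<body xmlns="http://www.tei-c.org/ns/1.0">
  <div type="debateSection"><u who="#A"><seg>...</seg></u></div>
  <div type="debateSection"><note>...</note></div>
  <div type="commentSection"><note>...</note></div>
</body>"""

print(debate_sections_without_u(DOC))
```

In this sample only the second div is reported: it is typed debateSection but holds just a note, which is the case that would more naturally be a commentSection.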

matyaskopp commented 1 year ago

I noticed that schema does not enforce u in debateSection so I am leaving it open.

This should be solved in 8103e9e, docu branch.

seems ok to me, merged to the data branch.

closing