ArneBinder opened this issue 4 months ago
Some examples for the node types based on nodeset17940.json
from the training set.
AIFdb visualization for the corresponding argument map: http://www.aifdb.org/argview/17940.
L-node is used for locutions (speaker + what they actually said):
{
"nodeID": "512946",
"text": "Camilla Tominey : that's not something we want",
"type": "L",
"timestamp": "2020-05-28 20:31:10"
},
I-node (information-node) is used with propositions (propositions are "reconstructed locutions, where linguistic features like anaphora, pronouns, and deixis are resolved" see annotations):
{
"nodeID": "512948",
"text": "risking the spread of COVID-19 is not something we want",
"type": "I",
"timestamp": "2020-05-28 20:31:10"
},
YA-nodes connect several types of nodes. Here we have an edge connecting YA-node "512947" with L-node "512946":
{
"nodeID": "512947",
"text": "Asserting",
"type": "YA",
"timestamp": "2020-05-28 20:31:10",
"scheme": "Asserting",
"schemeID": "74"
},
S-nodes connect I-nodes and can have one of the following values for "type": "RA" stands for inference, "CA" for conflict, and "MA" for rephrase. Note that we also have an edge connecting S-node "512950", which carries the type annotation, and the corresponding I-node "512948" (shown above).
{
"nodeID": "512950",
"text": "Default Inference",
"type": "RA",
"timestamp": "2020-05-28 20:31:11",
"scheme": "Default Inference",
"schemeID": "72"
},
Having an edge between the S-node-"512950" (above) and another I-node-"512944" (shown below) means that there is an inference relation between the two propositions: "risking the spread of COVID-19 is not something we want" and "there is a risk to children of perhaps contracting COVID-19 and spreading it to vulnerable adults". One statement supports and provides the reason for another, hence it is annotated as "inference".
{
"nodeID": "512944",
"text": "there is a risk to children of perhaps contracting COVID-19 and spreading it to vulnerable adults",
"type": "I",
"timestamp": "2020-05-28 20:31:10"
},
In general, I and L are content nodes carrying the text of propositions and locutions, respectively, and they are given as a set of nodes at test time. TA nodes are transitions between the L nodes in the dialogue, and they are also always given. The task is to identify the YA and S nodes with their relation annotations, which either connect I and L nodes (YA-nodes) or connect two I nodes (S-nodes).
In the test dataset, the only information provided will be the set of unlinked I-nodes and a set of L-nodes linked by transitions (TA-nodes).
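To make the format above concrete, here is a minimal sketch of parsing a nodeset and grouping its nodes by type. The inline sample mimics the AIF JSON structure shown above (real files such as nodeset17940.json have the same "nodes"/"edges" keys); it is illustrative data, not the full nodeset.

```python
import json
from collections import defaultdict

# A minimal inline sample in the AIF nodeset format described above;
# real nodesets like nodeset17940.json use the same "nodes"/"edges" keys.
nodeset = json.loads("""
{
  "nodes": [
    {"nodeID": "512946", "text": "Camilla Tominey : that's not something we want", "type": "L"},
    {"nodeID": "512948", "text": "risking the spread of COVID-19 is not something we want", "type": "I"},
    {"nodeID": "512947", "text": "Asserting", "type": "YA", "scheme": "Asserting", "schemeID": "74"},
    {"nodeID": "512950", "text": "Default Inference", "type": "RA", "scheme": "Default Inference", "schemeID": "72"}
  ],
  "edges": [
    {"edgeID": "1", "fromID": "512946", "toID": "512947"},
    {"edgeID": "2", "fromID": "512947", "toID": "512948"}
  ]
}
""")

# Group nodes by their "type" field (L, I, YA, RA, ...).
nodes_by_type = defaultdict(list)
for node in nodeset["nodes"]:
    nodes_by_type[node["type"]].append(node)

print({t: len(ns) for t, ns in nodes_by_type.items()})
```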
The task definition as specified in the Shared Task guidelines:
The goal in the DialAM task is to correctly detect illocutionary relations (YA-nodes) and propositional relations (RA-, CA-, and MA-nodes), producing an edited argument map containing these new identified relational nodes together with new edges linking them to the locutions (L-nodes) and the argumentative propositions (I-nodes).
The main goal of the DialAM task is therefore twofold: First, to identify the existing relational nodes (RA-, CA-, MA-nodes) between propositions (I-nodes) and generate the respective edges linking all the information in the argument map. Similarly, the second goal is to identify any existing illocutionary relations (YA-nodes) between locutions (L-nodes) and propositions (I-nodes).
Task Baseline? Transformer-Based Models for Automatic Identification of Argument Relations: A Cross-Domain Evaluation
Open Questions:
* [ ] Do we need to do relation link prediction or just the link classification?
Yes, we need to do both (edge prediction and node type classification). I asked in the Shared Task Slack channel and here is the reply from the organizers:
assumptions:
Yes, this is also my understanding! Additionally, according to the annotation details document, we also need to classify YA relations between TA and S-nodes (TA-Node_i, S-Node_i) as well as between TA and I-nodes. I'm not sure about the TA → I transitions though since I have not seen any examples so far.
EDIT: There are no TA → I transitions in the training data and direct TA → S transitions are very rare. However, TA → YA → S transitions are quite important (see the node2node transition table in the next comment).
Also, regarding the propositional relations, I think we can safely assume that MA and CA only go up (I-Node_j, I-Node_i) and RA can point both up (I-Node_j, I-Node_i) or down (I-Node_i, I-Node_j). At least that's how they specify them in the annotation details document.
I can check the training data and compile some statistics for each of the relation types (e.g., how many times we have each relation and which nodes are involved). Would that be useful?
I'm still not sure whether this is insightful, but here are some statistics based on the node2node transitions from the training set. The table was generated using the count_statistics.py script (format: label-count for each valid transition/edge).
It seems that the most important/common transitions are between the following nodes:
to_node → from_node ↓ | YA | L | TA | I | MA | RA | CA |
---|---|---|---|---|---|---|---|
YA | - | Asserting-420 Analysing-255 PureQuestioning-7 DefaultIllocuting-6 AssertiveQuestioning-5 Arguing-3 Agreeing-2 Restating-1 Challenging-1 | Arguing-1 | Asserting-18780 PureQuestioning-1185 AssertiveQuestioning-239 RhetoricalQuestioning-222 Agreeing-215 NoLabel-160 DefaultIllocuting-136 Challenging-57 Disagreeing-50 Arguing-12 Restating-5 | Restating-4056 NoLabel-1097 DefaultIllocuting-614 Arguing-12 Agreeing-6 Disagreeing-3 Asserting-1 | Arguing-5067 NoLabel-394 DefaultIllocuting-63 Asserting-22 Restating-20 Agreeing-17 PureQuestioning-10 RhetoricalQuestioning-2 AssertiveQuestioning-1 Challenging-1 Disagreeing-1 | Disagreeing-931 NoLabel-234 Challenging-39 Arguing-8 Restating-7 DefaultIllocuting-5 |
L | Asserting-19195 PureQuestioning-1192 Analysing-256 AssertiveQuestioning-244 RhetoricalQuestioning-222 DefaultIllocuting-139 Agreeing-109 Challenging-41 Disagreeing-21 Arguing-15 Restating-6 NoLabel-3 DefaultTransition-2 DefaultRephrase-1 | NoLabel-7 DefaultTransition-2 DefaultInference-2 | DefaultTransition-20173 NoLabel-2857 DefaultRephrase-1 Asserting-1 | Disagreeing-1 DefaultRephrase-1 | DefaultRephrase-6 | DefaultInference-7 | DefaultConflict-3 |
TA | DefaultTransition-11050 NoLabel-1884 | DefaultTransition-20178 NoLabel-2857 | - | - | DefaultTransition-5 | DefaultTransition-1 | DefaultTransition-1 |
I | Asserting-32 PureQuestioning-1 | DefaultTransition-2 DefaultRephrase-1 DefaultIllocuting-1 | - | DefaultConflict-1 | DefaultRephrase-4732 NoLabel-1071 DefaultTransition-1 | DefaultInference-6116 NoLabel-386 Arguing-1 DefaultConflict-1 | DefaultConflict-997 NoLabel-229 Arguing-1 DefaultIllocuting-1 |
MA | - | DefaultRephrase-12 | - | DefaultRephrase-4730 NoLabel-1077 | DefaultRephrase-5 | - | - |
RA | - | DefaultInference-8 | - | DefaultInference-5282 NoLabel-379 | - | - | - |
CA | - | DefaultConflict-4 | - | DefaultConflict-992 NoLabel-230 | - | - | DefaultConflict-1 |
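The counting behind a table like this can be sketched as follows. This is only an illustration of the idea (the actual count_statistics.py may differ): walk the edges, look up the types of both endpoints, and use the scheme of the relation endpoint as the label.

```python
from collections import Counter

# node id -> (type, label), where the label is the scheme/text of the node.
# Tiny illustrative sample; the real script iterates over all nodesets.
nodes = {
    "512946": ("L", "Camilla Tominey : that's not something we want"),
    "512947": ("YA", "Asserting"),
    "512948": ("I", "risking the spread of COVID-19 is not something we want"),
}
edges = [("512946", "512947"), ("512947", "512948")]

RELATION_TYPES = {"YA", "TA", "RA", "CA", "MA"}

# Count transitions as (from_type, to_type, label), where the label is the
# scheme of the target node if it is a relation node, and None otherwise.
counts = Counter()
for from_id, to_id in edges:
    from_type, _ = nodes[from_id]
    to_type, to_label = nodes[to_id]
    label = to_label if to_type in RELATION_TYPES else None
    counts[(from_type, to_type, label)] += 1

print(counts)
```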
Oh, this is very interesting! However, I do not fully understand the column / row sets (`YA`, `L`, `TA`, `I`, `MA`, `RA`, `CA`). I would expect `L`, `I`, `S`, `TA` instead because those are the types of relation arguments, and in the end we would classify these pairs (if I understand it correctly). Or what was the reasoning behind your choice? Maybe I missed something.
Another note: If it is not much effort, can we have the table in markdown? I think pandas dataframes provide a `to_markdown` method. But if that does not work out of the box, I think it is fine to keep it as it is.
However, I do not fully understand the column / row sets (YA, L, TA, I, MA, RA, CA). I would expect L, I, S, TA instead because those are the types of relation arguments, and in the end we would classify these pairs (if I understand it correctly). Or what was the reasoning behind your choice?
Here I just collected all possible/valid transitions and their statistics (including the edges that we don't need to predict). S nodes are basically represented as MA, RA and CA nodes in the data (there are no "S" nodes in the original dataset), and since they have different labels and participate in different transitions, I think it might be useful to keep them in separate rows/columns. We also need YA nodes because we have to predict/annotate them in the following transitions: L → YA → I and TA → YA → S (at least as far as I understand the task).
Another note: If it is not much effort, can we have the table in markdown? I think pandas dataframes provide a to_markdown method.
Sure, no problem! Now we have it in markdown :)
This is a new table with the statistics for the input nodes (computed with this script). We are given L, I and TA nodes as input and need to predict the following transitions (i.e., whether there is a link between the two input nodes and which type/"scheme" should be assigned to it):
YA nodes basically serve as "edge labels" in this task since we don't have any edge labels in the data, only the node labels. S nodes should be predicted based on the I → S → I transitions.
input nodes | L | I | S | TA |
---|---|---|---|---|
L | L → TA → L DefaultTransition: 20206 NoLabel: 2857 L → YA → L Asserting: 420 Analysing: 255 PureQuestioning: 7 DefaultIllocuting: 6 AssertiveQuestioning: 5 Arguing: 3 Agreeing: 2 Restating: 1 Challenging: 1 L → MA → L DefaultRephrase: 2 | L → YA → I Asserting: 18779 PureQuestioning: 1185 AssertiveQuestioning: 239 RhetoricalQuestioning: 222 DefaultIllocuting: 133 Agreeing: 107 Challenging: 40 Disagreeing: 21 Arguing: 12 Restating: 5 NoLabel: 3 L → RA → I DefaultInference: 6 L → MA → I DefaultRephrase: 4 L → CA → I DefaultConflict: 3 | L → TA → S DefaultTransition: 7 | - |
I | I → MA → L DefaultRephrase: 10 I → RA → L DefaultInference: 8 I → CA → L DefaultConflict: 4 | I → RA → I DefaultInference: 6117 NoLabel: 371 I → MA → I DefaultRephrase: 4730 NoLabel: 1053 I → CA → I DefaultConflict: 995 NoLabel: 226 I → YA → I Asserting: 32 PureQuestioning: 1 | - | - |
S | - | - | - | - |
TA | - | TA → YA → I NoLabel: 157 Agreeing: 109 Disagreeing: 29 Challenging: 17 DefaultIllocuting: 3 Asserting: 2 PureQuestioning: 1 | TA → YA → S Arguing: 5090 Restating: 4083 NoLabel: 1725 Disagreeing: 935 DefaultIllocuting: 682 Challenging: 40 Asserting: 23 Agreeing: 23 PureQuestioning: 10 RhetoricalQuestioning: 2 AssertiveQuestioning: 1 TA → MA → S DefaultRephrase: 5 | - |
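Predicting the I → S → I transitions amounts to scoring candidate pairs of input I-nodes. A minimal sketch of the candidate generation (the node IDs are illustrative; the real pipeline would then classify each pair as RA/CA/MA or no relation):

```python
from itertools import permutations

# Every ordered pair of input I-nodes is a potential I -> S -> I transition
# whose relation type (RA/CA/MA, or none) a classifier would have to predict.
i_nodes = ["512944", "512948", "512952"]

candidates = list(permutations(i_nodes, 2))
print(len(candidates))  # 3 * 2 = 6 ordered pairs
```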
I have also created some code to do statistics. I added the code to the same script, but you can just comment out the last lines to bring it back to the previous state. However, it results in the following (edited: new version with relation node types and counts sorted by identifier):
 | I | L | S | TA | YA |
---|---|---|---|---|---|
I | S/DefaultConflict: 1221 S/DefaultInference: 6488 S/DefaultRephrase: 5783 YA/Asserting: 32 YA/PureQuestioning: 1 | S/DefaultConflict: 4 S/DefaultInference: 8 S/DefaultRephrase: 10 | - | - | - |
L | S/DefaultConflict: 3 S/DefaultInference: 6 S/DefaultRephrase: 4 YA/Agreeing: 107 YA/Arguing: 12 YA/Asserting: 18782 YA/AssertiveQuestioning: 239 YA/Challenging: 40 YA/DefaultIllocuting: 133 YA/Disagreeing: 21 YA/PureQuestioning: 1185 YA/Restating: 5 YA/RhetoricalQuestioning: 222 | S/DefaultInference: 1 S/DefaultRephrase: 2 TA/DefaultTransition: 23063 YA/Agreeing: 2 YA/Analysing: 255 YA/Arguing: 3 YA/Asserting: 420 YA/AssertiveQuestioning: 5 YA/Challenging: 1 YA/DefaultIllocuting: 6 YA/PureQuestioning: 7 YA/Restating: 1 | TA/DefaultTransition: 7 | - | TA/DefaultTransition: 12959 |
S | S/DefaultConflict: 1 S/DefaultRephrase: 5 | - | - | - | - |
TA | S/DefaultInference: 1 YA/Agreeing: 209 YA/Asserting: 2 YA/Challenging: 43 YA/DefaultIllocuting: 3 YA/Disagreeing: 60 YA/PureQuestioning: 1 | - | S/DefaultConflict: 1 S/DefaultRephrase: 5 YA/Agreeing: 23 YA/Arguing: 5484 YA/Asserting: 23 YA/AssertiveQuestioning: 1 YA/Challenging: 40 YA/DefaultIllocuting: 1779 YA/Disagreeing: 1169 YA/PureQuestioning: 10 YA/Restating: 4083 YA/RhetoricalQuestioning: 2 | YA/Arguing: 1 | - |
YA | S/DefaultConflict: 1216 S/DefaultInference: 5581 S/DefaultRephrase: 5765 | S/DefaultConflict: 4 S/DefaultInference: 8 S/DefaultRephrase: 11 TA/DefaultTransition: 1 | - | - | TA/DefaultTransition: 1 |
Unfortunately, this seems to be different from the table above, but it shouldn't be... I guess?
You are right, of course! I had the "No Label" labels in the table which doesn't make much sense. I updated the code and now it seems to generate the same numbers. Thanks a lot for checking and implementing another version! I think we should keep your version as a reference :)
input nodes | I | L | S | TA | YA |
---|---|---|---|---|---|
I | I → S → I DefaultInference: 6488 DefaultRephrase: 5783 DefaultConflict: 1221 I → YA → I Asserting: 32 PureQuestioning: 1 | I → S → L DefaultRephrase: 10 DefaultInference: 8 DefaultConflict: 4 | - | - | - |
L | L → YA → I Asserting: 18782 PureQuestioning: 1185 AssertiveQuestioning: 239 RhetoricalQuestioning: 222 DefaultIllocuting: 133 Agreeing: 107 Challenging: 40 Disagreeing: 21 Arguing: 12 Restating: 5 L → S → I DefaultInference: 6 DefaultRephrase: 4 DefaultConflict: 3 | L → TA → L DefaultTransition: 23063 L → YA → L Asserting: 420 Analysing: 255 PureQuestioning: 7 DefaultIllocuting: 6 AssertiveQuestioning: 5 Arguing: 3 Agreeing: 2 Restating: 1 Challenging: 1 L → S → L DefaultRephrase: 2 DefaultInference: 1 | L → TA → S DefaultTransition: 7 | - | L → TA → YA DefaultTransition: 12959 |
S | S → S → I DefaultRephrase: 5 DefaultConflict: 1 | - | - | - | - |
TA | TA → YA → I Agreeing: 209 Disagreeing: 60 Challenging: 43 DefaultIllocuting: 3 Asserting: 2 PureQuestioning: 1 TA → S → I DefaultInference: 1 | - | TA → YA → S Arguing: 5484 Restating: 4083 DefaultIllocuting: 1779 Disagreeing: 1169 Challenging: 40 Asserting: 23 Agreeing: 23 PureQuestioning: 10 RhetoricalQuestioning: 2 AssertiveQuestioning: 1 TA → S → S DefaultRephrase: 5 DefaultConflict: 1 | TA → YA → TA Arguing: 1 | - |
YA | YA → S → I DefaultRephrase: 5765 DefaultInference: 5581 DefaultConflict: 1216 | YA → S → L DefaultRephrase: 11 DefaultInference: 8 DefaultConflict: 4 YA → TA → L DefaultTransition: 1 | - | - | YA → TA → YA DefaultTransition: 1 |
Question 1: What is the meaning of the `timestamp` in the case of each node type (i.e. I, L, TA, YA, S)?
Answer: This question has not yet been answered by the organizers, but my understanding is that L and I nodes are more or less aligned according to their timestamps, which correspond to the original dialogue flow, while the timestamps of the other node types (TA, YA, S) look quite random and may reflect the order in which they were annotated (L and I nodes always have the same date, based on the BBC broadcast, e.g., 2020-11-19, while the other nodes have much later dates, e.g., 2022-06-24).
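One way to inspect this on the training data is to parse the timestamps and collect the dates per node type. A small sketch, assuming the "YYYY-MM-DD HH:MM:SS" format shown in the examples above (the node list is illustrative):

```python
from datetime import datetime

# Illustrative nodes; in practice, iterate over all nodes of a nodeset.
nodes = [
    {"type": "L", "timestamp": "2020-05-28 20:31:10"},
    {"type": "I", "timestamp": "2020-05-28 20:31:10"},
    {"type": "RA", "timestamp": "2022-06-24 09:12:01"},
]

# Group the dates per node type to see which types share the broadcast date
# and which carry later (annotation-time) dates.
dates_by_type = {}
for node in nodes:
    ts = datetime.strptime(node["timestamp"], "%Y-%m-%d %H:%M:%S")
    dates_by_type.setdefault(node["type"], set()).add(ts.date().isoformat())

print(dates_by_type)
```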
Questions 2 & 4: (Q2) It looks like there are disconnected L-nodes, what does this mean? See nodes 599519, 599523, 599527, 599534, 599537 in nodeset25524. (Q4) Why do we have duplicate nodes (861680 and 861681 even have the same timestamp)? Answer: I've just checked and it looks like some of the nodes are duplicated (e.g., 599516 is the same as 599519). The duplicated ones are isolated, and they should not be considered in your analysis, as they will not be considered during evaluation since they are redundant.
Question 3: There are S-nodes that are not linked by any YA node to a TA node, but the IAT guidelines, Sec. 5, Connections with propositional relations, state that "All RAs, CAs and MAs must be anchored through ICs in TAs". See node 1021263 in nodeset25524. Answer: Yes, such cases are possible, e.g., when we have a rephrase between two propositions, and a linked argument between one of these propositions together with a third. See the image below where we have "Default Inference" node between "if one had never..." and "look at the potential side effects..." that is not anchored via TA node.
Question 5: Is it correct to have multiple relation nodes that connect a pair of nodes? Does this mean this is a multi-label relation classification task (instead of just multi-class)? See nodeset25461: Multiple relation nodes (types: {'TA'}) between 720315 (type: L) and 719763 (type: L), or see nodeset25524: Multiple relation nodes (types: {'RA', 'MA'}) between 599605 (type: I) and 599599 (type: I). Answer: The case that you mention is one where there is a rephrase between two propositions, and a linked argument between one of these propositions together with a third (see the image above). Therefore, yes, it is possible to have multiple relation nodes between the same pair of nodes, but it is not as simple as a multi-label classification problem. In the proposed example, the inference relation only happens because there is a third proposition involved; otherwise, there would only have been a rephrase relation between the two propositions.
Note: we can remove all disconnected nodes from the training data! As for multiple relations (Q5), I am not sure how feasible this is from the modelling point of view. Based on our statistics there are not many cases that have multiple relations between the same I nodes. There are plenty of cases with multiple relations between L nodes (multiple TA relations) but I guess we can simply assume that those are given at test time.
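Removing the disconnected nodes is straightforward: a node that never appears as an endpoint of any edge is isolated and can be dropped. A minimal sketch on an illustrative nodeset (not the project's actual cleanup code):

```python
# Illustrative nodeset with one isolated (e.g., duplicated) node.
nodeset = {
    "nodes": [
        {"nodeID": "1", "type": "L"},
        {"nodeID": "2", "type": "I"},
        {"nodeID": "3", "type": "L"},  # isolated duplicate, no edges
    ],
    "edges": [{"fromID": "1", "toID": "2"}],
}

# Keep only nodes that occur as an endpoint of at least one edge.
connected = {e["fromID"] for e in nodeset["edges"]} | {e["toID"] for e in nodeset["edges"]}
nodeset["nodes"] = [n for n in nodeset["nodes"] if n["nodeID"] in connected]

print([n["nodeID"] for n in nodeset["nodes"]])  # ['1', '2']
```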
Unfortunately, we cannot rely on the timestamps of I and L-nodes since they can be arbitrary and just show when the nodes were added to the graph with some external annotation tool (at least that's how I understood the response from the organizers). The only reliable timestamps are those that are specified under `locutions`, and they are available only for some L-nodes (e.g., L-nodes 70682, 706806, 706835 in nodeset 21303 are missing in `locutions`).
I prepared a script to test different "automatic" ways of aligning the I and L-nodes based on embedding similarity, token overlap, etc. It is currently on the branch called `node_alignment_experiments`: align_i2l_nodes.py. Some parts were copied/adapted from the visualize_arg_map.py script :) See here for the results.
PS: We also need to think about the train/dev/test splits since those are not given out-of-the-box. The organizers uploaded some examples for the test data format but they include only three nodesets: http://dialam.arg.tech/res/files/sample_test.zip
TODOs:
* [ ] add reversed relations (`-rev` appended to type) and respective relation edges (swap direction)

Open questions:
* [ ] `NONE` type (see EDIT below). But on what text do we operate? Since we align I with L nodes, we could, in theory, frame the S-node classification as TA-node classification... Pro argument for operating on the propositions: the text may contain better prepared information; pro argument for operating on the locutions: closer to the real-world text that the language model we may use is pretrained on.
* [ ] cleanup_data.py. What is the meaning of such relations? Here we have some statistics for all the transitions that occur in the dataset; in total we have 318 TA → YA → I transitions.

EDIT: Feedback from discussion with Leo:
* the `NONE` class for S nodes is also relevant! There are TA nodes that do not have a respective S node, but we mirror all of the TA nodes as potential S nodes. To create correct training data, we need to create new S nodes with type `NONE` by mirroring the TA nodes that do not already have an S node.
* `RA`-nodes (`default inference`) can have multiple incoming / outgoing edges.

EDIT2: binarizing relations does not seem to be easily possible because it can create multiple relations between the same pair of nodes; see this example that Ramon presented in the Slack channel (look at the two top I-nodes: binarizing would create two relations between them, one with label `default inference` and the other with `default rephrase`):
From the task website:
QT30 corpus: http://corpora.aifdb.org/qt30
Open Questions: