Heidelberg-NLP / COINS

The corresponding code from our paper "COINS: Dynamically Generating COntextualized Inference Rules for Narrative Story Completion" (ACL 2021). Do not hesitate to open an issue if you run into any trouble!

Questions about data formatting #3

Open id4thomas opened 2 years ago

id4thomas commented 2 years ago

Hi, I am currently trying to reproduce your work (specifically COINS GR) and have a few questions about the training data.

From your paper, it seems the training data for the Knowledge Model would be

and

for the Story Model would be

But looking at the part where the data is loaded (https://github.com/Heidelberg-NLP/COINS/blob/main/model/src/data/conceptnet.py), it is confusing which corresponds to which. Also, the data downloaded with the given script doesn't match the format used in the rest of the code.

It would be nice if you could provide a data sample for each of the Knowledge and Story Models, or the model weights if possible.

Thank you

debjitpaul commented 2 years ago

Hi Song,

Sorry for the delayed reply. You are looking at the file for the Story Model. Line 93 reads the input, which builds one tuple of sequence lengths per example:

```python
self.masks[split]["total"] = [(len(i[0]), len(i[1]), len(i[2]), len(i[3]), len(i[4]), len(i[5]), len(i[6]), len(i[7]), len(i[8]), len(i[9]), len(i[10])) for i in sequences[split]]
```
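For what it's worth, here is a runnable sketch of what line 93 appears to intend, with toy stand-in data. Note that the `len([10])` in the repository's line 93 is the length of the literal list `[10]`, i.e. always 1; `len(i[10])` was presumably meant.

```python
# Toy stand-in for sequences[split]: each example i is a tuple of 11
# tokenized sequences (here, lists of length 1..11).
sequences = {"train": [tuple([0] * (j + 1) for j in range(11))]}

split = "train"
masks = {split: {}}
# One length per sequence; the repository's `len([10])` would instead
# always evaluate to 1 for the last slot.
masks[split]["total"] = [tuple(len(i[j]) for j in range(11))
                         for i in sequences[split]]
print(masks[split]["total"][0])  # (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)
```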

During training, i is the following (tab-separated fields, shown one per line here):

```
Incomplete Story (i.e., S1 S2 [SEP] S5) #Effect# S2
Output_Effect_S2
Incomplete Story (i.e., S1 S2 [SEP] S5) #Cause# S5
Output_Cause_S5
Incomplete Story (i.e., S1 S2 [SEP] S5)
Incomplete Story (i.e., S1 S2 [SEP] S5) [SEP] Output_Effect_S2 [SEP] Output_Cause_S5
Output_S3
Incomplete Story (i.e., S1 S2 S3 [SEP] S5) #Effect# S3
Output_Effect_S3
Incomplete Story (i.e., S1 S2 S3 [SEP] S5) #Cause# S5
Output_Cause_S5
Incomplete Story (i.e., S1 S2 S3 [SEP] S5)
Incomplete Story (i.e., S1 S2 S3 [SEP] S5) [SEP] Output_Effect_S3 [SEP] Output_Cause_S5
Output_S4
S2 + '\t' + S1 + ' ' + S2 + '\t' + S5 + '\n'
```
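To make the layout concrete, here is a sketch that assembles one such training line from a toy five-sentence story. The sentences and the "effect"/"cause" strings are made-up placeholders, not real knowledge-model outputs:

```python
# Assemble one Story Model training line per the format described above.
s1, s2, s3, s4, s5 = ("Tom was hungry.", "He went to the kitchen.",
                      "He made a sandwich.", "He ate it quickly.",
                      "Tom felt much better.")
inc1 = f"{s1} {s2} [SEP] {s5}"       # incomplete story before S3 is generated
inc2 = f"{s1} {s2} {s3} [SEP] {s5}"  # incomplete story before S4 is generated
eff2, cau5, eff3 = "effect of S2", "cause of S5", "effect of S3"  # placeholders

line = "\t".join([
    f"{inc1} #Effect# {s2}", eff2,
    f"{inc1} #Cause# {s5}", cau5,
    inc1,
    f"{inc1} [SEP] {eff2} [SEP] {cau5}", s3,
    f"{inc2} #Effect# {s3}", eff3,
    f"{inc2} #Cause# {s5}", cau5,
    inc2,
    f"{inc2} [SEP] {eff3} [SEP] {cau5}", s4,
    s2, f"{s1} {s2}", s5,
]) + "\n"
print(line.count("\t"))  # 16 tabs, i.e. 17 tab-separated fields in total
```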

I hope this answers your question. Feel free to ask if anything is still unclear.
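The two rounds encoded in that line mirror the iterative generate-rules-then-sentence loop from the paper; roughly, it could be consumed like this (a hedged sketch with placeholder model callables, not the repository's code):

```python
def complete_story(knowledge_model, story_model, s1, s2, s5, n_missing=2):
    """Sketch of the alternating loop: in each round, generate
    contextualized inference rules for the current context, then
    condition the story model on them to produce the next sentence."""
    context = [s1, s2]
    generated = []
    for _ in range(n_missing):  # round 1 yields S3, round 2 yields S4
        inc = " ".join(context) + " [SEP] " + s5
        effect = knowledge_model(inc + " #Effect# " + context[-1])
        cause = knowledge_model(inc + " #Cause# " + s5)
        nxt = story_model(inc + " [SEP] " + effect + " [SEP] " + cause)
        generated.append(nxt)
        context.append(nxt)
    return generated
```

With dummy callables (e.g. `lambda prompt: "..."`) this returns the two generated middle sentences in order.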

id4thomas commented 2 years ago

Thank you for the feedback!

However, I still find the given example hard to understand.

Considering both files below

https://github.com/Heidelberg-NLP/COINS/blob/main/model/src/data/conceptnet.py

https://github.com/Heidelberg-NLP/COINS/blob/main/model/src/train/batch.py

in the for loop of the batch_conceptnet_generate function (line 92):

when i==0

when i==1

So does it mean that i1, o1, i3, o3 correspond to

Incomplete Story, Output_Effect_S2/Output_Cause_S5, Incomplete Story, Output_Effect_S3/Output_Cause_S5

and i2,o2, i4, o4 to

Incomplete Story, Output_S3, Incomplete Story, Output_S4?

Also, when splitting the example given at line 94 of conceptnet.py (make_tensors) on '\t', the resulting list would be:

  1. Incomplete Story (i.e., S1 S2 [SEP] S5) #Effect# S2
  2. Output_Effect_S2
  3. Incomplete Story (i.e., S1 S2 [SEP] S5) #Cause# S5
  4. Output_Cause_S5
  5. Incomplete Story (i.e., S1 S2 [SEP] S5)
  6. Incomplete Story (i.e., S1 S2 [SEP] S5) [SEP] Output_Effect_S2 [SEP] Output_Cause_S5
  7. Output_S3
  8. Incomplete Story (i.e., S1 S2 S3 [SEP] S5) #Effect# S3
  9. Output_Effect_S3
  10. Incomplete Story (i.e., S1 S2 S3 [SEP] S5) #Cause# S5
  11. Output_Cause_S5
  12. Incomplete Story (i.e., S1 S2 S3 [SEP] S5)
  13. Incomplete Story (i.e., S1 S2 S3 [SEP] S5) [SEP] Output_Effect_S3 [SEP] Output_Cause_S5
  14. Output_S4
  15. S2 + '\t' + S1 + ' ' + S2 + '\t' + S5 + '\n'

These 15 items don't seem to match the 11 sequences the code expects.
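A quick way to see the mismatch (the field list mirrors the enumeration above; purely illustrative):

```python
# Build a line with the 14 named fields plus the trailing
# "S2 \t S1 S2 \t S5" block, then split on tabs the way a loader would.
example = "\t".join("field%d" % k for k in range(1, 15)) + "\tS2\tS1 S2\tS5\n"
fields = example.strip().split("\t")
print(len(fields))  # 17 tab-separated fields, not the 11 the masks index
```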