debjitpaul / Multi-Hop-Knowledge-Paths-Human-Needs

Ranking and Selecting Multi-Hop Knowledge Paths to Better Predict Human Needs (NAACL 2019)
https://www.aclweb.org/anthology/N19-1368

About data #3

Closed victorup closed 4 years ago

victorup commented 5 years ago

Hello, how can I get the "inputfile" as mentioned in the Input Sample? And is "graphpath" the "concept_graph_full" from the previous step? How can I get "train_data.txt", "dev_data.txt" and "test_data.txt"? Thank you!

debjitpaul commented 5 years ago

Hi, the input file is generated from the csv file that you can download from https://uwnlp.github.io/storycommonsense/, as mentioned in the Readme.md. However, the code that extracts the concepts per sentence is internal and we don't make it public. It should be as simple as n-gram token matching, with and without lemmatization.

Yes, graphpath is concept_graph_full.

Best, Debjit
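A minimal sketch of such n-gram matching (not the authors' internal code), assuming the ConceptNet vocabulary is available as a plain Python set of concept strings with '_' joining multi-word concepts; set membership keeps each lookup O(1), so even ~1.5 million concepts are fast to search:

```python
import re
from nltk.stem import WordNetLemmatizer  # requires: nltk.download('wordnet')

def extract_concepts(sentence, concept_vocab, max_n=3, lemmatize=True):
    """Return all n-grams (n <= max_n) of `sentence` that occur in
    `concept_vocab`, a set of concept strings (multi-word concepts
    joined with '_', as in ConceptNet)."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    found = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in concept_vocab:  # O(1) set lookup
                found.add(candidate)
    return sorted(found)

vocab = {"fish", "curry", "boyfriend", "peer_group"}
print(extract_concepts("I began making fish curry for my boyfriend.", vocab))
# ['boyfriend', 'curry', 'fish']
```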

victorup commented 5 years ago

Thank you for your answer! Are only the dev and test sets available in the csv file? I found that just dev and test have the Plutchik and Reiss annotations; in that case I get about 5,000 instances, so I don't know how the numbers in "Table 1: Dataset Statistics" in your paper were counted. Also, in your paper, do you only predict Reiss, not Plutchik?

debjitpaul commented 5 years ago

Hi,

I predict both. Please check out the data prep folder. If you run "read_pluctik_emotion.py" and "read_human_needs_reiss.py" with the filepath (csv) as input, you should get the exact data statistics.

Yes, there are dev and test files. In the paper we mention that we use 80% of the dev set as training and the remaining 20% as dev.

Thanks
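A minimal sketch of that 80/20 split, assuming one instance per line; the file names are illustrative, not the repository's:

```python
import random

# Split the dev file 80/20 into train and dev portions, as described above.
# A fixed seed keeps the split reproducible across runs.
with open("dev_data.txt") as f:
    lines = f.readlines()
random.seed(0)
random.shuffle(lines)
cut = int(0.8 * len(lines))
with open("train_data.txt", "w") as f:
    f.writelines(lines[:cut])
with open("dev_split.txt", "w") as f:
    f.writelines(lines[cut:])
```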

victorup commented 5 years ago

Hello Debjit, I want to know how I should extract the concepts for each sentence using n-gram matching, to get the "input file" like the "input sample". I have got "concept_graph_full", but I find that it has 1.5 million nodes. That is too large to search, so I want to know how to match the concepts quickly. I use the NGram module. Thank you!

victorup commented 5 years ago

Hello, when I run "make_sub_graph_server.py", constructing the subgraph for one sentence takes 5 minutes. Is this normal?

debjitpaul commented 5 years ago

Hi,

How many concepts are you considering per instance? Do you only consider the concepts found in each sentence, or the sentence plus the directly preceding sentence, or the sentence plus the whole preceding context?

As mentioned in our paper, we consider the concepts found in the sentence plus the directly preceding sentence in the context.

Best, Debjit
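As a small sketch of that windowing, assuming concepts have already been extracted per sentence, the instance-level concept set is just the union over a two-sentence window:

```python
def concepts_for_instance(concepts_per_sentence, i):
    """Concepts for sentence i: its own concepts plus those of the
    directly preceding sentence, deduplicated."""
    window = concepts_per_sentence[max(0, i - 1):i + 1]
    return sorted({c for sent in window for c in sent})

story = [["fish", "curry", "boyfriend"],
         ["recipe", "curry", "life", "fish", "boyfriend", "read"]]
print(concepts_for_instance(story, 1))
# ['boyfriend', 'curry', 'fish', 'life', 'read', 'recipe']
```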

victorup commented 5 years ago

Hello, yes, I just consider the sentence plus the directly preceding sentence in the context. I found that "concept_graph_full" has 1.5 million "name" nodes. Is that right? I worry it is too large to extract paths from.

debjitpaul commented 5 years ago

Hi,

Concept_graph_full is the full ConceptNet as a graph. As a next step, we propose to construct an induced subgraph. I am curious how many concepts on average you are extracting for each instance. For example,

a2ddbb50-e45b-4ad3-becf-b2d8475172bf__sent1 food I began making fish curry for my boyfriend and I. ['make', 'begin', 'fish', 'curry', 'boyfriend', 'friend']

Here, 6 concepts were found.

It might take up to 5 minutes per instance (it took me 1-2 minutes). To speed this up, split the dataset and run the parts in parallel; then you will have all the subgraphs within a few hours.
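A sketch of both steps, assuming python-igraph and that concept_graph_full was saved with igraph's pickle writer; the 1-hop neighbourhood expansion and all names here are illustrative choices, not the repository's exact code:

```python
from multiprocessing import Pool
import igraph as ig

g = ig.Graph.Read_Pickle("concept_graph_full")  # full ConceptNet as a graph (path assumed)

def build_subgraph(concepts):
    """Induced subgraph over an instance's concepts plus their direct
    out-neighbours (the 1-hop expansion is an illustrative choice)."""
    seeds = [v.index for v in g.vs.select(name_in=set(concepts))]
    region = set(seeds)
    for s in seeds:
        region.update(g.neighbors(s, mode="out"))
    return g.induced_subgraph(sorted(region))

if __name__ == "__main__":
    instances = [["fish", "curry", "boyfriend"],
                 ["recipe", "curry", "life"]]
    # One subgraph per instance, built in parallel across CPU cores.
    with Pool() as pool:
        subgraphs = pool.map(build_subgraph, instances)
```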

victorup commented 5 years ago

Thank you! I deleted the stopwords before extracting the concepts, so I have fewer concepts, like these:

a2ddbb50-e45b-4ad3-becf-b2d8475172bf_sen1 ['fish', 'curry', 'boyfriend']
a2ddbb50-e45b-4ad3-becf-b2d8475172bf_sen2 ['recipe', 'decided', 'curry', 'life', 'fish', 'boyfriend', 'read']
a2ddbb50-e45b-4ad3-becf-b2d8475172bf_sen3 ['recipe', 'sit', 'curry', 'decided', 'life', 'tasting', 'read']
a2ddbb50-e45b-4ad3-becf-b2d8475172bf_sen4 ['tasting', 'time', 'disgusted', 'sit', 'curry']
a2ddbb50-e45b-4ad3-becf-b2d8475172bf_sen5 ['garlic', 'time', 'onion', 'disgusted', 'accidentally']
7148ffc4-0d04-4d9f-9286-ae7ae61a813a_sen1 ['time', 'single']

victorup commented 5 years ago

Hello, do you have a sample of "train_data.txt"? I don't know how you predict when there is more than one person in a sentence. Do you predict them separately? Thanks!

debjitpaul commented 5 years ago

Yes. I count them as separate instances.

victorup commented 5 years ago

I see you concatenate the character to the sentence as input, but how do you express or encode the character?

debjitpaul commented 5 years ago

I attach the character to the sentence, separated by a '#' tag. Please check src/neural_model/experiment.py: sentences.append(char + '#' + sent)

debjitpaul commented 5 years ago

Did you extract the paths?

victorup commented 5 years ago

Yes, my result looks like this:

432baa70-320b-42e9-93e6-74ae239ccf19_sen2 ['extra RelatedTo interest RelatedTo peer_group RelatedTo status', 'extra RelatedTo additional RelatedTo those_that_have_get RelatedTo status',...]

I have seen "experiment.py", but I don't know how to encode the character. For example, for "Jervis" from a sentence, how do you embed this character? Is 'Jervis' in the dictionary?

victorup commented 5 years ago

Hello, I know you attach the character to the sentence with a '#' tag. But I still don't know how to encode the character. For example, for "Jervis" from a sentence, how do you embed it? Is 'Jervis' in the dictionary? Thank you!

debjitpaul commented 5 years ago

Hi,

You encode it along with the sentence, using the sentence encoder. The system should learn to distinguish the character from the sentence, as they are separated by the # tag. But you can also replace it with PersonX or PersonY.

Best, Debjit
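A minimal sketch of both options; the '#' format matches the experiment.py line quoted above, while the PersonX substitution is an illustrative assumption:

```python
def encode_input(char, sent, anonymize=False):
    """Build the model input for one (character, sentence) pair.
    anonymize=False mirrors the repo's char + '#' + sent format;
    anonymize=True replaces the character mention with 'PersonX'."""
    if anonymize:
        return "PersonX#" + sent.replace(char, "PersonX")
    return char + "#" + sent

print(encode_input("Jervis", "Jervis began making fish curry."))
# Jervis#Jervis began making fish curry.
print(encode_input("Jervis", "Jervis began making fish curry.", anonymize=True))
# PersonX#PersonX began making fish curry.
```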

victorup commented 4 years ago

Hello, do you mean replacing the characters in each sentence with PersonX or PersonY? Then PersonX or PersonY would have the same word vector in different sentences. How could the model distinguish the characters in different sentences?

debjitpaul commented 4 years ago

Yes, the same word embeddings. You would expect the neural network to learn to distinguish the characters. You can also use POS or NER tags as a feature.
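One way to obtain such tags, sketched with spaCy; spaCy is an assumption here, not something the repository uses:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # small English model; must be installed

doc = nlp("Jervis began making fish curry for his boyfriend.")
# One (token, POS tag, NER tag) triple per token; PERSON entity tags
# are one way to mark character mentions as a feature.
features = [(tok.text, tok.pos_, tok.ent_type_ or "O") for tok in doc]
print(features)
```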

victorup commented 4 years ago

Thank you! I also find that the code s = g.get_all_shortest_paths(concepts[i], vertex[j], mode='OUT') in "make_sub_graph.py" is too slow. It sometimes takes 30s per concept, and finding all paths sometimes needs more than 1000s per sentence. Is there any method to get the paths more quickly?
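Not the repository's code, but one standard igraph speed-up: get_all_shortest_paths accepts a list of targets via its `to` argument, so a single call per source concept shares one traversal across all targets instead of repeating it for every (source, target) pair:

```python
import igraph as ig

def shortest_paths_from(g: ig.Graph, source, targets):
    """All shortest paths from `source` to every vertex in `targets`,
    computed in one call instead of one get_all_shortest_paths call
    per target as in the loop from make_sub_graph.py."""
    return g.get_all_shortest_paths(source, to=targets, mode="out")
```

Running this over the (much smaller) induced subgraph discussed earlier, rather than over the full 1.5M-node graph, should also help.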

victorup commented 4 years ago

Hello, could you send me your generated subgraphs or the extracted paths per sentence? My generation is very slow. Thank you very much!