Closed by GHLgh 7 years ago
How about we change it slightly, in this way? In the example, retrieve a random document (say, this one: https://github.com/ryanmcdermott/trump-speeches/blob/master/speeches.txt ). Then count all the verbs (POS = VB, VBD, VBG, VBN, VBP, VBZ) that occur "immediately after" a person (NER = PER). (By "immediately after" I mean: after the person mention, in the same sentence, within a window of 3 words.)
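Concretely, the counting could be sketched like this, in pure Python over plain tag lists (no particular pipeline API is assumed; the function name and data shapes are illustrative):

```python
from collections import Counter

# Penn Treebank verb tags
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def count_verbs_after_person(pos_tags, ner_spans, window=3):
    """Count verbs occurring within `window` tokens after a PER span.

    pos_tags:  list of (token, pos) pairs for one sentence
    ner_spans: list of (start, end, label) token spans for that sentence
    """
    counts = Counter()
    for start, end, label in ner_spans:
        if label != "PER":
            continue
        # look only at the `window` tokens immediately following the span
        for token, pos in pos_tags[end:end + window]:
            if pos in VERB_TAGS:
                counts[token] += 1
    return counts

# toy sentence: "Donald Trump said he will win ."
pos = [("Donald", "NNP"), ("Trump", "NNP"), ("said", "VBD"),
       ("he", "PRP"), ("will", "MD"), ("win", "VB"), (".", ".")]
ner = [(0, 2, "PER")]
print(count_verbs_after_person(pos, ner))  # Counter({'said': 1})
```

Note that "win" is not counted here: it sits 4 tokens after the PER span, just outside the 3-token window.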
What do you think?
It's doable; I can try that.
When you said 3 words, did you mean 3 tokens? I ask because punctuation marks are also counted as tokens, right?
Yeah tokens should be fine.
BTW, we shouldn't send everything to the pipeline all at once. We can split on newlines and tabs before sending it to the pipeline.
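One way to do that split, as a sketch (the function name is made up, not part of sioux):

```python
import re

def split_for_pipeline(raw_text):
    """Split raw text on newlines and tabs, dropping empty pieces,
    so each piece can be sent to the pipeline separately."""
    return [piece.strip()
            for piece in re.split(r"[\n\t]+", raw_text)
            if piece.strip()]

text = "First speech paragraph.\n\nSecond paragraph.\tAside note."
print(split_for_pipeline(text))
# ['First speech paragraph.', 'Second paragraph.', 'Aside note.']
```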
Also, as a general comment, the current usage is not easy. We should make it easier to access neighboring tokens somehow. https://github.com/CogComp/sioux/pull/48/files#diff-dc8b50acc65729bc37a3b573f4ab541eR31
Also, being able to iterate over a view would be useful, IMO:
for ner_token in pipeline.get_ner(doc):
    print(ner_token['label'])
Good idea, I can make the class an iterator; then we can get rid of some_view_class.get_cons().
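Making the class an iterator could look roughly like this (a sketch only; the class and field names are illustrative, not the actual sioux API):

```python
class View:
    """Minimal sketch of a view class made iterable."""

    def __init__(self, constituents):
        self._constituents = constituents

    def __iter__(self):
        # yields constituents directly, so callers no longer need get_cons()
        return iter(self._constituents)

    def __getitem__(self, index):
        # find a constituent by index, as discussed above
        return self._constituents[index]

ner_view = View([{"label": "PER", "tokens": "Donald Trump"},
                 {"label": "LOC", "tokens": "New York"}])
for con in ner_view:
    print(con["label"])
# PER
# LOC
```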
@bhargav how would you want accessing neighboring tokens to work? If we can iterate over the view and find a constituent by index, would that be sufficient?
I can make the usage simpler by adding the corresponding tokens to the constituent (then ner_con['tokens'] would hold the tokens of that constituent). Right now we have to do some_view.get_cons(key='token')[constituent_index].
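The proposed change could be sketched like this (the names and data shapes are illustrative, not the actual sioux data model):

```python
# A constituent as a plain dict with token-offset boundaries.
constituent = {"label": "PER", "start": 0, "end": 2}
tokens = ["Donald", "Trump", "said", "he", "will", "win", "."]

# Proposed: store the constituent's own tokens on the constituent itself,
# so callers can read ner_con['tokens'] instead of indexing into
# some_view.get_cons(key='token').
constituent["tokens"] = tokens[constituent["start"]:constituent["end"]]
print(constituent["tokens"])  # ['Donald', 'Trump']
```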
@danyaljj Example for the first bullet point in #44
We can close this PR after the example is put in an IPython notebook.