chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io
2.19k stars · 246 forks

Context for ngrams? #171

Open DTchebotarev opened 6 years ago

DTchebotarev commented 6 years ago

Is it possible to add context to ngram extraction?

For example, currently running

list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3, as_strings=True))

returns a list

['-PRON- like green', 'like green egg', 'egg and ham']

But I would ideally like to have the option to specify something like

list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3, as_strings=True, left_pad=True, right_pad=True))

and have it return something along the lines of

['<s2> <s1> -PRON-', '<s1> -PRON- like', '-PRON- like green', 'like green egg', 'egg and ham', 'and ham </s1>', 'ham </s1> </s2>']

I don't think this is possible in textacy currently, so I guess this is a feature request.

Also any ideas for a workaround are greatly appreciated :)
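One plain-Python workaround sketch, assuming you pull token strings out of the doc yourself first (the `padded_ngrams` helper below is hypothetical, not a textacy API):

```python
def padded_ngrams(tokens, n, left_pad=True, right_pad=True):
    """Yield space-joined n-grams over ``tokens`` with boundary markers.

    Pads with n-1 markers on each side ('<s1>', '<s2>', ... on the left,
    '</s1>', '</s2>', ... on the right), so text-initial and text-final
    n-grams carry their position as context.
    """
    left = [f"<s{i}>" for i in range(n - 1, 0, -1)] if left_pad else []
    right = [f"</s{i}>" for i in range(1, n)] if right_pad else []
    padded = left + list(tokens) + right
    # slide a window of width n over the padded token list
    for i in range(len(padded) - n + 1):
        yield " ".join(padded[i : i + n])

tokens = ["-PRON-", "like", "green", "egg", "and", "ham"]
for ng in padded_ngrams(tokens, 3):
    print(ng)  # first: '<s2> <s1> -PRON-', last: 'ham </s1> </s2>'
```

Note this sliding window also emits interior trigrams such as `'green egg and'`; the padding only changes what happens at the boundaries.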

bdewilde commented 6 years ago

Hi @DTchebotarev , this is not currently a feature, but I appreciate that padding sequences is a common task in deep learning. I've been dragging my feet on getting DL models into textacy, but when I do, I'd expect to include useful adjacent functionality like this as well.

jnothman commented 5 years ago

Padding sequences is common even outside deep learning. It gives an n-gram more context (e.g. it marks the n-gram as text-initial).

bdewilde commented 5 years ago

I recently implemented something like this in a keyterm extraction algorithm: https://github.com/chartbeat-labs/textacy/blob/794be5960b1126b5d183a0c8fc9f05c4fc004748/textacy/keyterms.py#L247-L251

Unlike extract.ngrams(), this method produces Tuple[Token] rather than Span objects, so it doesn't work in the context of to_terms_list(). But maybe it's helpful.
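To illustrate the shape of that output (this is a sketch of the idea, not the actual keyterms code): padding with a sentinel and windowing naturally yields tuples of tokens rather than `Span` objects, since a pad position has no corresponding slice of the doc.

```python
def padded_token_ngrams(tokens, n, pad=None):
    """Return n-grams over ``tokens`` as tuples, padded with a sentinel.

    Each side gets n-1 copies of ``pad`` standing in for out-of-text
    positions; boundary n-grams therefore contain the sentinel, which is
    why the result cannot be expressed as spaCy Span objects.
    """
    padded = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
    return [tuple(padded[i : i + n]) for i in range(len(padded) - n + 1)]

print(padded_token_ngrams(["egg", "and", "ham"], 2))
# [(None, 'egg'), ('egg', 'and'), ('and', 'ham'), ('ham', None)]
```

Mapping such tuples back into `to_terms_list()` would require either stringifying them or teaching that function to accept non-`Span` terms.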