Open · DTchebotarev opened 6 years ago
Hi @DTchebotarev, this is not currently a feature, but I appreciate that padding sequences is a common task in deep learning. I've been dragging my feet on getting DL models into textacy, but when I do, I'd expect to include useful adjacent functionality like this as well.
Padding sequences is common even outside of deep learning: it gives an n-gram more context (e.g. by marking that it is text-initial).
I recently implemented something like this in a keyterm extraction algorithm: https://github.com/chartbeat-labs/textacy/blob/794be5960b1126b5d183a0c8fc9f05c4fc004748/textacy/keyterms.py#L247-L251
Unlike `extract.ngrams()`, this method produces `Tuple[Token]` rather than `Span` objects, so it doesn't work in the context of `to_terms_list()`. But maybe it's helpful.
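The pad-then-window idea behind that linked code can be sketched in plain Python (a minimal illustration only, not textacy's actual implementation: the `<s>` marker and both helper names are made up here, and the real code works on spaCy `Token` objects rather than strings):

```python
from itertools import islice

def pad_sequence(tokens, n, pad="<s>"):
    """Pad a token sequence with n-1 sentinel markers on each side,
    so that windowed n-grams also cover the sequence boundaries."""
    padding = [pad] * (n - 1)
    return padding + list(tokens) + padding

def ngrams(tokens, n):
    """Yield every contiguous n-token window as a tuple."""
    return zip(*(islice(tokens, i, None) for i in range(n)))

padded = pad_sequence(["I", "like", "ham"], n=2)
print(list(ngrams(padded, 2)))
# → [('<s>', 'I'), ('I', 'like'), ('like', 'ham'), ('ham', '<s>')]
```

The bigrams `('<s>', 'I')` and `('ham', '<s>')` are exactly the extra context-carrying n-grams that unpadded extraction drops.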
Is it possible to add context to ngram extraction?
For example, currently running

`list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3, as_strings=True))`

returns the list

`['-PRON- like green', 'like green egg', 'egg and ham']`
But I would ideally like to have the option to specify something like

`list(textacy.Doc('I like green eggs and ham.').to_terms_list(ngrams=3, as_strings=True, left_pad=True, right_pad=True))`

and have it return something along the lines of

`['<s2> <s1> -PRON-', '<s1> -PRON- like', '-PRON- like green', 'like green egg', 'egg and ham', 'and ham </s1>', 'ham </s1> </s2>']`
I don't think this is currently possible in textacy, so I guess this is a feature request. Any ideas for a workaround would also be greatly appreciated :)