chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

"IndexError: list assignment index out of range" performing `delete_words` in a text #308

Closed jonsnowseven closed 4 years ago

jonsnowseven commented 4 years ago

Hello.

I am having an issue augmenting some text (namely, deleting some random words).

Code to reproduce the error:

import random
import math
from functools import partial
from textacy import make_spacy_doc
from textacy.augmentation.augmenter import Augmenter
from textacy.augmentation.transforms import (
    delete_words,
    insert_word_synonyms,
    substitute_word_synonyms,
    swap_words,
)

random.seed(42)

doc = make_spacy_doc(
    """My name is NAME and I am a NAME NAME with NAME, 
    looking after requirement fulfillment for our clients in the NAME. 
    We provide top skilled resources in NAME/ Non NAME, NAME, NAME, NAME NAME and 
    NAME, NAME, NAME, NAME NAME, and others roles. My company, NAME NAME NAME is a 
    NAME / NAME certified staffing supplier headquartered out of NAME, NAME. 
""".strip(),
    lang="en",
)

tfs = [
    partial(delete_words, num=math.ceil(0.05 * len(doc))),
]
augmenter = Augmenter(tfs, num=None)
augmenter.apply_transforms(doc)

Error:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-11-a8a4acb0dba3> in <module>
     27 ]
     28 augmenter = Augmenter(tfs, num=None)
---> 29 augmenter.apply_transforms(doc)

~/anaconda3/envs/.../lib/python3.6/site-packages/textacy/augmentation/augmenter.py in apply_transforms(self, doc, **kwargs)
    105             else:
    106                 for tf in tfs:
--> 107                     aug_toks = tf(aug_toks)
    108             new_nested_aug_toks.append(aug_toks)
    109         return self._make_new_spacy_doc(new_nested_aug_toks, lang)

~/anaconda3/envs/.../lib/python3.6/site-packages/textacy/augmentation/transforms.py in delete_words(aug_toks, num, pos)
    231                     pos=prev_tok.pos,
    232                     is_word=prev_tok.is_word,
--> 233                     syns=prev_tok.syns,
    234                 )
    235         else:

IndexError: list assignment index out of range

bdewilde commented 4 years ago

Hi @jonsnowseven, thanks for the detailed code example! It looks like you hit a bug that only shows up with an unlucky random draw, in this case the one produced by random.seed(42). Things are fine with random.seed(41), but of course that's not a special number :) I believe I have a fix, and will commit it to the dev branch shortly. I'm looking to publish a new release of textacy sometime next week, so the fix should be "live" soon.