explosion / spaCy

šŸ’« Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

spacy.Corpus.v1 never shuffles the data #9171

Closed · RajK853 closed this issue 3 years ago

RajK853 commented 3 years ago

While going through the source code of spacy.Corpus.v1, I noticed that the reference docs are never shuffled.

https://github.com/explosion/spaCy/blob/master/spacy/training/corpus.py

The registration part of Corpus:

@util.registry.readers("spacy.Corpus.v1")
def create_docbin_reader(
    path: Optional[Path],
    gold_preproc: bool,
    max_length: int = 0,
    limit: int = 0,
    augmenter: Optional[Callable] = None,
) -> Callable[["Language"], Iterable[Example]]:
    if path is None:
        raise ValueError(Errors.E913)
    util.logger.debug(f"Loading corpus from path: {path}")
    return Corpus(
        path,
        gold_preproc=gold_preproc,
        max_length=max_length,
        limit=limit,
        augmenter=augmenter,
    )
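
For context, the factory registered under spacy.Corpus.v1 is what the config system resolves and calls to build the Corpus. A minimal sketch of that round trip (the path is a placeholder for a .spacy training file):

from pathlib import Path

import spacy
from spacy import util

# Resolve the registered reader factory; this returns create_docbin_reader.
reader_factory = util.registry.readers.get("spacy.Corpus.v1")

# Calling the factory builds a Corpus. Only the parameters in the signature
# above can be passed here; there is no shuffle argument to forward.
corpus = reader_factory(
    path=Path("corpus/train.spacy"),  # placeholder path
    gold_preproc=False,
)

# The Corpus is itself callable and lazily yields Example objects.
nlp = spacy.blank("en")
examples = corpus(nlp)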

Default value of shuffle in the Corpus class:

class Corpus:
    """Iterate Example objects from a file or directory of DocBin (.spacy)
    formatted data files.
    path (Path): The directory or filename to read from.
    gold_preproc (bool): Whether to set up the Example object with gold-standard
        sentences and tokens for the predictions. Gold preprocessing helps
        the annotations align to the tokenization, and may result in sequences
        of more consistent length. However, it may reduce run-time accuracy due
        to train/test skew. Defaults to False.
    max_length (int): Maximum document length. Longer documents will be
        split into sentences, if sentence boundaries are available. Defaults to
        0, which indicates no limit.
    limit (int): Limit corpus to a subset of examples, e.g. for debugging.
        Defaults to 0, which indicates no limit.
    augment (Callable[Example, Iterable[Example]]): Optional data augmentation
        function, to extrapolate additional examples from your annotations.
    shuffle (bool): Whether to shuffle the examples.
    DOCS: https://spacy.io/api/corpus
    """

    def __init__(
        self,
        path: Union[str, Path],
        *,
        limit: int = 0,
        gold_preproc: bool = False,
        max_length: int = 0,
        augmenter: Optional[Callable] = None,
        shuffle: bool = False,
    ) -> None:
        self.path = util.ensure_path(path)
        self.gold_preproc = gold_preproc
        self.max_length = max_length
        self.limit = limit
        self.augmenter = augmenter if augmenter is not None else dont_augment
        self.shuffle = shuffle

    def __call__(self, nlp: "Language") -> Iterator[Example]:
        """Yield examples from the data.
        nlp (Language): The current nlp object.
        YIELDS (Example): The examples.
        DOCS: https://spacy.io/api/corpus#call
        """
        ref_docs = self.read_docbin(nlp.vocab, walk_corpus(self.path, FILE_TYPE))
        if self.shuffle:
            ref_docs = list(ref_docs)
            random.shuffle(ref_docs)

        if self.gold_preproc:
            examples = self.make_examples_gold_preproc(nlp, ref_docs)
        else:
            examples = self.make_examples(nlp, ref_docs)
        for real_eg in examples:
            for augmented_eg in self.augmenter(nlp, real_eg):
                yield augmented_eg

It seems the shuffle parameter cannot be passed to spacy.Corpus.v1 from the config file. As a result, the shuffle attribute is always left at its default of False, so the reference docs are never shuffled.
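
If corpus-level shuffling is needed anyway, one possible workaround (an untested sketch based only on the signatures quoted above; the reader name is hypothetical) is to register a custom reader that forwards shuffle=True to Corpus and point the config at it instead of spacy.Corpus.v1:

from pathlib import Path
from typing import Callable, Iterable, Optional

from spacy import util
from spacy.language import Language
from spacy.training import Corpus, Example

@util.registry.readers("custom.ShuffledCorpus.v1")  # hypothetical name
def create_shuffled_docbin_reader(
    path: Optional[Path],
    gold_preproc: bool,
    max_length: int = 0,
    limit: int = 0,
    augmenter: Optional[Callable] = None,
) -> Callable[[Language], Iterable[Example]]:
    # Same parameters as spacy.Corpus.v1, but forwards shuffle=True to the
    # Corpus constructor, which does accept it (see __init__ above).
    return Corpus(
        path,
        gold_preproc=gold_preproc,
        max_length=max_length,
        limit=limit,
        augmenter=augmenter,
        shuffle=True,
    )

The [corpora.train] block would then use @readers = "custom.ShuffledCorpus.v1", with the module containing this function passed to spacy train via --code so the registration runs.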

adrianeboyd commented 3 years ago

Hi, for non-streamed corpora the examples are shuffled once per epoch when the training batches are created in the training loop here:

https://github.com/explosion/spaCy/blob/aba6ce3a43f3996388231ed423cba45967c42f14/spacy/training/loop.py#L305-L324

Edited to add: I added the shuffle option to Corpus when we were considering some other alternatives for the streamed corpora, and it didn't seem like a problem to leave it in as an option for other uses of Corpus. The default training loop has always shuffled the training examples once per epoch for finite corpora.
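
So even though Corpus itself defaults to shuffle=False, the materialized examples of a finite corpus are re-shuffled at the start of every epoch. A simplified sketch of that pattern (not the exact spaCy implementation linked above):

import random
from typing import Callable, Iterable, Iterator, List, Tuple

from spacy.language import Language
from spacy.training import Example

def train_batches_sketch(
    nlp: Language,
    corpus: Callable[[Language], Iterable[Example]],
    batcher: Callable[[Iterable[Example]], Iterable[List[Example]]],
    max_epochs: int,
) -> Iterator[Tuple[int, List[Example]]]:
    # For a finite (non-streamed) corpus, materialize the examples once ...
    examples = list(corpus(nlp))
    epoch = 0
    while max_epochs < 1 or epoch < max_epochs:
        # ... and re-shuffle them at the start of every epoch, independently
        # of the Corpus shuffle flag.
        random.shuffle(examples)
        for batch in batcher(examples):
            yield epoch, batch
        epoch += 1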

RajK853 commented 3 years ago

Ah now it makes sense šŸ˜„ Thanks for the clarification.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.