Closed martincjespersen closed 2 years ago
The way augmenty is set up now, it only allows augmentation within a sample, i.e. for:
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool"
you could get:
"Augmenty is a wonderful tool for augmentation."
"Augmentation is a wonderful tool"
"Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool"
But never:
Augmenty is a wonderful tool for augmentation. Augmentation is a wonderful tool
for obtaining higher performance on limited data.
I still think the augmenter is relevant though. The other point would require #14, which is a known problem with the spaCy augmentation setup as it currently stands.
Will be added in #50
Added in newest version
A paragraph subset augmenter that can work on the token or sentence level. It samples a random percentage of contiguous tokens/sentences to include and a random token/sentence start position, chosen so that the sampled span still fits within the text. The augmenter needs to handle annotated entities and avoid breaking them.
Input arguments:
- level: how often to apply the augmenter.
- min_paragraf: minimum percentage of tokens or sentences to include, e.g. 4 sentences with min_paragraf=0.5 means that at least 2 sentences are included.
- sentence_level: Boolean defining whether to augment on the sentence level or the token level.
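To make the proposal concrete, here is a minimal sketch of the sampling logic in plain Python. It is not augmenty's implementation and does not use the spaCy or augmenty APIs. The function name `paragraph_subset`, the `entities` argument (half-open, non-overlapping index spans) and the reading of `level` as a probability are all assumptions made for illustration. The parameter is spelled `min_paragraph` here. `units` can be a list of tokens or a list of sentences, which stands in for the `sentence_level` switch.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def paragraph_subset(
    units: Sequence[str],
    min_paragraph: float = 0.5,
    level: float = 1.0,
    entities: Sequence[Tuple[int, int]] = (),
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Return a random contiguous run of `units` (tokens or sentences).

    Hypothetical helper: `entities` holds half-open (start, end) index spans
    that must not be split, and `level` is treated as a probability.
    """
    if not 0.0 < min_paragraph <= 1.0:
        raise ValueError("min_paragraph must be in (0, 1]")
    rng = rng or random.Random()
    n = len(units)
    # Skip the augmentation with probability 1 - level.
    if n == 0 or rng.random() >= level:
        return list(units)

    # Sample how many units to keep, then a start position that still fits.
    min_len = max(1, math.ceil(min_paragraph * n))
    length = rng.randint(min_len, n)
    start = rng.randint(0, n - length)
    end = start + length

    # Widen the window so that no entity is cut in half. Widening can only
    # make the window longer, so the minimum length still holds.
    for ent_start, ent_end in entities:
        if ent_start < start < ent_end:
            start = ent_start
        if ent_start < end < ent_end:
            end = ent_end
    return list(units[start:end])


if __name__ == "__main__":
    sentences = [
        "Augmenty is a wonderful tool for augmentation.",
        "Augmentation is a wonderful tool.",
        "It helps on limited data.",
        "This is the last sentence.",
    ]
    print(paragraph_subset(sentences, min_paragraph=0.5))
```

Widening the window at entity boundaries, rather than resampling, keeps the sketch short, but it biases the output towards slightly longer spans when entities are present.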
Example - sentence level
Example outputs:
The first section:
The middle section:
The middle section:
Additional thoughts:
Possibly add a reverse augmenter, e.g. one that removes a coherent section of tokens/sentences.
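The reverse idea could look roughly like the sketch below, which drops one contiguous run and keeps the rest. Again this is plain Python under assumed names: `remove_section` and `max_removed` (the largest fraction of units that may be dropped) are hypothetical and not part of augmenty. Entity handling is left out for brevity.

```python
import math
import random
from typing import List, Optional, Sequence


def remove_section(
    units: Sequence[str],
    max_removed: float = 0.5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Drop one random contiguous run of tokens/sentences from `units`."""
    if not 0.0 <= max_removed < 1.0:
        raise ValueError("max_removed must be in [0, 1)")
    rng = rng or random.Random()
    n = len(units)
    max_len = math.floor(max_removed * n)
    if max_len < 1:
        # Nothing can be removed without exceeding the allowed fraction.
        return list(units)
    length = rng.randint(1, max_len)
    start = rng.randint(0, n - length)
    # Keep everything before and after the removed window.
    return list(units[:start]) + list(units[start + length:])


if __name__ == "__main__":
    tokens = "Augmenty is a wonderful tool for augmentation".split()
    print(remove_section(tokens, max_removed=0.4))
```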