brando90 / MathNet-large-scale-Mathematics-Dataset-for-Machine-Learning


Research maybe language/linguistic techniques that could help to generate more data #30

Open brando90 opened 7 years ago

brando90 commented 7 years ago

Data augmentation techniques are important for this task. Right now we have:

1) variation through variable names
2) variation through names: first names, last names, cities, streets, etc.
3) perg
4) choiceg
5) variation through random choice of numerical values
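As a rough illustration of items 1, 2 and 5, here is a minimal sketch of a template that varies the variable name, a person's name and the numerical values. The name pools, the template wording and `generate_example` are all hypothetical; this is not the framework's API, and it does not use perg or choiceg.

```python
import random

# Hypothetical pools for illustration only; not taken from the framework.
VARIABLES = ["x", "y", "z", "a", "b"]
NAMES = ["Alice", "Bob", "Carmen", "Deepak"]


def generate_example(rng):
    """Return one (question, answer) pair built from a fixed template."""
    var = rng.choice(VARIABLES)   # variation through variable names
    name = rng.choice(NAMES)      # variation through people's names
    coeff = rng.randint(2, 9)     # variation through numerical values
    solution = rng.randint(1, 20)
    offset = rng.randint(1, 50)
    rhs = coeff * solution + offset  # keeps the equation consistent
    question = (
        f"{name} wants to solve {coeff}*{var} + {offset} = {rhs}. "
        f"What is {var}?"
    )
    answer = f"{var} = {solution}"
    return question, answer


rng = random.Random(0)  # seeded so runs are reproducible
for _ in range(3):
    print(generate_example(rng))
```

Every sample has the same meaning (solve a linear equation in one unknown) but a different surface form, which is the property the list above is after.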

To get an intuition for the number of examples per class in ImageNet: http://image-net.org/about-stats

It seems that they are all at or above roughly 10K examples per class. So maybe it would be nice if the framework could somehow help the user reach at least that many examples per class?
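One rough way to check whether a template can reach that target is to multiply the sizes of its independent choice pools. The sketch below assumes the choices are independent and that each combination yields a distinct sentence; the pool sizes and `count_variations` are made up for illustration.

```python
from math import prod


def count_variations(pool_sizes):
    """Number of distinct surface forms if every choice is independent."""
    return prod(pool_sizes.values())


# Hypothetical pool sizes for a single template.
pools = {
    "variable names": 5,
    "person names": 4,
    "coefficient values": 8,
    "solution values": 20,
    "offset values": 50,
}

TARGET_PER_CLASS = 10_000  # the per-class figure mentioned above

total = count_variations(pools)
print(total)                      # 5 * 4 * 8 * 20 * 50 = 160000
print(total >= TARGET_PER_CLASS)  # True
```

Note that this only counts surface forms. Examples that differ in a single number are probably much less diverse than 10K distinct images, so the count is an upper bound on useful variety rather than a guarantee.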

Are there other linguistic ways of changing a sentence (maybe its syntax) that keep the same meaning but alter the way the sentence looks?

For example, augmenting images is easy: rotations already provide a useful way to augment data sets.

brando90 commented 7 years ago

We could hard-code a database of (real) synonyms (not context-dependent), but that seems risky.
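To make the risk concrete, here is a minimal sketch of context-independent synonym substitution. The synonym table and `synonym_variants` are hypothetical, and the "product" entry is deliberately bad to show how a substitution that ignores context can break a math sentence.

```python
import re

# Hypothetical hand-written synonym table; not part of the framework.
SYNONYMS = {
    "compute": ["calculate", "evaluate"],
    "find": ["determine"],
    "product": ["merchandise"],  # fine for shopping, wrong for multiplication
}


def synonym_variants(sentence, synonyms=SYNONYMS):
    """Return sentences made by replacing every occurrence of one listed
    word with one of its synonyms (whole-word, case-sensitive match)."""
    variants = []
    for word, alternatives in synonyms.items():
        pattern = re.compile(rf"\b{re.escape(word)}\b")
        if pattern.search(sentence):
            for alt in alternatives:
                variants.append(pattern.sub(alt, sentence))
    return variants


for variant in synonym_variants("compute the product of 3 and 4"):
    print(variant)
# calculate the product of 3 and 4
# evaluate the product of 3 and 4
# compute the merchandise of 3 and 4
```

The first two variants keep the meaning. The third does not, and a table with no notion of context cannot tell the difference, so any such table would need to be curated for the math domain.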