TromboneDavies / PolarOps

Magically conjure up more training data using bootstrapping #60

Open divilian opened 3 years ago

divilian commented 2 years ago

The bootstrap.py file (in keras) is a first cut at doing bootstrapping. Instructions are in comments within. We'll take a look at the results of this in tomorrow's (7/12) meeting.

divilian commented 2 years ago

General process:

  1. Google for "best practices on bootstrapping text data"
  2. Run bootstrap.py, which generates two additional .csv files: one for "presumably strongly non-polar" and one for "presumably strongly polar".
  3. Add these to the hand-tagged training data, run the combined, larger sample through validate.py, and see whether the accuracy increases.
  4. Lather, rinse, repeat.
  5. Make a cool plot (which will be a slide for symposium) showing accuracy vs. number of bootstrap samples added. If the oracles are right, accuracy should increase at first and then decline, so there should be a "sweet spot" for the amount of additional machine-labeled training data.
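
A rough sketch of the loop in steps 2 to 5, for anyone who wants to see its shape before running the real scripts. It does not use bootstrap.py or validate.py (their internals aren't shown in this thread); a toy one-feature nearest-centroid classifier stands in for the real model, and every name here (`train`, `predict`, `bootstrap_curve`, `batch_size`) is hypothetical. The data is synthetic, so the printed accuracies say nothing about PolarOps itself.

```python
import random

def train(samples):
    """Fit a toy nearest-centroid classifier on (feature, label) pairs."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in samples if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids

def predict(centroids, x):
    """Return (label, confidence); confidence is the gap between the two distances."""
    d0, d1 = abs(x - centroids[0]), abs(x - centroids[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)

def accuracy(centroids, samples):
    return sum(predict(centroids, x)[0] == y for x, y in samples) / len(samples)

def bootstrap_curve(labeled, unlabeled, held_out, batch_size, rounds):
    """Return [(n_machine_labeled_added, held_out_accuracy), ...]."""
    train_set, pool = list(labeled), list(unlabeled)
    curve = [(0, accuracy(train(train_set), held_out))]
    for _ in range(rounds):
        if not pool:
            break
        model = train(train_set)
        # Most confident predictions first: the "presumably strongly" labeled ones.
        ranked = sorted(pool, key=lambda x: predict(model, x)[1], reverse=True)
        chosen, pool = ranked[:batch_size], ranked[batch_size:]
        train_set += [(x, predict(model, x)[0]) for x in chosen]
        n_added = len(train_set) - len(labeled)
        curve.append((n_added, accuracy(train(train_set), held_out)))
    return curve

# Synthetic stand-in data (assumption): label 0 centered at -1, label 1 at +1.
random.seed(0)

def draw(n):
    return ([(random.gauss(-1, 1), 0) for _ in range(n)]
            + [(random.gauss(1, 1), 1) for _ in range(n)])

labeled, held_out = draw(10), draw(200)
unlabeled = [x for x, _ in draw(300)]
curve = bootstrap_curve(labeled, unlabeled, held_out, batch_size=100, rounds=5)
for n_added, acc in curve:
    print(n_added, round(acc, 3))
```

The `curve` list is what the accuracy-vs-samples plot would be drawn from. Accuracy is always measured on held-out hand-tagged data, never on the machine-labeled samples.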

akochans commented 2 years ago

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.63.7998&rep=rep1&type=pdf

The algorithm iterates until it converges to a point where θ̂ (the classifier) does not change from one iteration to the next. Algorithmically, we determine that convergence has occurred by observing a below-threshold change in the log-probability of the parameters (Equation 2.11), which is the height of the surface on which EM is hill-climbing.
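
To make the stopping rule concrete, here is a minimal sketch of "iterate until the change drops below a threshold". It is not the paper's classifier: as a swapped-in stand-in it runs EM on a two-component 1-D Gaussian mixture with unit variance, and it tracks the data log-likelihood rather than the log-probability of the parameters from Equation 2.11. The function names, `tol`, and `max_iter` are all assumptions.

```python
import math
import random

def log_likelihood(data, means, weights):
    """Log-likelihood of the data under a unit-variance Gaussian mixture."""
    norm = math.sqrt(2 * math.pi)
    return sum(
        math.log(sum(w * math.exp(-0.5 * (x - m) ** 2) / norm
                     for m, w in zip(means, weights)))
        for x in data
    )

def em_until_converged(data, means, tol=1e-6, max_iter=500):
    """Run EM until the log-likelihood improves by less than tol."""
    weights = [0.5, 0.5]
    previous = log_likelihood(data, means, weights)
    for iteration in range(1, max_iter + 1):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            p = [w * math.exp(-0.5 * (x - m) ** 2) for m, w in zip(means, weights)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate means and mixing weights.
        totals = [sum(r[k] for r in resp) for k in range(2)]
        means = [sum(r[k] * x for r, x in zip(resp, data)) / totals[k]
                 for k in range(2)]
        weights = [t / len(data) for t in totals]
        current = log_likelihood(data, means, weights)
        if current - previous < tol:  # below-threshold change: stop
            return means, weights, iteration, current
        previous = current
    return means, weights, max_iter, previous

# Synthetic data (assumption): two clusters centered at -2 and +2.
random.seed(1)
data = ([random.gauss(-2, 1) for _ in range(200)]
        + [random.gauss(2, 1) for _ in range(200)])
start_ll = log_likelihood(data, [-1.0, 1.0], [0.5, 0.5])
means, weights, iterations, final_ll = em_until_converged(data, [-1.0, 1.0])
print(iterations, [round(m, 2) for m in means])
```

The same pattern would apply to our loop: pick a quantity that should stop moving (here the log-likelihood), and quit when one more pass changes it by less than `tol`.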

https://stats.stackexchange.com/questions/33300/determining-sample-size-necessary-for-bootstrap-method-proposed-method

Also people often ask "How many bootstrap samples do I need to get a good Monte Carlo approximation to the bootstrap result?" My suggested answer to that question is to keep increasing the size until you get convergence. No one number fits all problems.

https://web.archive.org/web/20170110141727/http://web.stanford.edu/class/psych252/tutorials/doBootstrapPrimer.pdf (this is a psychology paper)

How many observations should we resample? A good suggestion is the original sample size.

How many iterations should we run? The field has to come to a consensus to a conventional number, but in the meantime, I would suggest something like 5,000 (which my statistics lecturer proposed too when I learnt bootstrapping). (Technically, one would keep increasing the iteration size until the simulation converges, i.e. when the results from a run doesn’t change if you add more iterations.)
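
A small sketch of what the last two sources describe: resample with replacement at the original sample size, and keep increasing the number of iterations until the result stops changing. Note this is the statistical bootstrap (resampling), which is a different thing from the machine-labeling loop in bootstrap.py. The statistic (standard error of the mean), the doubling schedule, and the 5% tolerance are assumptions picked for illustration.

```python
import random
import statistics

def bootstrap_se(data, n_iterations, rng):
    """Bootstrap standard error of the mean.

    Each iteration resamples len(data) observations with replacement,
    i.e. the original sample size.
    """
    means = [statistics.fmean(rng.choices(data, k=len(data)))
             for _ in range(n_iterations)]
    return statistics.stdev(means)

def iterations_until_stable(data, start=500, tol=0.05, max_iterations=16000, seed=0):
    """Double the iteration count until the estimate moves by at most tol (relative)."""
    rng = random.Random(seed)
    n = start
    previous = bootstrap_se(data, n, rng)
    while n * 2 <= max_iterations:
        n *= 2
        current = bootstrap_se(data, n, rng)
        if abs(current - previous) <= tol * previous:
            return n, current
        previous = current
    return n, previous  # hit the cap without meeting the tolerance

# Synthetic sample (assumption): 50 draws from a standard normal.
sample_rng = random.Random(42)
sample = [sample_rng.gauss(0, 1) for _ in range(50)]
n_used, se = iterations_until_stable(sample)
print(n_used, round(se, 3))
```

For 50 standard-normal draws the estimate should land near 1/sqrt(50), about 0.14, which gives a quick sanity check on the output.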