andreaskuster closed this issue 4 years ago
Have a look at the comments from version 1 too:
The biggest thing I would say overall is that, reading through your version 1 writeup, there are a *ton* of results in there that are still marked as to-do. Have some of those already been done, or are all of them still left to run? I would suggest narrowing the scope of your project to focus on only some of the original paper's experiments; that's fine! We'd much rather see a thorough investigation of one important set of experiments from the original paper than a rushed investigation of every experiment. Be sure, too, that as you're running your experiments, you're collecting all the information about them that the project instructions (particularly the computational requirements section) ask you to report (https://docs.google.com/document/d/1Dd9_VQHXseiroirUI-1rBDS6mJEUHiDQ7ND321O29W8/edit#bookmark=id.g9my7okeo4i3), so that you'll have that information to plug into your writeup.
For the final writeup, please write your own summary of what the original paper did instead of copying the paper's abstract; that will help introduce your report without making it sound like the experiments you're actually replicating are some of your own extensions.
The "hypothesis" section should also write out the *original paper's* hypotheses (in your own words) that your replicated experiments test, in such a way that it's possible to look at your hypotheses, look at your results without comparing them to the original paper's results, and give a "yes" or "no" answer to whether they hold up. (Including the original experiments' results in your table is fine, but a reader shouldn't *need* to look at those parts of the tables to determine whether the results of your experiments support your hypotheses.)
The OOV procedure that you describe is somewhat unconventional; the risk of letting (potentially) every unique token (character, in this case) in your training set contribute to the OOV embedding is that the trained OOV embedding will largely reflect very common characters, which will not be the case when you're actually facing unknown characters at test time. A much more common procedure is therefore to select only the rarest tokens in your training set and map all of them to OOV, so that the model's learned OOV representation reflects token rarity.
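As a rough illustration, a minimal sketch of this rare-token replacement could look like the following (the token symbol, the frequency threshold, and the function names are illustrative assumptions, not taken from your code):

```python
from collections import Counter

OOV_TOKEN = "<OOV>"   # placeholder symbol; the actual name is an assumption
MIN_COUNT = 2         # illustrative rarity threshold

def build_vocab_with_oov(train_sequences, min_count=MIN_COUNT):
    """Map the rarest training tokens to a shared OOV symbol.

    Tokens occurring fewer than `min_count` times are replaced by OOV_TOKEN,
    so the model learns its OOV embedding from genuinely rare tokens rather
    than from the whole vocabulary.
    """
    counts = Counter(tok for seq in train_sequences for tok in seq)
    vocab = {tok for tok, c in counts.items() if c >= min_count}
    vocab.add(OOV_TOKEN)

    def encode(seq):
        # Anything outside the kept vocabulary maps to the OOV symbol.
        return [tok if tok in vocab else OOV_TOKEN for tok in seq]

    return vocab, encode
```

Applying the same `encode` function to the training, validation, and test sequences then ensures that unseen test tokens fall back to the OOV embedding learned during training.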
I'm excited to see what you find! Let me know if you have any questions.
This is the structure we have to follow:
Your report will include the following. The amount of work put into each section below could be different for different reports. Generally, focus on what future researchers or practitioners would find useful for reproducing or building upon the paper you choose.
1. Contributions
a. A clear list of the scientific hypotheses evaluated in the original paper. Some papers don't make this super clear, so it can take a couple readings of the paper to understand.
b. A list of the hypotheses evaluated in your report. This will likely overlap with 1a.
c. A description of the experiments in the report, and how those experiments support the hypotheses in 1b.
2. Code
If writing your own code, make sure it is documented and easy to use (this project is about reproducibility!). Include a link to a github repository which can be installed and run with a few lines in bash on department machines. Include a description of how difficult the algorithms were to implement.
If using public code from the original repository, more of your energy will go into running additional experiments, such as hyperparameter optimization, ablations, or evaluation on new datasets (see below). However, note that it’s not always trivial to get a public code release working!
3. Experiment reproduction.
Model description (type of model, total number of parameters, etc.).
Dataset description (training / validation / test set sizes, label distribution, and other easily explained information a typical reader might want).
Hyperparameters: A clear description of the hyperparameters used in the experiments. While some hyperparameters will be specific to a particular model, there are many that are common (learning rate, dropout, size of each layer in the model, the total number of parameters, etc.). Lean towards reporting even uninteresting hyperparameters. You can see an example of how to do this in the appendix here.
For each experiment, a description of how it does or doesn't reproduce the claims in the original paper.
4. Experiments beyond the original paper. The amount you do will depend on how smoothly the above parts of the project went. Examples include:
Hyperparameter search: you could assess the sensitivity of the findings to one or more hyperparameters, or measure the variance of the evaluation score due to randomness in initial parameters. If you do hyperparameter search, be sure to describe the method used (grid search, uniform sampling, Bayesian optimization, etc.). At least include the min, max, mean / median, and variance of the performance (see the sketch after this list); further sensitivity analysis (e.g. plots) could be warranted.
Varying amounts of data: often a paper will only include the performance of a model after training on the full training set. You could evaluate a model (on validation data, not test data) with varying amounts of training data. An example of an interesting conclusion here could be that the baselines from the original paper outperform the new model when trained on a small amount of data, but eventually the new model outperforms the baseline.
Evaluate on a new dataset: evaluate if the conclusion as to which model performs best (as reported in the original paper) holds on a different dataset.
Ablations: some papers introduce many new ideas but don't evaluate the contribution of each individually. A valuable study could evaluate each component individually.
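As an illustration of the summary statistics asked for in the hyperparameter-search item above, a minimal sketch of uniform sampling with basic reporting might look like this (the search space, trial count, and the `run_trial` stand-in are illustrative assumptions, not part of the instructions):

```python
import random
import statistics

def run_trial(learning_rate, dropout):
    # Stand-in for the real train + evaluate call so the sketch runs end-to-end;
    # replace with the project's actual training routine returning a validation score.
    return random.random()

random.seed(0)
scores = []
for _ in range(20):                    # number of trials is an assumption
    lr = 10 ** random.uniform(-4, -2)  # log-uniform sample of the learning rate
    dropout = random.uniform(0.1, 0.5)
    scores.append(run_trial(lr, dropout))

print(f"min={min(scores):.4f}  max={max(scores):.4f}  "
      f"mean={statistics.mean(scores):.4f}  median={statistics.median(scores):.4f}  "
      f"variance={statistics.variance(scores):.4f}")
```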
5. Computational requirements
All reports should include relevant information about computational requirements. The requirements for the original paper should be (roughly) estimated. For the experiments in the report, include at least the type of hardware used, the average runtime for each approach, the total number of trials, the total number of (GPU) hours used, number of training epochs, and any other relevant info.
Some authors will have had access to infrastructure that is way out of your budget; don’t choose such a paper!
6. Discussion and recommendations for reproducibility
A section which discusses the larger implications of the experimental results, whether the original paper was reproducible, and if it wasn’t, what factors made it irreproducible.
A set of recommendations to the original authors or others who work in this area for improving reproducibility.
In order not to over-expand the README.md file in the root directory, I moved content (i.e. hyperparameter search, code usage, computational requirements and human effort, ...) to pos/README.md.
Please make sure to include this in the final report too.
Sounds good. I just started a branch with the initial version of the final report.
Make sure to follow all the points from the email.