Right way to calculate eval scores

hitzkrieg commented 1 year ago

What would be the right way to get the eval results once the training finishes? Should I manually average the last 10% episode scores using the eval json file? (I have occasionally encountered cases when all files are saved after training except eval json) Or should I rely on progress.csv?

MarcCote commented 1 year ago

Good question. I started working on cleaning up this repo code to make it compatible with the recent version ScienceWorld. While cleaning up, I removed the internal scripts we used to extract the scores (see https://github.com/cognitiveailab/drrn-scienceworld/commit/4ed890969f51d580985c83ca8ac22de4443a8343, maybe they can be useful to you). But for now, @PeterAJansen might be better to answer this question.

yukioichida commented 1 year ago

Hi @MarcCote

Speaking about the results from the scienceworld paper, I am studying and trying to replicate some results in the paper (specifically the DRRN result) and I have a question: can I consider the DRRN results as zero-shot learning in test variations?

I am asking about it because the Table 2 description in the paper contains this part: "...Performance for RL agents is averaged over the last 10% of evaluation episodes..." and I am not sure if evaluation episodes here are meaning eval or test variations (in the script training, the default arg is "eval" and not "test").

Also, please correct me if I am missing some point.

PeterAJansen commented 1 year ago

Hi @yukioichida, apologies to be a little slow!

If I'm remembering correctly, the setup for the DRRN was essentially:

train for 100k steps (x8 threads, ~= 800k steps)
every so many training steps (I think 1k or 5k, under various configurations), spawn an evaluation environment set to a randomly selected variation index (drawn from dev for tuning, or test for the final eval), and run some steps in that environment (I think 100).
Once completed all the training steps, take the average performance of the last 10% of the scores on the evaluation set, and report these.

Having this data also allows us to plot the performance vs task curves, like in Figure 2.

I tried to setup this evaluation similar to by best understanding for how the existing text game literature/DRRN models were being used at the time. But, in retrospect, I think the cleanest evaluation would be to do just do something similar to how we evaluated the LLM-based agents:

Train the model up to some number of episodes
Evaluate the model iteratively on every variation within a given task, then average that performance across variations (rather than running for some static number of steps or episodes, randomly starting a new variation when the last finished, as in the DRRN evaluation).

The above protocol would be much cleaner, give an assessment of model performance across all task variations, and also give a fairly direct comparison to the LLM-based models.

I will say, as someone who stared at a very large number of DRRN trajectories during development: It really doesn't appear that the DRRN is learning much of anything, and most of its very modest performance seems to be randomly selecting the occasional action that helps incidental/optional goals (like moving to a new location, or opening a door) rather than actually helping task performance. The DRRN's performance is highest on the pick-and-place tasks (e.g. find a living/non-living thing), where its scores suggest that it successfully completes some of the very permissive picks some/most of the time. Other than that, task performance across the board is generally very low. So, while we should use the best possible protocol and research methods to measure performance, even so I doubt it would change the DRRN performance much/at all, as there doesn't seem to be much signal there to measure in the first place.

cognitiveailab / drrn-scienceworld

Right way to calculate eval scores #6