alvinwmtan / dev-bench


Questions about the data for several testing tasks #3

Open ShawnKing98 opened 3 weeks ago

ShawnKing98 commented 3 weeks ago

Hi, I'm trying to test my own model on DevBench and look at the differences between model predictions and human predictions. I'm able to generate the embedding features and prediction logits by running eval.py, but I'm having some trouble parsing the human prediction data. Could you please help me? The details are listed below:

  1. For the TROG task, the number of trials in this codebase (N=78) is far lower than the number mentioned in the paper (N=514). Is data missing, or can the rest not be released due to privacy constraints?
  2. For the Viz Vocab task, there are 119 test samples in manifest.csv that go through the model, but only 108 human trials in the provided human.csv; neither is consistent with the number in the paper (N=1780). Is data missing, or am I misunderstanding something?
  3. For the THINGS task, the number of samples (N=1854) is again inconsistent with the number in your paper (N=12340), but it is consistent with the original paper, "Revealing interpretable object representations from human behavior". How did you calculate the number of samples for this task?
  4. For the VOC task, the human data seem to be stored in the human.rds file. I guessed it was an R file and tried to parse it with a few Python packages, but failed. Since many people are familiar with Python but not with R, would it be possible to provide an alternative file that is readable from Python? I'd really appreciate it!
  5. For the WAT task, I'm completely lost in retrieving the human data. I can see an entwisle_norms.csv file and several Cue_Target_Pairs files, but I have no idea what they mean. Could you please elaborate on the format of these human performance data files, or provide a code template that can parse them?

Thank you very much for your patience, and I appreciate any help you can provide!

alvinwmtan commented 3 weeks ago

Hello! Thank you for getting in touch. For Q1–3, note that the Ns given in the paper reflect the number of human participants rather than the number of trials. Additionally:

  1. (NA)
  2. You're right that there are only 108 human trials; the additional 11 were pilot/testing trials that weren't included in the final human data. These were dropped in the evaluation, although we hope to use them in the future (which is why they were not excluded from the manifest).
  3. (NA)
  4. Yes—I'm currently working on porting all the analyses to Python, and that will include storing the data in a .npy file, so keep an eye out for that update!
  5. The data are a set of cue–target response counts for children (entwisle_norms.csv) and adults (in adult/). For each cue, we look at the distribution over all targets, and compare with model-rated similarities between the cue and each target. As in (4), I'm hoping to provide a Python version of the analysis which should make it easier to use.
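To make the WAT procedure above concrete, here is a self-contained sketch with made-up counts and similarities (not the actual file contents or column layout): normalise each cue's response counts into a distribution over targets, then rank-correlate that distribution with the model-rated cue–target similarities.

```python
# Illustrative sketch of the cue-target comparison; the data below are
# hypothetical, not taken from entwisle_norms.csv or the adult/ files.
from collections import defaultdict

# Hypothetical (cue, target) -> human response count
counts = {
    ("dog", "cat"): 40, ("dog", "bone"): 25, ("dog", "bark"): 10,
}

# Normalise counts into a per-cue distribution over targets
dist = defaultdict(dict)
totals = defaultdict(int)
for (cue, target), n in counts.items():
    totals[cue] += n
for (cue, target), n in counts.items():
    dist[cue][target] = n / totals[cue]

# Hypothetical model-rated similarities between each cue and target
model_sim = {("dog", "cat"): 0.8, ("dog", "bone"): 0.6, ("dog", "bark"): 0.3}

def spearman(xs, ys):
    """Spearman rank correlation (no tie handling, for illustration only)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Compare the human distribution with model similarities, per cue
for cue, targets in dist.items():
    ts = sorted(targets)
    rho = spearman([targets[t] for t in ts], [model_sim[(cue, t)] for t in ts])
    print(cue, round(rho, 3))
```

In a real analysis you would read the counts from the CSV files and use a tie-aware correlation (e.g. scipy.stats.spearmanr) instead of the toy implementation here.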