push decontaminator code

I developed a pipeline that, for each question in the test data, retrieves the most similar questions from the training data.

The following command retrieves the similar questions and saves them back to the original_test.jsonl file.

python nemo_skills/evaluation/retrieve_similar_question.py \
  train_jsonl_files=/home/wedu/workspace/gsm8k-augmentation-questions/output*.jsonl \
  test_jsonl_files=/home/wedu/workspace/NeMo-Skills/datasets/gsm8k/original_test.jsonl \
  model=multi-qa-MiniLM-L6-cos-v1 \
  device=cuda
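
For reviewers who want the idea without reading the script, here is a minimal sketch of the retrieval step. It is not the code in retrieve_similar_question.py: the `embed` function is a toy token-count stand-in for the sentence-embedding model named in the command, and the `similar_items` field name and `top_k` parameter are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase token counts.

    Assumption: the real pipeline would use a dense embedding model
    (e.g. the one passed as `model=` in the command) instead.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_similar(test_questions, train_questions, top_k=2):
    """For each test question, return the top_k most similar train questions."""
    train_vecs = [embed(q) for q in train_questions]
    results = []
    for question in test_questions:
        q_vec = embed(question)
        # Rank training questions by similarity to this test question.
        ranked = sorted(
            range(len(train_questions)),
            key=lambda i: cosine(q_vec, train_vecs[i]),
            reverse=True,
        )
        results.append(
            {
                "question": question,
                # Hypothetical field name for the retrieved candidates.
                "similar_items": [train_questions[i] for i in ranked[:top_k]],
            }
        )
    return results
```

Each returned record pairs a test question with its nearest training questions, which is the shape the LLM check afterwards would need: one test question plus a short list of candidates to compare against.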

I also added two YAML prompt files that ask the LLM whether a test-set question and a candidate question are rephrased versions of each other:

nemo_skills/inference/prompt/openai/math-detect-few-shot.yaml
nemo_skills/inference/prompt/openai/math-detect-zero-shot.yaml

We can reuse generate_solutions.py for this check if we host our Llama-405B model, as shown below:

python nemo_skills/inference/generate_solutions.py \
    output_file=/home/wedu/workspace/NeMo-Skills/datasets/gsm8k/test_405b_few_shot_gsm8k-augmentation-questions.jsonl \
    +prompt=openai/math-detect-few-shot \
    ++dataset=gsm8k \
    ++split_name=original_test \
    ++max_samples=5000 \
    ++server.server_type=tensorrt_llm \
    ++server.host=batch-block1-1075 \
    ++sandbox.host=batch-block1-1075 \
    ++batch_size=512

I also added a contaminator eval type, similar to our math-grader, for cases where we want to use GPT-4 directly. It can be run with the following command:

python nemo_skills/evaluation/evaluate_results.py \
    prediction_jsonl_files=/home/wedu/workspace/NeMo-Skills/datasets/gsm8k/test_405b_zero_shot.jsonl \
    eval_type=contaminator \
    ++eval_config.grading_type=llm \
    ++eval_config.grading_config.skip_filled=False \
    ++eval_config.grading_config.batch_size=50
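
As a rough illustration of how the per-pair LLM verdicts could be turned into a contamination list, here is a small sketch. It is an assumption about the post-processing, not the contaminator implementation in this PR: the `question` and `judgement` field names and the "True"/"False" verdict format are hypothetical.

```python
def flag_contaminated(records):
    """Return test questions with at least one candidate judged a rephrase.

    Assumption: each record holds one (test question, candidate) pair and a
    free-text verdict that starts with "True" when the LLM considers the two
    questions rephrased versions of each other.
    """
    flagged = set()
    for rec in records:
        verdict = rec["judgement"].strip().lower()
        if verdict.startswith("true"):
            flagged.add(rec["question"])
    # Sort for a deterministic result.
    return sorted(flagged)
```

Under this rule a test question counts as contaminated as soon as a single retrieved candidate is judged a rephrase, which is the strict reading; a looser threshold would be a different design choice.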

Kipok / NeMo-Skills

push decontaminator code #88