arcee-ai / DALM

Domain Adapted Language Modeling Toolkit - E2E RAG
https://www.arcee.ai
Apache License 2.0
300 stars 39 forks source link

added Qwen2.5 to generate QA pairs. #96

Closed shamanez closed 2 weeks ago

shamanez commented 2 weeks ago

Integrate Qwen2.5 7B Model for Question Generation

Changes

Rationale

The Qwen2.5 7B model provides more advanced question generation capabilities compared to the previous T5 model. By focusing solely on question generation without answers, we streamline the process for scenarios where RAG is not being performed end-to-end.

How to Run

  1. Ensure you have the required dependencies installed:

    pip install transformers datasets torch
  2. Place your knowledge_dataset.csv file in the same directory as the script. There's a mock one, so don't worry.

  3. Run the script with the following command:

    python question_answer_generation.py \
       --dataset_path=knowledge_dataset.csv \
       --batch_size=8 \
       --sample_size=50 \
       --output_dir=out

    Adjust the batch_size and sample_size as needed. The output_dir specifies where the generated questions will be saved.

  4. The script will process the dataset, generate questions, and save the results in the specified output directory.

Notes

Sriharsha-hatwar commented 2 weeks ago

But, there is a check that seems to be failing?

Jacobsolawetz commented 2 weeks ago

@Crystalcareai for reference this is qa gen for retrieval training on your 2.5 7B rec