WowCZ / LongMIT

LongMIT: Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets

[📑ArXiv] | [🤗HuggingFace]

🤗 Download LongMIT Datasets

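import json
import os

from datasets import load_dataset

# Placeholder, not defined in the original snippet: set this to a cache directory,
# or leave it as None to use the Hugging Face default cache location.
HFCACHEDATASETS = None


def download_longmit_datasets(dataset_name: str, save_dir: str):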
    qa_pairs = []
    dataset = load_dataset(dataset_name, split='train', cache_dir=HFCACHEDATASETS, trust_remote_code=True)
    for d in dataset:
        all_docs = d['all_docs']

        if d['type'] in ['inter_doc', 'intra_doc']:
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                # with CoT
                instruction_format = 'Answer the question based on the given passages.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages and provide a complete reasoning process.\nQuestion:{q}\nAnswer:'
            else:
                content_key = '文章 {pi}:\n'
                # with CoT
                instruction_format = '根据给定的段落回答问题。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题,并且给出完整的推理过程。\n问题:{q}\n答案:'
        else:
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                instruction_format = 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\nQuestion:{q}\nAnswer:'
            else:
                content_key = '文章 {pi}:\n'
                instruction_format = '根据给定的段落回答问题。只给答案,不要输出任何其他单词。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题。只给答案,不要输出任何其他单词。\n问题:{q}\n答案:'

        concat_content = '\n'.join([content_key.format(pi=di+1)+doc['content'] for di, doc in enumerate(all_docs)])
        question = d['question']
        answer = d['answer']

        qa_pairs.append(json.dumps(
            {
                'prompt': instruction_format.format(concat_content=concat_content, q=question),
                'output': answer
            }
        )+'\n')

    # Create the output directory if needed, then write one JSON object per line.
    os.makedirs(save_dir, exist_ok=True)

    with open(os.path.join(save_dir, 'train.jsonl'), 'w', encoding='utf-8') as fw:
        fw.write(''.join(qa_pairs))
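
To make the record layout concrete, the sketch below applies the same passage concatenation and the English chain-of-thought template to a single hand-written record. The record contents and the helper name `build_prompt` are illustrative assumptions; only the field names (`all_docs`, `content`, `question`, `answer`) and the template text come from the function above.

```python
import json

# Templates copied from the English chain-of-thought branch above.
CONTENT_KEY = 'Passage {pi}:\n'
INSTRUCTION_FORMAT = (
    'Answer the question based on the given passages.\n\n'
    'The following are given passages.\n{concat_content}\n\n'
    'Answer the question based on the given passages and provide a complete reasoning process.\n'
    'Question:{q}\nAnswer:'
)


def build_prompt(record: dict) -> dict:
    """Turn one dataset record into a {'prompt', 'output'} pair (hypothetical helper)."""
    # Number passages from 1 and join them with newlines, as in the loop above.
    concat_content = '\n'.join(
        CONTENT_KEY.format(pi=i + 1) + doc['content']
        for i, doc in enumerate(record['all_docs'])
    )
    return {
        'prompt': INSTRUCTION_FORMAT.format(concat_content=concat_content, q=record['question']),
        'output': record['answer'],
    }


# Toy record, invented for illustration only.
toy_record = {
    'type': 'inter_doc',
    'language': 'en',
    'all_docs': [
        {'content': 'Alice was born in Paris.'},
        {'content': 'Paris is the capital of France.'},
    ],
    'question': 'In which country was Alice born?',
    'answer': 'Alice was born in Paris, which is the capital of France, so the answer is France.',
}

pair = build_prompt(toy_record)
line = json.dumps(pair)  # corresponds to one line of train.jsonl
print(pair['prompt'])
```

Each line of `train.jsonl` written by `download_longmit_datasets` should have this same two-key shape, so it can be read back with `json.loads` one line at a time.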

🍴 Build Your Custom LongMIT Datasets

🌏 Environments

git clone https://github.com/WowCZ/LongMIT.git
cd LongMIT
git clone https://github.com/WowCZ/InternEmbedding.git
pip install -r requirements.txt

🚀 Crafting Long Context MIT

1. Organize the private text corpus with embedding models

Step-1: Embed the source text corpus
python doc_process/embed_doc.py --config doc_process/config/embedding/embedding_example.yaml --num_process_nodes 8
Step-2: Build the document graph with approximate kNN
python doc_process/build_doc_graph.py --command train_index --config doc_process/config/faiss/example_knn.yaml --xb example
wait

python doc_process/build_doc_graph.py --command index_shard --config doc_process/config/faiss/example_knn.yaml --xb example 
wait

python doc_process/build_doc_graph.py --command search --config doc_process/config/faiss/example_knn.yaml --xb example
wait
Step-3: Traverse the document graph
python doc_process/traverse_doc_graph.py
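
As a rough illustration of what Steps 2 and 3 produce, the sketch below builds a kNN graph over a handful of toy embeddings and then walks it to collect a group of related documents. It is a minimal sketch under stated assumptions, not the repository's implementation: the scripts above use a Faiss-based approximate search, whereas this example uses exact brute-force cosine similarity in pure Python, and the breadth-first walk is only one plausible traversal strategy. The function names `build_knn_graph` and `collect_related`, the `max_docs` budget, and the embeddings are all hypothetical.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_knn_graph(embeddings, k):
    """Exact brute-force kNN: map each doc index to its k most similar other docs."""
    graph = {}
    for i, u in enumerate(embeddings):
        scored = [(cosine(u, v), j) for j, v in enumerate(embeddings) if j != i]
        # Highest similarity first; ties broken by the lower doc index.
        scored.sort(key=lambda s: (-s[0], s[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


def collect_related(graph, start, max_docs):
    """Breadth-first walk from `start`, returning up to `max_docs` doc indices."""
    seen = [start]
    queue = deque([start])
    while queue and len(seen) < max_docs:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
                if len(seen) == max_docs:
                    break
    return seen


# Toy 2-D embeddings, invented for illustration: docs 0/1 and docs 2/3 form two clusters.
embeddings = [
    [1.0, 0.0],  # doc 0
    [0.9, 0.1],  # doc 1
    [0.0, 1.0],  # doc 2
    [0.1, 0.9],  # doc 3
]
graph = build_knn_graph(embeddings, k=1)
group = collect_related(graph, start=0, max_docs=3)
print(graph, group)
```

With `k=1` each document links only to its nearest neighbor, so a walk starting at doc 0 stays inside its own cluster; a larger `k` would let the walk reach more loosely related documents.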

2. Multi-Agent-Driven LongMIT Data Synthesis

python agent/distribute_run_agents.py --config agent/configs/longqa_example.yaml

🧷 Citation

If you find this repo useful in your work, please cite it with the following BibTeX entry (for example via \usepackage{biblatex}):

@article{chen2024essential,
  title={What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices},
  author={Chen, Zhi and Chen, Qiguang and Qin, Libo and Guo, Qipeng and Lv, Haijun and Zou, Yicheng and Che, Wanxiang and Yan, Hang and Chen, Kai and Lin, Dahua},
  journal={arXiv preprint arXiv:2409.01893},
  year={2024}
}