WowCZ / LongMIT

LongMIT: Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets

[📑ArXiv] | [🤗HuggingFace]

🤗 Download LongMIT Datasets

import json
import os

from datasets import load_dataset

# Placeholder (assumption): point this at your Hugging Face cache directory.
# None falls back to the library's default cache location.
HFCACHEDATASETS = None


def download_longmit_datasets(dataset_name: str, save_dir: str):
    qa_pairs = []
    dataset = load_dataset(dataset_name, split='train', cache_dir=HFCACHEDATASETS, trust_remote_code=True)
    for d in dataset:
        all_docs = d['all_docs']

        if d['type'] in ['inter_doc', 'intra_doc']:
            # Multi-hop samples: ask for the answer with a full reasoning process (CoT).
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                # with CoT
                instruction_format = 'Answer the question based on the given passages.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages and provide a complete reasoning process.\nQuestion:{q}\nAnswer:'
            else:
                # Chinese version of the CoT prompt above
                content_key = '文章 {pi}:\n'
                # with CoT
                instruction_format = '根据给定的段落回答问题。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题,并且给出完整的推理过程。\n问题:{q}\n答案:'
        else:
            # All other sample types: ask for the answer only.
            if d['language'] == 'en':
                content_key = 'Passage {pi}:\n'
                instruction_format = 'Answer the question based on the given passages. Only give me the answer and do not output any other words.\n\nThe following are given passages.\n{concat_content}\n\nAnswer the question based on the given passages. Only give me the answer and do not output any other words.\nQuestion:{q}\nAnswer:'
            else:
                # Chinese version of the answer-only prompt above
                content_key = '文章 {pi}:\n'
                instruction_format = '根据给定的段落回答问题。只给答案,不要输出任何其他单词。\n\n以下是给定的段落。\n{concat_content}\n\n请结合上面材料回答以下问题。只给答案,不要输出任何其他单词。\n问题:{q}\n答案:'

        # Number the passages from 1 and join them into a single context string.
        concat_content = '\n'.join([content_key.format(pi=di+1)+doc['content'] for di, doc in enumerate(all_docs)])
        question = d['question']
        answer = d['answer']

        # One JSON object per line (JSON Lines).
        qa_pairs.append(json.dumps(
            {
                'prompt': instruction_format.format(concat_content=concat_content, q=question),
                'output': answer
            }
        )+'\n')

    if not os.path.exists(save_dir):
        os.makedirs(save_dir)

    with open(os.path.join(save_dir, 'train.jsonl'), 'w', encoding='utf-8') as fw:
        fw.write(''.join(qa_pairs))
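
The download function stores one JSON object per line in `train.jsonl`, each with a `prompt` and an `output` key. A minimal sketch for reading such a file back is shown here; `load_jsonl` and `demo_roundtrip` are hypothetical helper names for illustration and are not part of this repo, and the toy records only mimic the file layout.

```python
import json
import os
import tempfile


def load_jsonl(path: str) -> list:
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, 'r', encoding='utf-8') as fr:
        for line in fr:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def demo_roundtrip() -> list:
    """Write two toy records in the train.jsonl layout and read them back."""
    toy = [
        {'prompt': 'Question:Who wrote A?\nAnswer:', 'output': 'B'},
        {'prompt': '问题:谁写了A?\n答案:', 'output': 'B'},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'train.jsonl')
        with open(path, 'w', encoding='utf-8') as fw:
            fw.write(''.join(json.dumps(r) + '\n' for r in toy))
        return load_jsonl(path)
```

On real data, `load_jsonl(os.path.join(save_dir, 'train.jsonl'))` should return the prompt/output pairs ready for an instruction-tuning pipeline.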

🍴 Build Your Custom LongMIT Datasets

🌏 Environments

cd LongMIT
git clone
pip install -r requirements.txt

🚀 Crafting Long Context MIT

1. Organize the private text corpus with embedding models

Step-1: Embed the source text corpus:
python doc_process/ --config doc_process/config/embedding/embedding_example.yaml --num_process_nodes 8
Step-2: Build the document graph with approximate kNN:
python doc_process/ --command train_index --config doc_process/config/faiss/example_knn.yaml --xb example

python doc_process/ --command index_shard --config doc_process/config/faiss/example_knn.yaml --xb example 

python doc_process/ --command search --config doc_process/config/faiss/example_knn.yaml --xb example
Step-3: Traverse the document graph:
python doc_process/
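
To make the idea behind Steps 2 and 3 concrete, here is a toy sketch in pure Python. It is not the repo's implementation: the config paths suggest the pipeline uses a Faiss index for approximate kNN search, whereas this sketch uses exact brute-force cosine similarity as a stand-in, and the breadth-first traversal is an assumption, since the traversal strategy is not described here. The names `build_knn_graph` and `traverse` are hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def build_knn_graph(embeddings, k):
    """Map each document id to its k most similar documents.

    Exact brute-force search, O(n^2); a stand-in for approximate kNN.
    """
    graph = {}
    for i, emb in enumerate(embeddings):
        scored = [(cosine(emb, other), j)
                  for j, other in enumerate(embeddings) if j != i]
        # Highest similarity first; ties broken by document id.
        scored.sort(key=lambda t: (-t[0], t[1]))
        graph[i] = [j for _, j in scored[:k]]
    return graph


def traverse(graph, start, max_docs):
    """Breadth-first walk from `start`, collecting up to `max_docs` document ids."""
    collected = [start]
    visited = {start}
    queue = deque([start])
    while queue and len(collected) < max_docs:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                collected.append(neighbor)
                queue.append(neighbor)
                if len(collected) == max_docs:
                    break
    return collected
```

Each group returned by `traverse` is a set of semantically related documents, which is the kind of input a multi-hop question needs: the answer has to draw on more than one passage.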

2. Multi-Agent-Driven LongMIT Data Synthesis

python agent/ --config agent/configs/longqa_example.yaml

🧷 Citation

If you find the content of this repo useful in your work, please cite it as follows via \usepackage{biblatex}:

  title={What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices},
  author={Chen, Zhi and Chen, Qiguang and Qin, Libo and Guo, Qipeng and Lv, Haijun and Zou, Yicheng and Che, Wanxiang and Yan, Hang and Chen, Kai and Lin, Dahua},
  journal={arXiv preprint arXiv:2409.01893},