PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

load_dataset 对数据类型的处理存在不一致 #2644

Closed x54-729 closed 1 year ago

x54-729 commented 2 years ago

Hi! I've recently been running the official tutorials with paddlenlp (version 2.3.3) and ran into a few problems:

  1. Some datasets fail to handle the `splits` argument of `load_dataset` correctly:
    from paddlenlp.datasets import load_dataset
    train_dataset, val_dataset = load_dataset("PaddlePaddle/dureader_robust", splits=("train", "validation"))

    Error:

    Traceback (most recent call last):
      File "train_qa.py", line 11, in <module>
        train_dataset = load_dataset("PaddlePaddle/dureader_robust", splits=("train", "validation"))
      File "/remote-home/shxing/anaconda3/envs/xsh-paddle/lib/python3.7/site-packages/paddlenlp/datasets/dataset.py", line 201, in load_dataset
        path_or_read_func, name=name, splits=splits, **kwargs)
      File "/remote-home/shxing/anaconda3/envs/xsh-paddle/lib/python3.7/site-packages/paddlenlp/datasets/dataset.py", line 139, in load_from_hf
        for feature in hf_datasets.features.values():
    AttributeError: 'tuple' object has no attribute 'features'

    Looking at the source, this seems to happen because the case where hf_datasets is a tuple is not handled?
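    Since the traceback shows `load_from_hf` iterating over `hf_datasets.features` directly, a fix would presumably need to normalize the tuple case first. A minimal sketch of such a guard (the helper name is hypothetical, not the actual PaddleNLP code):

    ```python
    def normalize_hf_datasets(hf_datasets):
        """Hypothetical helper: datasets.load_dataset returns a single Dataset
        for one split but a tuple/list of Datasets when several splits are
        requested; normalize both shapes to a list before touching .features."""
        if isinstance(hf_datasets, (tuple, list)):
            return list(hf_datasets)
        return [hf_datasets]
    ```

    `load_from_hf` could then loop `for ds in normalize_hf_datasets(hf_datasets):` and read `ds.features` safely in both the single-split and `splits=("train", "validation")` cases.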

  2. dataset.data has a different type depending on which dataset is loaded

    train_dataset = load_dataset("PaddlePaddle/dureader_robust", splits="train")
    print(type(train_dataset))
    train_dataset = load_dataset("dureader_robust", splits="train")
    print(type(train_dataset))

    Output:

    <class 'datasets.arrow_dataset.Dataset'>
    <class 'list'>

    This causes a problem when I try to run the reading-comprehension tutorial:

    • If I use the `dureader_robust` dataset, the `compute_prediction` function cannot run, because `dataloader.dataset.data` is a list, while `compute_prediction` both looks up the key `id` in its `examples` argument and iterates over it; moreover, the docstring of `compute_prediction` says `examples` should be a list, which contradicts the function body;
    • If I use the `PaddlePaddle/dureader_robust` dataset instead, new problems appear:

      • If I don't pass the `num_workers` argument to `train_dataset.map`, the `examples` argument of the preprocessing function `prepare_train_features` is `<class 'datasets.arrow_dataset.Dataset'>`:
        train_dataset = load_dataset("PaddlePaddle/dureader_robust", splits="train")
        def prepare_train_features(examples):
            print(type(examples)) # <class 'datasets.arrow_dataset.Dataset'>
        train_dataset.map(prepare_train_features, batched=True)
      • If I set `num_workers` to speed things up with multiple workers, `examples` is a list, inconsistent with the above;
        train_dataset = load_dataset("PaddlePaddle/dureader_robust", splits="train")
        def prepare_train_features(examples):
            print(type(examples)) # <class 'list'>
        train_dataset.map(prepare_train_features, batched=True, num_workers=2)
      • If I set `num_workers` and the function returns a dict, the `new_data` member becomes a list of the data's keys;

        train_dataset = load_dataset("PaddlePaddle/dureader_robust", splits="train")
        def prepare_train_features(examples):
            contexts = [examples[i]['context'] for i in range(len(examples))]
            questions = [examples[i]['question'] for i in range(len(examples))]

            tokenized_data = tokenizer(
                questions,
                contexts,
                stride=128,
                max_length=256,
                padding="max_length",
            )
            return tokenized_data
        train_dataset.map(prepare_train_features, batched=True, num_workers=2)
        print(train_dataset.new_data) # ['offset_mapping', 'input_ids', 'token_type_ids', 'overflow_to_sample', 'offset_mapping', 'input_ids', 'token_type_ids', 'overflow_to_sample']
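Until the two code paths agree, one defensive option on the caller's side is to write the map function so it accepts either shape. A sketch, assuming the columnar object supports `len()` and integer indexing as the arrow Dataset above appears to (the helper name is mine, not part of PaddleNLP):

```python
def as_records(examples):
    """Return a list of per-example dicts, whether `examples` arrives as a
    list of dicts (the num_workers path) or as an indexable columnar object
    (the single-process path). Hypothetical helper, not part of PaddleNLP."""
    if isinstance(examples, list):
        return examples
    return [examples[i] for i in range(len(examples))]
```

A preprocessing function could then start with `examples = as_records(examples)` and index rows uniformly regardless of how `map` dispatched the batch.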
guoshengCS commented 2 years ago

There are currently two ways to load the dureader_robust dataset. One uses the datasets library, like this:

from datasets import load_dataset
train_dataset, val_dataset = load_dataset("PaddlePaddle/dureader_robust", split=("train", "validation"))

The other uses paddlenlp.datasets, with slightly different arguments:

from paddlenlp.datasets import load_dataset
train_dataset, val_dataset = load_dataset("dureader_robust", splits=("train", "dev"))

The first way is now recommended and will be the main one going forward. We are also migrating the examples, but the related documentation has not been fully migrated yet. Sorry for the confusion.

guoshengCS commented 2 years ago

dataset.data has a different type depending on which dataset is loaded

This is also because loading with "PaddlePaddle/dureader_robust" actually goes through the datasets library, while loading with "dureader_robust" goes through PaddleNLP's built-in loader, so the returned data types differ.
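Given that explanation, code that must handle both loaders could branch on the observed type of `dataset.data`. A small sketch (the helper name and the module-name check are my assumptions, chosen to avoid a hard import of the datasets library):

```python
def data_backend(dataset_data):
    """Hypothetical helper: guess which loader produced dataset.data, based
    on the two types observed in this issue."""
    if isinstance(dataset_data, list):
        return "paddlenlp"  # the built-in loader returns a plain list
    if type(dataset_data).__module__.startswith("datasets"):
        return "huggingface"  # e.g. datasets.arrow_dataset.Dataset
    return "unknown"
```

Such a check is only a stopgap; once the examples are migrated to the datasets library, a single code path should suffice.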

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.