fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0

How to use tokenizers with DataSet and DataBundle? #449

Open davidleejy opened 1 year ago

davidleejy commented 1 year ago

I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but have not been successful.

Basically, just trying to do this:

# Initialize DataSet object `ds` with data.
# Initialize DataBundle object with DataSet object `ds`.
# Define tokenizer.
# Associate tokenizer with field in DataSet or DataBundle object.
# Hope to see tokenizer work when batches of data are extracted from DataSet object.
from fastNLP import DataSet
from fastNLP import Vocabulary
from fastNLP.io import DataBundle
from functools import partial
from transformers import GPT2Tokenizer

data = {'idx': [0, 1, 2],  
        'sentence':["This is an apple .", "I like apples .", "Apples are good for our health ."],
        'words': [['This', 'is', 'an', 'apple', '.'], 
                  ['I', 'like', 'apples', '.'], 
                  ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],
        'num': [5, 4, 7]}

dataset = DataSet(data)    # Initialize DataSet object with data.

data_bundle = DataBundle(datasets={'train': dataset})    # Initialize DataBundle object

# Define tokenizer:
tokenizer_in = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer_in.pad_token, tokenizer_in.padding_side = tokenizer_in.eos_token, 'left'
tokenizer_in_fn = partial(tokenizer_in.encode_plus, padding=True, return_attention_mask=True)
print(tokenizer_in_fn)       # ensure that settings are as expected.

# Associate tokenizer with field:
data_bundle.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')

print(dataset[0:3])
# Gives:
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | idx | sentence       | words          | num | input_ids      | attention_mask     | length |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+
# | 0   | This is an ... | ['This', 'i... | 5   | [1212, 318,... | [1, 1, 1, 1, 1]... | 5      |
# | 1   | I like appl... | ['I', 'like... | 4   | [40, 588, 2... | [1, 1, 1, 1]       | 4      |
# | 2   | Apples are ... | ['Apples', ... | 7   | [4677, 829,... | [1, 1, 1, 1, 1,... | 8      |
# +-----+----------------+----------------+-----+----------------+--------------------+--------+

# Try to obtain batch data:
ds = data_bundle.get_dataset('train')
print(ds['sentence'].get([0,1,2])) # okay, no problem.
print(ds['input_ids'].get([0,1,2])) # throws exception.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[66], line 1
----> 1 print(ds['input_ids'].get([0,1,2]))

File ~/condaenvs/bbt-hf425-py310/lib/python3.10/site-packages/fastNLP/core/dataset/field.py:77, in FieldArray.get(self, indices)
     75 except BaseException as e:
     76     raise e
---> 77 return np.array(contents)

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (3,) + inhomogeneous part.
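The exception comes from NumPy itself rather than fastNLP: starting with NumPy 1.24, building an array from rows of different lengths raises a `ValueError` instead of emitting a deprecation warning. A minimal reproduction, independent of fastNLP:

```python
import numpy as np

# Token-id lists of different lengths, like the `input_ids` field above.
ragged = [[1212, 318, 281, 17180, 764],
          [40, 588, 22514, 764]]

try:
    np.array(ragged)      # NumPy >= 1.24: raises ValueError
    outcome = "warned"    # NumPy <= 1.23: succeeds (object array) with a
                          # VisibleDeprecationWarning
except ValueError:
    outcome = "raised"

print(outcome)
```

Which branch runs depends only on the installed NumPy version, which is why the same fastNLP code behaves differently across environments.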

I've also tried associating the tokenizer with the DataSet object directly, but the same exception occurs:

# ds is initialized DataSet object
ds.apply_field_more(tokenizer_in_fn, field_name='sentence', progress_bar='tqdm')
ds['input_ids'].get([0,1,2])  # throws same exception as above.

Python 3.10, numpy 1.24.1. (Are there other Python packages whose versions I need to be careful about?)
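One way to sidestep the error without touching fastNLP or NumPy is to make every row the same length, e.g. by tokenizing with `padding='max_length'` and a fixed `max_length`, so that `np.array(contents)` sees a homogeneous 2-D array. A minimal sketch of the idea using a toy tokenizer (`toy_encode` is hypothetical, standing in for `tokenizer.encode_plus`):

```python
import numpy as np

def toy_encode(sentence, max_length=8, pad_id=0):
    # Hypothetical stand-in for tokenizer.encode_plus(..., padding='max_length'):
    # whitespace-split, map each token to a dummy id, then pad/truncate
    # every row to the same fixed length.
    ids = list(range(1, len(sentence.split()) + 1))[:max_length]
    mask = [1] * len(ids) + [0] * (max_length - len(ids))
    ids = ids + [pad_id] * (max_length - len(ids))
    return {'input_ids': ids, 'attention_mask': mask}

rows = [toy_encode(s) for s in ["This is an apple .", "I like apples ."]]
arr = np.array([r['input_ids'] for r in rows])
print(arr.shape)   # (2, 8) -- every row has length 8, so np.array succeeds
```

With `GPT2Tokenizer` this would mean passing `padding='max_length'`, `max_length=...`, and `truncation=True` to `encode_plus` instead of `padding=True`; the trade-off is wasted padding on short sentences.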

x54-729 commented 1 year ago

Thanks for your report! The example code works with numpy 1.21.6, so you can temporarily avoid this problem by downgrading to numpy 1.21.6. For more detail, numpy 1.21.6 emits a warning at the same spot:

/remote-home/shxing/anaconda3/envs/fastnlp/lib/python3.7/site-packages/fastNLP/core/dataset/field.py:77: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  return np.array(contents)

Alternatively, you can change the source code at fastNLP/core/dataset/field.py:77 from

return np.array(contents)

to

return np.array(contents, dtype=object)

The output is:

[list([1212, 318, 281, 17180, 764]) list([40, 588, 22514, 764])
 list([4677, 829, 389, 922, 329, 674, 1535, 764])]
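With `dtype=object`, NumPy stores each row as a Python list instead of trying to build a rectangular array, so ragged lengths are accepted on every NumPy version. A quick check of the behavior:

```python
import numpy as np

contents = [[1212, 318, 281, 17180, 764],
            [40, 588, 22514, 764],
            [4677, 829, 389, 922, 329, 674, 1535, 764]]

arr = np.array(contents, dtype=object)  # 1-D array of Python lists
print(arr.shape)      # (3,)
print(list(arr[1]))   # [40, 588, 22514, 764]
```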

We're sorry we didn't take different package versions into consideration, and we apologize for the inconvenience. We will discuss this problem and provide a better solution in a future version.