Open davidleejy opened 1 year ago
Thanks for your report! The example code works with numpy 1.21.6, so you can temporarily avoid this problem by using that version. For more detail: with numpy 1.21.6 the code still produces the following warning:
/remote-home/shxing/anaconda3/envs/fastnlp/lib/python3.7/site-packages/fastNLP/core/dataset/field.py:77: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
return np.array(contents)
Therefore, another solution is to change the source code at fastNLP/core/dataset/field.py:77
from
return np.array(contents)
to
return np.array(contents, dtype=object)
The output is:
[list([1212, 318, 281, 17180, 764]) list([40, 588, 22514, 764])
list([4677, 829, 389, 922, 329, 674, 1535, 764])]
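The one-line change above can be tried outside fastNLP. The following is only a minimal sketch, assuming plain Python lists of token ids like the ones in the output; `to_object_array` is a hypothetical helper and not part of fastNLP. It also illustrates one caveat: when every sequence happens to have the same length, `dtype=object` alone yields a 2-D array, so filling a pre-allocated object array is the more predictable way to keep one list per slot.

```python
import numpy as np

# Token-id lists of different lengths, copied from the output above.
contents = [
    [1212, 318, 281, 17180, 764],
    [40, 588, 22514, 764],
    [4677, 829, 389, 922, 329, 674, 1535, 764],
]

# The workaround from the reply: ask for an object array explicitly.
ragged = np.array(contents, dtype=object)


def to_object_array(seqs):
    """Hypothetical helper: always return a 1-D object array, one slot per sequence."""
    out = np.empty(len(seqs), dtype=object)
    for i, seq in enumerate(seqs):
        out[i] = seq  # store the list itself as a single element
    return out


# With equal-length sequences, dtype=object alone gives a 2-D array,
# while the helper still gives one list per element.
same_length = [[1, 2], [3, 4]]
as_2d = np.array(same_length, dtype=object)
as_1d = to_object_array(same_length)
```

Whether fastNLP needs the strictly 1-D form depends on how the field contents are consumed downstream, which this sketch does not cover.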
We are sorry that we did not take differing package versions into account, and we apologize for the inconvenience. We will discuss this problem and aim to provide a better solution in a future version.
I've tried a few ways to use tokenizers with DataSet and DataBundle objects, but without success.
Basically, just trying to do this:
I've tried associating the tokenizer with the DataSet object, but the same exception is encountered:
Python version 3.10, numpy 1.24.1 (are there other python packages whose versions I need to be careful about?)