Made improvements to the project, including adding a .gitignore file, enhancing the README.md (documentation on how to run the project), specifying dependencies in requirements.txt, fixing bugs in the dataset generation scripts (generate_dataset.py and generate_dataset_eval.py), and adding print statements to train.py for better visibility.
I'm still facing bugs while executing the train.py file. I've opened an issue in the original repository to see if anyone else is facing the same problem.
I'll investigate further how to solve it.
Bug:
Traceback (most recent call last):
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/src/train.py", line 221, in <module>
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/src/train.py", line 135, in main
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single
writer.write_batch(batch)
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 555, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 189, in __arrow_array__
out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "pyarrow/array.pxi", line 327, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values
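This ArrowInvalid is typically raised when a column handed to the Arrow writer contains lists in some rows and bare scalars in others, so Arrow cannot infer a single column type. A minimal sketch of such a batch in pure Python (the field values below are invented for illustration, not taken from train.py):

```python
# Hypothetical batch column resembling tokenizer output: two rows are
# lists of token ids, one row is a bare scalar (e.g. from a padding bug).
column = [
    [101, 2023, 102],
    [101, 7592, 102],
    0,  # non-list value mixed into a list-valued column
]

# Arrow refuses to build an array from such a mix; the same check in
# pure Python shows why the batch is invalid.
kinds = {isinstance(value, list) for value in column if value is not None}
print(len(kinds) > 1)  # True: list and non-list values are mixed
```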
Issue description:
I've executed the scripts generate_dataset.py and generate_dataset_eval.py for the Electronics and Beauty datasets. As a result, four files were generated, two for each dataset, as shown below:
Beauty_sequential,straightforward_sequential_train.json
Beauty_sequential,straightforward_sequential_validation_seen:0.json
Electronics_sequential,straightforward_sequential_train.json
Electronics_sequential,straightforward_sequential_validation_seen:0.json
However, when I execute the train.py file, the following error happens:
pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values
Does anyone know what could be the cause of this error?
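One way to narrow this down (a debugging sketch, not a confirmed fix) is to scan the batch returned by the preprocessing function right before the crash and report any field whose values mix list and non-list types. The helper name and the batch contents below are hypothetical:

```python
def mixed_type_fields(batch):
    """Return names of fields whose values mix list and non-list types
    (None values ignored), i.e. the columns Arrow would reject."""
    bad = []
    for name, values in batch.items():
        kinds = {isinstance(v, list) for v in values if v is not None}
        if len(kinds) > 1:
            bad.append(name)
    return bad

# Hypothetical batch shaped like a tokenizer's map() output.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6]],  # consistent: all lists
    "labels": [[7, 8], 9],                # mixed: list and scalar
}
print(mixed_type_fields(batch))  # ['labels']
```

Calling something like this inside the function passed to Dataset.map() would point at the offending column before the Arrow writer raises.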