Made improvements to the project, including adding a .gitignore file, enhancing the README.md (documentation on how to run the project), specifying dependencies in requirements.txt, fixing bugs in the dataset generation scripts (generate_dataset.py and generate_dataset_eval.py), and adding print statements to train.py for better visibility.
I'm still facing bugs while executing the train.py file. I've opened an issue in the original repository to see if anyone else is facing the same problem.
I'll investigate further how to solve it.
Bug:
Traceback (most recent call last):
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/src/train.py", line 221, in <module>
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/src/train.py", line 135, in main
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3097, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3493, in _map_single
writer.write_batch(batch)
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 555, in write_batch
arrays.append(pa.array(typed_sequence))
File "pyarrow/array.pxi", line 243, in pyarrow.lib.array
File "pyarrow/array.pxi", line 110, in pyarrow.lib._handle_arrow_array_protocol
File "/Users/igorlimarochaazevedo/Programming/UTokyo/LabSuzumura/OpenP5-fork/venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 189, in __arrow_array__
out = pa.array(cast_to_python_objects(data, only_1d_for_numpy=True))
File "pyarrow/array.pxi", line 327, in pyarrow.lib.array
File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values
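This ArrowInvalid is typically raised when a column handed to the Arrow writer contains lists in some rows and bare scalars in others, so Arrow cannot infer a single column type. A minimal sketch of such a batch in pure Python (the field values below are invented for illustration, not taken from train.py):

```python
# Hypothetical batch column resembling tokenizer output: two rows are
# lists of token ids, one row is a bare scalar (e.g. from a padding bug).
column = [
    [101, 2023, 102],
    [101, 7592, 102],
    0,  # non-list value mixed into a list-valued column
]

# Arrow refuses to build an array from such a mix; the same check in
# pure Python shows why the batch is invalid.
kinds = {isinstance(value, list) for value in column if value is not None}
print(len(kinds) > 1)  # True: list and non-list values are mixed
```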
Issue description:
I've executed the scripts generate_dataset.py and generate_dataset_eval.py for the Electronics and Beauty datasets. As a result, four files were generated, two for each dataset, as shown below:
Beauty_sequential,straightforward_sequential_train.json
Beauty_sequential,straightforward_sequential_validation_seen:0.json
Electronics_sequential,straightforward_sequential_train.json
Electronics_sequential,straightforward_sequential_validation_seen:0.json
However, when I execute the train.py file, the following error happens:
pyarrow.lib.ArrowInvalid: cannot mix list and non-list, non-null values
Does anyone know what could be the cause of this error?
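One way to narrow this down (a debugging sketch, not a confirmed fix) is to scan the batch returned by the preprocessing function right before the crash and report any field whose values mix list and non-list types. The helper name and the batch contents below are hypothetical:

```python
def mixed_type_fields(batch):
    """Return names of fields whose values mix list and non-list types
    (None values ignored), i.e. the columns Arrow would reject."""
    bad = []
    for name, values in batch.items():
        kinds = {isinstance(v, list) for v in values if v is not None}
        if len(kinds) > 1:
            bad.append(name)
    return bad

# Hypothetical batch shaped like a tokenizer's map() output.
batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6]],  # consistent: all lists
    "labels": [[7, 8], 9],                # mixed: list and scalar
}
print(mixed_type_fields(batch))  # ['labels']
```

Calling something like this inside the function passed to Dataset.map() would point at the offending column before the Arrow writer raises.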