RLHF-V / RLAIF-V

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

Error loading the parquet dataset #5

Closed charismaticchiu closed 2 months ago

charismaticchiu commented 2 months ago

Hi, I am getting this error when loading the DPO dataset; does anyone know how to resolve it? Thank you!

I see the error even though my pandas version is 2.2.2.

```
>>> pd.read_parquet("code/eagle-dev/RLAIF-V-Dataset/RLAIF-V-Dataset_with_logp_llava15_base.parquet")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 503, in read_parquet
    return impl.read(
  File "/usr/local/lib/python3.10/dist-packages/pandas/io/parquet.py", line 251, in read
    result = self.api.parquet.read_table(
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 1811, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/usr/local/lib/python3.10/dist-packages/pyarrow/parquet/core.py", line 1454, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 562, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 3804, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Nested data conversions not implemented for chunked array outputs
```

Haoye17 commented 2 months ago

Hello @charismaticchiu,

Thank you very much for your interest in our work! Regarding the issue you mentioned, we are currently investigating the cause and expect to have an update within the next couple of days.

Thank you again for your interest!

Haoye17 commented 2 months ago

Hello @charismaticchiu,

After investigating, we found that the original code saves all data into a single parquet file, which results in a file that is too large and likely causes the error you encountered.

To address this, we have updated the inference logp code and the data reading code in the dataset. The logp files are now stored in chunks of 5000 rows per file.

Please update your code to the latest version and see if this resolves the issue ~

If you have any other questions, please don't hesitate to ask. We're here to help!

charismaticchiu commented 2 months ago

I was using pickle as a temporary hack; the new updates should work too. Thanks!
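A pickle-based stopgap like the one mentioned might look like the following: serialize the DataFrame with pandas' built-in pickle helpers instead of parquet, sidestepping pyarrow's nested-conversion limitation entirely. This is a generic sketch with illustrative filenames, not the commenter's actual code, and pickle files are Python-specific and unsafe to load from untrusted sources.

```python
import pandas as pd


def save_df_pickle(df: pd.DataFrame, path: str) -> None:
    """Serialize a DataFrame (nested columns included) via pickle."""
    df.to_pickle(path)


def load_df_pickle(path: str) -> pd.DataFrame:
    """Load a DataFrame previously saved with to_pickle."""
    return pd.read_pickle(path)
```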