argilla-io / distilabel

Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
https://distilabel.argilla.io
Apache License 2.0

[BUG] ValueError raised in write_buffer.py when pyarrow.Table.cast is called #935

Open afolabiaji opened 3 weeks ago

afolabiaji commented 3 weeks ago

When running my pipeline, I seem to be getting this error:

Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 734, in _output_queue_loop
    self._process_batch(batch)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 794, in _process_batch
    self._write_buffer.add_batch(batch)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 102, in add_batch
    self._write(step_name)
  File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 135, in _write
    table = table.cast(new_schema)
  File "pyarrow/table.pxi", line 4547, in pyarrow.lib.Table.cast

ValueError: Target schema's field names are not matching the table's field names: ['listing_id', 'listing_text', 'profiles', 'instruction', 'generation', 'model_name', 'cv_sections'], ['cv_sections', 'profiles', 'listing_id', 'listing_text', 'instruction', 'generation', 'model_name']

It looks like the table's schema and new_schema must list their fields in exactly the same order, or else this error is raised. There is even a GitHub issue where the pyarrow maintainers discuss whether they should relax this constraint (https://github.com/apache/arrow/issues/27425).

I think some logic is needed to reorder new_schema so that its fields are in the same order as the table's schema before the cast is attempted.

gabrielmbmb commented 1 week ago

Hi @afolabiaji, I'll check and see if we can do something on the distilabel side to fix the issue.