Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
When running my pipeline, I seem to be getting this error:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 734, in _output_queue_loop
self._process_batch(batch)
File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/base.py", line 794, in _process_batch
self._write_buffer.add_batch(batch) # type: ignore
File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 102, in add_batch
self._write(step_name)
File "/usr/local/lib/python3.10/dist-packages/distilabel/pipeline/write_buffer.py", line 135, in _write
table = table.cast(new_schema)
File "pyarrow/table.pxi", line 4547, in pyarrow.lib.Table.cast
ValueError: Target schema's field names are not matching the table's field names: ['listing_id', 'listing_text', 'profiles', 'instruction', 'generation', 'model_name', 'cv_sections'], ['cv_sections', 'profiles', 'listing_id', 'listing_text', 'instruction', 'generation', 'model_name']
It looks like the table's schema and new_schema must have their fields in exactly the same order, or else this error is raised. There is even a GitHub issue among the pyarrow maintainers discussing whether they should relax this constraint (https://github.com/apache/arrow/issues/27425).
I think there needs to be some logic that rearranges new_schema into the same field order as the table's schema before the cast, to avoid this error.