bespokelabsai / curator

Synthetic Data curation for post-training and structured data extraction
https://docs.bespokelabs.ai/bespoke-curator
Apache License 2.0

create_dataset_files fails with casting error #142

Open RyanMarten opened 1 week ago

RyanMarten commented 1 week ago
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ray/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/working_dir_files/_ray_pkg_97e2f3f4563fca9a/dcft/generate.py", line 118, in <module>
    main(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/working_dir_files/_ray_pkg_97e2f3f4563fca9a/dcft/generate.py", line 69, in main
    manager.run_framework(framework)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/working_dir_files/_ray_pkg_97e2f3f4563fca9a/dcft/data_strategies/synthetic_data_manager.py", line 193, in run_framework
    self.run()
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/working_dir_files/_ray_pkg_97e2f3f4563fca9a/dcft/data_strategies/synthetic_data_manager.py", line 505, in run
    results = self.wait_for_results(waitables, no_return=self.no_return)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/working_dir_files/_ray_pkg_97e2f3f4563fca9a/dcft/data_strategies/synthetic_data_manager.py", line 468, in wait_for_results
    shard_saved_successfully = ray.get(ready_waitable)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2664, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 871, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::_save_shard() (pid=3412950, ip=10.120.2.5)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::completions__oh-dcft-v3.1-gpt-4o-mini::reannotate_OH() (pid=3427327, ip=10.120.2.5, actor_id=6ef2e85c87f79db376b9b8e739000000, repr=<engine.operators.completions_operator._Completions object at 0x7a3a8dbcc9d0>)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/working_dir_files/_ray_pkg_97e2f3f4563fca9a/engine/operators/completions_operator.py", line 288, in completions
    dataset = completion(dataset, working_dir=curator_cache_dir)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/bespokelabs/curator/prompter/prompter.py", line 128, in __call__
    return self._completions(self._request_processor, dataset, working_dir)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/bespokelabs/curator/prompter/prompter.py", line 221, in _completions
    dataset = request_processor.run(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/bespokelabs/curator/request_processor/openai_online_request_processor.py", line 186, in run
    dataset = self.create_dataset_files(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/bespokelabs/curator/request_processor/base_request_processor.py", line 318, in create_dataset_files
    writer.write(row)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 537, in write
    self.write_examples_on_file()
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 495, in write_examples_on_file
    self.write_batch(batch_examples=batch_examples)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 605, in write_batch
    arrays.append(pa.array(typed_sequence))
  File "pyarrow/array.pxi", line 250, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 114, in pyarrow.lib._handle_arrow_array_protocol
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/arrow_writer.py", line 243, in __arrow_array__
    out = cast_array_to_feature(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 2013, in cast_array_to_feature
    casted_array_values = _c(array.values, feature[0])
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 2005, in cast_array_to_feature
    arrays = [
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 2006, in <listcomp>
    _c(array.field(name) if name in array_fields else null_array, subfeature)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 2102, in cast_array_to_feature
    return array_cast(
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 1797, in wrapper
    return func(array, *args, **kwargs)
  File "/tmp/ray/session_2024-11-14_19-11-03_110436_1/runtime_resources/pip/1cd47bbef4925bc60239bbf4436b7e80e54fd222/virtualenv/lib/python3.10/site-packages/datasets/table.py", line 1948, in array_cast
    raise TypeError(f"Couldn't cast array of type {_short_str(array.type)} to {_short_str(pa_type)}")
TypeError: Couldn't cast array of type double to null
RyanMarten commented 1 week ago

This happened on a large-scale job with 1M responses, so one of them must be malformed.

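To find the offending row, wrap the write in a try/except and drop into the debugger:

    try:
        writer.write(row)
    except TypeError as e:
        # Stop at the row that cannot be cast so it can be inspected.
        import pdb; pdb.set_trace()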
RyanMarten commented 1 week ago
{'conversations': [{'from': 'system', 'value': 'You are an AI assistant. User will you give you a task. Your goal is to complete the task as faithfully as you can. While performing the task think step-by-step and justify your steps.', 'weight': None}, {'from': 'human', 'value': 'Based on this review, would the user recommend this product? === Review: PIG DESTROYER - Terrifyer-This is some dark, twisted and evil music... yet I can\'t help but love it! PxDx is a 3 piece grind band that fuses many influences and has a sound so thick that it would consume many 5 pieces in their entirety. The blister guitar work of Scott Hull drives the charge while the rhythmically precise drumming of Brian Harvey holds it together. J.R. Hayes, who has my vote for the craziest lyricist vocalist since Today is the Day\'s Steve Austin, is in charge on the mic, pushing chaotic extreme vocals to new levels with a voice which is brutal yet still (at times) audible.So what you ask does PxDx sound like you might ask?Devastation! Destruction! Insanity! Chaos! Violence! The Apocalypse!........For extreme metal fans only, this is a treat for the ears.Favorite Songs: Thumbsucker, Gravedancer, Carrion Fairy, Towering Flesh, and The Gentleman.-4.5 StarsIF YOU LIKED, AGREED OR APPRECIATED THIS. PLEASE CLICK YES FOR:"Was this review helpful?" Answer:\nChoose your answer from: i. No ii. Yes\nAnswer:', 'weight': 0.0}, {'from': 'gpt', 'value': 'ii. Yes', 'weight': 1.0}], 'instruction': 'Based on this review, would the user recommend this product? === Review: PIG DESTROYER - Terrifyer-This is some dark, twisted and evil music... yet I can\'t help but love it! PxDx is a 3 piece grind band that fuses many influences and has a sound so thick that it would consume many 5 pieces in their entirety. The blister guitar work of Scott Hull drives the charge while the rhythmically precise drumming of Brian Harvey holds it together. J.R. Hayes, who has my vote for the craziest lyricist vocalist since Today is the Day\'s Steve Austin, is in charge on the mic, pushing chaotic extreme vocals to new levels with a voice which is brutal yet still (at times) audible.So what you ask does PxDx sound like you might ask?Devastation! Destruction! Insanity! Chaos! Violence! The Apocalypse!........For extreme metal fans only, this is a treat for the ears.Favorite Songs: Thumbsucker, Gravedancer, Carrion Fairy, Towering Flesh, and The Gentleman.-4.5 StarsIF YOU LIKED, AGREED OR APPRECIATED THIS. PLEASE CLICK YES FOR:"Was this review helpful?" Answer:\nChoose your answer from: i. No ii. Yes\nAnswer:', 'response': 'ii. Yes', 'model_response': 'ii. Yes'}

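It actually has to do with the weight field being None here. The system message has 'weight': None while the other two messages have 0.0 and 1.0. Presumably the writer inferred the field's type as null from earlier rows where every weight was None, and it then cannot cast these double values to null, which matches the "Couldn't cast array of type double to null" error above.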

A quick workaround is to drop the conversations column before processing. However, we should add error handling for this and print out the type error with a helpful message.
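The error handling suggested above could look roughly like the sketch below. `write_row_with_context` and `find_none_paths` are hypothetical helpers, not part of curator or `datasets`. The only assumption about `writer` is that it exposes a `write(row)` method that raises `TypeError` on a cast failure, as in the traceback.

```python
from typing import Any, Iterator


def find_none_paths(value: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every None found in a nested dict/list row."""
    if value is None:
        yield path or "<root>"
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from find_none_paths(item, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from find_none_paths(item, f"{path}[{i}]")


def write_row_with_context(writer: Any, row: dict, row_index: int) -> None:
    """Write one row; on TypeError, re-raise with the row index and None fields."""
    try:
        writer.write(row)
    except TypeError as e:
        none_paths = list(find_none_paths(row))
        # Hypothetical message wording; a None field is only a likely cause.
        raise TypeError(
            f"Failed to write row {row_index}: {e}. "
            f"Fields that are None in this row: {none_paths or 'none found'}. "
            "A field that is None in some rows and typed in others is a likely cause."
        ) from e
```

For the row above, `find_none_paths` reports `conversations[0].weight`, which points straight at the problem field. For the quick workaround, if `dataset` is a `datasets.Dataset`, `dataset.remove_columns("conversations")` drops the column before it reaches the writer.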

There might also be a way to cast the value more intelligently.
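One possible direction, sketched in plain Python rather than with the Arrow types the writer actually uses: scan the rows first and treat None as "type not known yet" instead of as a null type, so a column that starts out as None is promoted once a concrete value shows up. `infer_column_types` is a hypothetical helper and only looks at top-level fields.

```python
from typing import Any, Dict, Iterable, Optional


def infer_column_types(rows: Iterable[Dict[str, Any]]) -> Dict[str, Optional[type]]:
    """Infer one Python type per column, promoting None to the first concrete type seen."""
    types: Dict[str, Optional[type]] = {}
    for row in rows:
        for key, value in row.items():
            seen = types.get(key)
            current = None if value is None else type(value)
            if seen is None:
                # Nothing concrete seen yet: adopt whatever this row has.
                types[key] = current
            elif current is not None and current is not seen:
                if {seen, current} == {int, float}:
                    types[key] = float  # widen int to float
                else:
                    raise TypeError(
                        f"Column {key!r}: cannot reconcile "
                        f"{seen.__name__} and {current.__name__}"
                    )
    return types
```

A column that is None in every row stays `None` in the result, which a caller could map to a default type or drop. Whether the real fix belongs in curator or in how the schema is passed to the writer is left open here.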