dask / dask


Negative lookahead suddenly incorrectly parsed #11226

Status: Closed (closed by manschoe 2 months ago)

manschoe commented 2 months ago

Since upgrading to Dask 2024.2.1, a regex containing a negative lookahead that used to work is rejected as invalid.

import dask.dataframe as dd
regex = 'negativelookahead(?!/check)'
ddf = dd.from_dict(
    {
        "test": ["negativelookahead", "negativelookahead/check/negativelookahead", ],
    },
    npartitions=1)
ddf["test"].str.contains(regex).head()

This results in the following error:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[2], line 8
      2 regex = 'negativelookahead(?!/check)'
      3 ddf = dd.from_dict(
      4     {
      5         "test": ["negativelookahead", "negativelookahead/check/negativelookahead", ],
      6     },
      7     npartitions=1)
----> 8 ddf["test"].str.contains(regex).head()

File /opt/conda/lib/python3.10/site-packages/dask_expr/_collection.py:702, in FrameBase.head(self, n, npartitions, compute)
    700 out = new_collection(expr.Head(self, n=n, npartitions=npartitions))
    701 if compute:
--> 702     out = out.compute()
    703 return out

File /opt/conda/lib/python3.10/site-packages/dask_expr/_collection.py:476, in FrameBase.compute(self, fuse, **kwargs)
    474     out = out.repartition(npartitions=1)
    475 out = out.optimize(fuse=fuse)
--> 476 return DaskMethodsMixin.compute(out, **kwargs)

File /opt/conda/lib/python3.10/site-packages/dask/base.py:375, in DaskMethodsMixin.compute(self, **kwargs)
    351 def compute(self, **kwargs):
    352     """Compute this dask collection
    353 
    354     This turns a lazy Dask collection into its in-memory equivalent.
   (...)
    373     dask.compute
    374     """
--> 375     (result,) = compute(self, traverse=False, **kwargs)
    376     return result

File /opt/conda/lib/python3.10/site-packages/dask/base.py:661, in compute(traverse, optimize_graph, scheduler, get, *args, **kwargs)
    658     postcomputes.append(x.__dask_postcompute__())
    660 with shorten_traceback():
--> 661     results = schedule(dsk, keys, **kwargs)
    663 return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])

File /opt/conda/lib/python3.10/site-packages/dask_expr/_expr.py:3727, in Fused._execute_task(graph, name, *deps)
   3725 for i, dep in enumerate(deps):
   3726     graph["_" + str(i)] = dep
-> 3727 return dask.core.get(graph, name)

File /opt/conda/lib/python3.10/site-packages/dask_expr/_accessor.py:102, in FunctionMap.operation(obj, accessor, attr, args, kwargs)
    100 @staticmethod
    101 def operation(obj, accessor, attr, args, kwargs):
--> 102     out = getattr(getattr(obj, accessor, obj), attr)(*args, **kwargs)
    103     return maybe_wrap_pandas(obj, out)

File /opt/conda/lib/python3.10/site-packages/pyarrow/compute.py:263, in _make_generic_wrapper.<locals>.wrapper(memory_pool, options, *args, **kwargs)
    261 if args and isinstance(args[0], Expression):
    262     return Expression._call(func_name, list(args), options)
--> 263 return func.call(args, options, memory_pool)

File /opt/conda/lib/python3.10/site-packages/pyarrow/_compute.pyx:385, in pyarrow._compute.Function.call()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:154, in pyarrow.lib.pyarrow_internal_check_status()

File /opt/conda/lib/python3.10/site-packages/pyarrow/error.pxi:91, in pyarrow.lib.check_status()

ArrowInvalid: Invalid regular expression: invalid perl operator: (?!

Environment:

phofl commented 2 months ago

Thanks for your report.

Can you add the version that still worked?

manschoe commented 2 months ago

The last version that worked was 2023.9.1.

phofl commented 2 months ago

Dask now converts your strings to Arrow-backed strings under the hood to improve performance and memory usage. Unfortunately, Arrow's regex engine doesn't support lookahead assertions, which is why the pattern is rejected; see https://github.com/apache/arrow/issues/40220
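For reference, here is a minimal sketch using only Python's standard-library `re` module (no Dask or Arrow involved) of what the pattern from the report matches. It illustrates that the pattern itself is valid for Python's engine, which is what object-dtype strings are matched with; the `(?!` error in the traceback comes from Arrow's RE2-based matcher, which has no lookaround support. The third sample string is an addition for illustration and is not part of the original report.

```python
import re

# Pattern from the report: "negativelookahead" NOT immediately followed by "/check".
pattern = re.compile(r"negativelookahead(?!/check)")

samples = [
    "negativelookahead",                          # from the report
    "negativelookahead/check/negativelookahead",  # from the report
    "negativelookahead/check",                    # extra sample, added for illustration
]

# re.search mirrors the "contains" semantics: a match anywhere in the string counts.
matches = [bool(pattern.search(s)) for s in samples]
print(matches)  # [True, True, False]
```

Note that the second string still matches: its first occurrence is followed by `/check`, but the trailing occurrence is not.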

You can disable this conversion with

dask.config.set({"dataframe.convert-string": False})

but this will slow things down and increase memory consumption considerably.

Closing here since there is nothing we can do on the Dask side.