fugue-project / fugue

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
https://fugue-tutorials.readthedocs.io/
Apache License 2.0

[BUG] fugue_sql intermittently throwing segmentation fault errors #462

Open jstammers opened 1 year ago

jstammers commented 1 year ago

Minimal Code To Reproduce

Describe the bug: I have a set of unit tests for code that uses the fugue_sql API with a DuckDB backend. When I run these tests locally, they all pass. However, when I run them as part of a GitHub Actions workflow, I frequently hit a segmentation fault at the following location:

Current thread 0x00007f4e61554740 (most recent call first):
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/dataframe.py", line 101 in as_arrow
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/dataframe.py", line 110 in as_local_bounded
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/dataframe/dataframe.py", line 90 in as_local
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/execution_engine.py", line 521 in convert_yield_dataframe
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_tasks.py", line 147 in set_result
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_tasks.py", line 293 in execute
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 683 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 171 in run_single
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 155 in run_tasks
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 129 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 270 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_workflow_context.py", line 54 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/workflow.py", line 1584 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/sql/api.py", line 107 in fugue_sql

The function that fails has the following form:

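import pandas as pd
from fugue import api as fa


def filter_df(
    df: pd.DataFrame,
    outlets: pd.DataFrame,
    adjustments: pd.DataFrame,
) -> pd.DataFrame:
    # FugueSQL: join adjustments to outlets, then join that result to df.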
    query = """keys = SELECT DateId, ProductId, LocationId, AdjustmentFactor, AdjustmentType, id
    FROM adjustments INNER JOIN outlets USING (LocationId)
    fdt = SELECT * FROM keys INNER JOIN df USING (DateId, ProductId, LocationId)"""
    result = fa.fugue_sql(
        query,
        df=df,
        outlets=outlets,
        adjustments=adjustments,
        engine='duckdb',
        as_fugue=True,
    )
    return result.as_pandas()

Multiple unit tests call this function. The problem is difficult to isolate because I can't reproduce it locally.
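One way to chase an intermittent crash like this outside CI is to run the failing test body in a fresh interpreter many times and count how often the process dies with SIGSEGV. A minimal sketch using only the standard library follows. The helper name `count_segfaults` and the snippet passed to it are hypothetical, and the return-code check assumes a POSIX system such as a Linux CI runner.

```python
import signal
import subprocess
import sys


def count_segfaults(code: str, runs: int = 20) -> int:
    """Run `code` in `runs` fresh interpreters; return how many died with SIGSEGV."""
    crashes = 0
    for _ in range(runs):
        proc = subprocess.run(
            # -X faulthandler dumps the Python traceback to stderr on a fatal signal.
            [sys.executable, "-X", "faulthandler", "-c", code],
            capture_output=True,
            text=True,
        )
        # On POSIX, a process killed by a signal reports the negated signal number.
        if proc.returncode == -signal.SIGSEGV:
            crashes += 1
            # Surface the faulthandler dump so the crashing frame is visible.
            sys.stderr.write(proc.stderr)
    return crashes


# Placeholder snippet that should never crash; swap in the failing test body.
print(count_segfaults("print('ok')", runs=3))
```

Running each attempt in its own process keeps one crash from ending the whole loop, and a non-zero count on a local machine would at least show that the problem is not specific to the CI environment.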

In this instance, I was able to refactor my function to use the Fugue API instead, but it would be good to be able to use the fugue_sql API for more complex queries where SQL syntax is more suitable.

from fugue import api as fa

df = fa.join(...)
df = fa.filter(...)

Expected behavior: I would expect these unit tests to run successfully.

Environment (please complete the following information):

goodwanghan commented 1 year ago

@jstammers thanks for reporting. What duckdb version are you using?

I remember that in earlier DuckDB versions (<3) I often saw segmentation faults, but I have never seen this happen in later versions.

goodwanghan commented 1 year ago

One problem I have seen in unit tests that use DuckDB is weird behavior caused by a DuckDB connection not being properly closed at some step, which then causes issues in the following steps.
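If unclosed connections are the cause, one option is to give each test its own connection and close it on teardown. The sketch below shows the pattern with a generic context manager. `scoped_connection` and `RecordingConnection` are hypothetical names, and the stand-in class is there only so the example runs without DuckDB installed.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def scoped_connection(connect: Callable[[], Any]) -> Iterator[Any]:
    """Open a connection with `connect` and always close it on exit."""
    con = connect()
    try:
        yield con
    finally:
        # Runs even if the test body raises, so nothing leaks into the next test.
        con.close()


class RecordingConnection:
    """Stand-in for a real connection; it only records whether close() was called."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


with scoped_connection(RecordingConnection) as con:
    print("inside block, closed =", con.closed)
print("after block, closed =", con.closed)
```

With DuckDB, the factory would presumably be `duckdb.connect`, and the resulting connection could be passed as `engine=con` to `fa.fugue_sql`, assuming the installed Fugue DuckDB backend accepts a connection object as the engine. Whether this avoids the crash reported here is untested.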

jstammers commented 1 year ago

Hi @goodwanghan, thanks for looking into this. I'm currently using 0.7.1, which I believe is the latest version. It wouldn't surprise me if this is related to a previous DuckDB connection not being properly closed, but for now I will stick with the Fugue API.