e.g. (I didn't move the to_pyarrow() call into the function; a sketch of the moved-in variant follows the snippet below):
%%time
import time

import pandas as pd
from deltalake.writer import write_deltalake

# total_files, files_to_upload_full_Path, and results are defined earlier in the notebook
chunk_len = total_files  # i.e. process everything in a single chunk
if len(files_to_upload_full_Path) > 0:
    for i in range(0, len(files_to_upload_full_Path), chunk_len):
        chunk = files_to_upload_full_Path[i : i + chunk_len]
        ##########################
        start = time.time()
        df = ibis_clean_csv(chunk)
        print(f"time to clean: {time.time() - start}")
        # df.to_delta(
        #     "./lakehouse/default/Tables/scada_ibis",
        #     mode="append",
        #     partition_by=["year"],
        #     storage_options={"allow_unsafe_rename": "true"},
        # )
        write_deltalake(
            "./lakehouse/default/Tables/scada_ibis2",
            df.to_pyarrow(),
            mode="append",
            partition_by=["year"],
            storage_options={"allow_unsafe_rename": "true"},
        )
        del df
        print("Ibis total: " + str(time.time() - start))
        # prepend this run's timing to the shared results DataFrame
        results = pd.concat(
            [
                pd.DataFrame(
                    [["Ibis", i, time.time() - start]], columns=results.columns
                ),
                results,
            ]
        )
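For reference, a minimal sketch of what moving the to_pyarrow() call into the cleaning function could look like. The real ibis_clean_csv is not shown in this thread, so the connection and the cleaning steps here are only assumptions:

import ibis

def ibis_clean_csv_arrow(paths):
    # Hypothetical variant of ibis_clean_csv that returns a pyarrow.Table;
    # the actual cleaning logic is not shown in this issue.
    con = ibis.duckdb.connect()
    t = con.read_csv(paths)  # lazily scan the chunk of CSV files
    # ... the real renames/casts/filters would go here ...
    return t.to_pyarrow()  # materialize to Arrow inside the function

The loop above could then pass ibis_clean_csv_arrow(chunk) straight to write_deltalake without a separate to_pyarrow() step.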
it is better now, 69 seconds, but that is still twice as slow as DuckDB?
Let's continue over in https://github.com/ibis-project/ibis/issues/9408 (I probably should have just responded there). I did not see a 2x difference like that, but I need to re-download the data and double-check.
Hi @djouallah, in response to https://github.com/ibis-project/ibis/issues/9408:

We still don't understand the underlying issue (we'll respond there when we do), but to get Ibis performance on par with DuckDB you can work around it by doing the same as for DuckDB and some of the other frameworks: call ibis_table.to_pyarrow() in your function, then call the deltalake.writer.write_deltalake function directly on the Arrow table. The only difference I can see between that and ibis_table.to_delta() is that to_delta() writes with PyArrow batches: https://github.com/ibis-project/ibis/blob/main/ibis/backends/__init__.py#L538-L548

We'll respond on the main issue once we understand it better.
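For concreteness, a minimal sketch of the two code paths being compared. The connection and CSV glob below are placeholders standing in for the cleaning step; the table paths and write options mirror the benchmark snippet above:

import ibis
from deltalake.writer import write_deltalake

con = ibis.duckdb.connect()
t = con.read_csv("scada/*.csv")  # placeholder for the cleaned Ibis table

# Workaround: materialize the expression to a single pyarrow.Table,
# then call write_deltalake directly on it.
write_deltalake(
    "./lakehouse/default/Tables/scada_ibis2",
    t.to_pyarrow(),
    mode="append",
    partition_by=["year"],
    storage_options={"allow_unsafe_rename": "true"},
)

# Built-in path: Table.to_delta() forwards the same keyword arguments to
# write_deltalake, but feeds it a stream of PyArrow record batches rather
# than one materialized table (see the linked backend source).
t.to_delta(
    "./lakehouse/default/Tables/scada_ibis",
    mode="append",
    partition_by=["year"],
    storage_options={"allow_unsafe_rename": "true"},
)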