e.g. (I didn't move the to_pyarrow() call into the function; a sketch of the moved-in variant follows the snippet below):
%%time
import time

import pandas as pd
from deltalake.writer import write_deltalake

# total_files, files_to_upload_full_Path, and results are defined earlier in the notebook
chunk_len = total_files  # i.e. process everything in a single chunk
if len(files_to_upload_full_Path) > 0:
    for i in range(0, len(files_to_upload_full_Path), chunk_len):
        chunk = files_to_upload_full_Path[i : i + chunk_len]
        ##########################
        start = time.time()
        df = ibis_clean_csv(chunk)
        print(f"time to clean: {time.time() - start}")
        # df.to_delta(
        #     "./lakehouse/default/Tables/scada_ibis",
        #     mode="append",
        #     partition_by=["year"],
        #     storage_options={"allow_unsafe_rename": "true"},
        # )
        write_deltalake(
            "./lakehouse/default/Tables/scada_ibis2",
            df.to_pyarrow(),
            mode="append",
            partition_by=["year"],
            storage_options={"allow_unsafe_rename": "true"},
        )
        del df
        print("Ibis total: " + str(time.time() - start))
        # prepend this run's timing to the shared results DataFrame
        results = pd.concat(
            [
                pd.DataFrame(
                    [["Ibis", i, time.time() - start]], columns=results.columns
                ),
                results,
            ]
        )
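For reference, a minimal sketch of what moving the to_pyarrow() call into the cleaning function could look like. The real ibis_clean_csv is not shown in this thread, so the connection and the cleaning steps here are only assumptions:

import ibis

def ibis_clean_csv_arrow(paths):
    # Hypothetical variant of ibis_clean_csv that returns a pyarrow.Table;
    # the actual cleaning logic is not shown in this issue.
    con = ibis.duckdb.connect()
    t = con.read_csv(paths)  # lazily scan the chunk of CSV files
    # ... the real renames/casts/filters would go here ...
    return t.to_pyarrow()  # materialize to Arrow inside the function

The loop above could then pass ibis_clean_csv_arrow(chunk) straight to write_deltalake without a separate to_pyarrow() step.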
it is better now, 69 seconds, but that is still twice as slow as DuckDB?
Let's continue over in https://github.com/ibis-project/ibis/issues/9408 (I probably should have just responded there). I did not see a 2x difference like that, but I need to re-download the data and double-check.
Hi @djouallah, in response to https://github.com/ibis-project/ibis/issues/9408:

We still don't understand the underlying issue (we'll respond there when we do), but to get Ibis performance on par with DuckDB you can work around it by doing the same as for DuckDB and some of the other frameworks: call ibis_table.to_pyarrow() in your function, then call the deltalake.writer.write_deltalake function directly on the Arrow table. The only difference I can see between that and ibis_table.to_delta() is that to_delta() writes with PyArrow batches: https://github.com/ibis-project/ibis/blob/main/ibis/backends/__init__.py#L538-L548

We'll respond on the main issue once we understand it better.
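For concreteness, a minimal sketch of the two code paths being compared. The connection and CSV glob below are placeholders standing in for the cleaning step; the table paths and write options mirror the benchmark snippet above:

import ibis
from deltalake.writer import write_deltalake

con = ibis.duckdb.connect()
t = con.read_csv("scada/*.csv")  # placeholder for the cleaned Ibis table

# Workaround: materialize the expression to a single pyarrow.Table,
# then call write_deltalake directly on it.
write_deltalake(
    "./lakehouse/default/Tables/scada_ibis2",
    t.to_pyarrow(),
    mode="append",
    partition_by=["year"],
    storage_options={"allow_unsafe_rename": "true"},
)

# Built-in path: Table.to_delta() forwards the same keyword arguments to
# write_deltalake, but feeds it a stream of PyArrow record batches rather
# than one materialized table (see the linked backend source).
t.to_delta(
    "./lakehouse/default/Tables/scada_ibis",
    mode="append",
    partition_by=["year"],
    storage_options={"allow_unsafe_rename": "true"},
)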