duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0
143 stars · 27 forks

Clean up `/tmp` directory #15

Open sl-solution opened 1 year ago

sl-solution commented 1 year ago

I noticed that on-disk solutions may create large temporary files during their runs but may not clean up afterward (e.g. polars creates `.ipc` files). This can cause undefined exception errors for other solutions when they run within the same session.

hkpeaks commented 1 year ago

Today I ran a benchmark for DuckDB: https://youtu.be/zVR77B2bDR0. The temporary files should be cleaned up after the process completes.

Tmonster commented 1 year ago

Can you provide reproducible steps for when an undefined exception is caused by a temporary file from a different solution (in the same session)?

sl-solution commented 1 year ago

> Can you provide reproducible steps for when an undefined exception is caused by a temporary file from a different solution (in the same session)?

I found this when I was using the _utils/repro.sh script to reproduce results for smaller data sets on a computer with a limited hard disk. I noticed that after some point all solutions failed to produce any result, and with a little investigation I figured out that the hard drive was full (due to temporary files created during the benchmark run). I would imagine that for large data sets the /tmp directory would be bloated by huge files.
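One cheap way to catch this failure mode early is to check free space in the temp location before each run and abort with a clear message instead of letting solutions fail with opaque errors. A minimal sketch (the 20 GB threshold is an arbitrary assumption, not a value from this repo):

```python
import shutil
import sys


def free_gb(path="/tmp"):
    """Return the free disk space at `path` in gigabytes."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)


def check_tmp_space(min_gb=20.0, path="/tmp"):
    """Exit with a clear error if `path` has less than `min_gb` GB free."""
    avail = free_gb(path)
    if avail < min_gb:
        sys.exit(f"only {avail:.1f} GB free in {path}; "
                 f"need at least {min_gb} GB -- clean up temp files first")
```

Running such a check between solutions would turn a silent cascade of failures into a single actionable error.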

jangorecki commented 1 year ago

I can confirm that disk space was never a concern before, and the scripts generally don't handle this kind of exception.

Tmonster commented 1 year ago

I actually noticed this issue too when getting the benchmark back up and running, though I never hit the case where another solution encountered an undefined exception.

Tmonster commented 1 year ago

@sl-solution If you still believe this would be a problem, feel free to open a PR to automatically clean the /tmp directory after every run.

sl-solution commented 1 year ago

> @sl-solution If you still believe this would be a problem, feel free to open a PR to automatically clean the /tmp directory after every run.

In juliads I made sure this is done automatically. However, I am not sure deleting everything from /tmp is a good idea, since some of the files may be essential for other system processes.

Tmonster commented 1 year ago

I wouldn't delete everything from /tmp, of course, but for R solutions it would be everything in tempdir(). Potentially all R solutions could use the same location for tempdir(), and then it could be cleaned up when the benchmarking ends.

sl-solution commented 1 year ago

I guess for polars it should be straightforward, since it uses an absolute path and constant names for its temporary files.
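If the leftover files really do have a constant naming pattern, a targeted sweep is safer than wiping /tmp wholesale. A sketch of that idea (the `*.ipc` pattern is taken from this thread; any other pattern or directory is an assumption to adjust per solution):

```python
import glob
import os


def remove_leftovers(tmp_dir="/tmp", pattern="*.ipc"):
    """Delete leftover temp files matching `pattern` under `tmp_dir`.

    Only files matching the known pattern are touched, so unrelated
    files in the shared temp directory are left alone. Returns the
    list of paths that were removed."""
    removed = []
    for path in glob.glob(os.path.join(tmp_dir, pattern)):
        try:
            os.remove(path)
            removed.append(path)
        except OSError:
            pass  # the file may still be in use by a running solution
    return removed
```

This keeps the concern sl-solution raised above in mind: nothing outside the matched pattern is deleted.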

hkpeaks commented 1 year ago

> I think sorting of billion rows requires the use of temporary files.

I have implemented billion-row join/filter/group-by using only 32 GB of RAM, and in fact it is verified that no temp files are needed.

sl-solution commented 1 year ago

I think a systematic way to solve the issue is to assign a directory for temporary files and ask every solution to use only that directory for on-disk calculations. The launcher can then clean the directory after each run.
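The launcher side of this proposal can be sketched in a few lines: give each run a dedicated scratch directory, export it via the conventional TMPDIR variable (which R's tempdir() and Python's tempfile both honour on POSIX systems), and wipe it afterwards regardless of how the solution exited. This is a sketch of the idea, not code from the repo:

```python
import os
import shutil
import subprocess
import tempfile


def run_with_clean_tmp(cmd):
    """Run one solution with a dedicated temp directory, then wipe it.

    Solutions that honour the TMPDIR convention will place their
    scratch files under `scratch`; the directory is removed in the
    finally block even if the solution crashes or runs out of disk."""
    scratch = tempfile.mkdtemp(prefix="db-benchmark-")
    env = dict(os.environ, TMPDIR=scratch)
    try:
        return subprocess.run(cmd, env=env).returncode
    finally:
        shutil.rmtree(scratch, ignore_errors=True)
```

Solutions that write temp files to hard-coded absolute paths would still need per-solution handling, which is why asking every solution to respect the assigned directory is the key part of the proposal.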

Tmonster commented 10 months ago

Since the new machine has more memory, and instance storage, this has become less of an issue. Can this therefore be closed?

sl-solution commented 10 months ago

> Since the new machine has more memory, and instance storage, this has become less of an issue. Can this therefore be closed?

I guess as long as solutions keep using temp files, this will be an issue.