cboettig opened this issue 2 years ago
How much RAM does the environment have? The aggregation likely runs out of memory here. We still need to add an out-of-core aggregation operator.
Thanks! I'm running on a machine with 32 GB RAM (also apologies, just noticed my S3 bucket isn't configured right, fixing that now so the above example should work again)
This won't fix the problem, but with DuckDB 0.3.3 or above you could simplify your connection code to local parquet files:
parquet <- file.path("/home/shared-data/ebird/observations/*.parquet")
conn <- DBI::dbConnect(duckdb(), "/home/shared-data/ebird/db", config=list("memory_limit"="12GB"))
obs <- tbl(conn, paste0("read_parquet('", parquet, "')"))
It may also be possible to avoid the crash in the Arrow version by passing a memory-limited connection to to_duckdb(), since the default connection it otherwise creates does not define a memory limit:
path <- arrow::s3_bucket("ebird/observations", endpoint_override = "minio.carlboettiger.info")
conn <- DBI::dbConnect(duckdb(), "/home/shared-data/ebird/db", config=list("memory_limit"="12GB"))
obs <- arrow::open_dataset(path) |> to_duckdb(conn)
The download link you gave was not (yet) working, so I couldn't test with the actual data.
FYI, in case it is useful to you, there is an unofficial DuckDB R package available in my r-universe (updated after each merge to the DuckDB master branch; it also bundles DuckDB extensions that can easily be compiled with the default R toolchain): https://rsund.r-universe.dev/ui#package:duckdb
Thanks @rsund ! Thanks for sharing the above syntax; that's cool that you don't need to explicitly declare the creation of a VIEW or execute a PRAGMA.
Also I've fixed my S3 bucket address, sorry (forgot to include the subdir for month)
library(arrow)
library(dplyr)
library(duckdb)
path <- arrow::s3_bucket("ebird/Mar-2022/observations", endpoint_override = "minio.carlboettiger.info", anonymous=TRUE)
conn <- DBI::dbConnect(duckdb(), ":memory:", config=list("memory_limit"="12GB"))
obs <- arrow::open_dataset(path) |> to_duckdb(conn)
tmp <- obs |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarize(count = sum(observation_count, na.rm = TRUE),
            .groups = "drop")
tmp <- tmp |> compute() # crashes
Still crashing though, I suspect because of what @hannes pointed out already about the aggregation. Is there an existing issue thread I should watch for that? :eyes:
Well, you should not use a :memory: database in this case unless you also define a temp_directory:
conn <- DBI::dbConnect(duckdb(), ":memory:", config=list("memory_limit"="12GB", "temp_directory" = "/tmp"))
or use a file-based database:
conn <- DBI::dbConnect(duckdb(), "database.duckdb", config=list("memory_limit"="12GB"))
It is also unclear to me how Arrow handles memory allocation for online data sources; when DuckDB runs queries over them, the worst case may require holding whole columns (temporarily) in memory. In principle that should not happen, because the data is scanned in chunks and any excess memory consumption should be limited to the aggregate step, which is not yet tuned to use only limited memory.
In this particular case it should be quite straightforward to "manually" run the aggregate query in chunks, provided there is some sensible variable for splitting the data into pieces. But that is certainly more work, and it shouldn't be needed here, as both Arrow and DuckDB are expected to take care of larger-than-memory data.
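To make that chunking idea concrete, here is a hedged sketch (not tested against the real data): it assumes the individual parquet files under the local path from the earlier example are a reasonable split, relies on the tbl() parquet-registration feature mentioned below, and re-aggregates the per-file partial sums at the end because a group can span several files (SUM is decomposable, so summing the partial sums gives the same result). The "chunked.duckdb" file name is just an illustrative choice.
library(duckdb)
library(dplyr)
library(purrr)
con <- DBI::dbConnect(duckdb(), "chunked.duckdb")
files <- list.files("/home/shared-data/ebird/observations",
                    pattern = "\\.parquet$", full.names = TRUE)
# aggregate each parquet file separately; each partial result is assumed small enough for R
partials <- map_dfr(files, function(f) {
  tbl(con, sprintf("read_parquet('%s')", f)) |>
    group_by(sampling_event_identifier, scientific_name) |>
    summarise(count = sum(observation_count, na.rm = TRUE), .groups = "drop") |>
    collect()
})
# combine partial sums: groups that span several files are summed again
result <- partials |>
  group_by(sampling_event_identifier, scientific_name) |>
  summarise(count = sum(count), .groups = "drop")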
About the "new" features in R client: config
argument has existed there a long time already, but it was just not documented and all examples used PRAGMA
to change the setup. The support for registering parquet and other files / functions directly in dplyr::tbl()
is, however, a new feature. With DuckDB 0.3.3 also the DuckDB-tailored backend for dbplyr
was released and now e.g. the tidyr::fill()
works in queries as expected.
Thanks @rsund , these details are super helpful to me.
Re using :memory:, I didn't realize that a temp dir is not set by default. Is there a good reason that the R client does not default to the R tempdir automatically for this? (Unfortunately, in this case the high memory use occurs even with a file-based database, as in my previous example.)
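For illustration, a minimal sketch of what that default could look like in user code today, reusing the config list pattern shown above; the "duckdb_spill" subdirectory name is just an arbitrary choice, not a package convention.
library(duckdb)
con <- DBI::dbConnect(
  duckdb(), ":memory:",
  config = list(
    "memory_limit"   = "12GB",
    "temp_directory" = file.path(tempdir(), "duckdb_spill")  # spill location inside R's session tempdir
  )
)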
Yeah, it's also unclear to me how arrow handles memory allocations in this context. FWIW, as you probably know the same queries are possible in 'pure arrow' without the use of duckdb, but result in similar OOM crashes there, though perhaps for quite different reasons. (e.g. https://issues.apache.org/jira/browse/ARROW-15081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531027#comment-17531027). I'm a huge fan of these packages but like you say, being able to seamlessly work with larger-than-RAM data is really the killer feature here.
Thanks also for the heads up that the tidyr::fill() issue has been addressed, that's awesome. I'm closing https://github.com/duckdb/duckdb/issues/2372 then!
Arrow runs into the same problem. For DuckDB, out-of-core capability is absolutely one of our main goals; it's just not supported yet in every operator. The suggestion to use the R tempdir automatically is interesting.
@lnkuiper if you needed any reason to look at the aggregates again ^^
I think one approach could be to summarize over some partition, then aggregate, at least for simple counts. Often arrow files are partitioned by something. However, arrow doesn't have an intrinsic chunking method (by row number) that I'm aware of beyond the splitting of parquet files. duckdb offers traditional windows, so perhaps specifying a chunk size (nlines), windowing, and then mapping over the windows with a final summarization step would be appropriate. I don't know if this belongs inside the duckdb R package or rather as an ad-hoc piece of code.
@hannes et al any news on out-of-core capability in aggregates?
In some news, at least I can now reproduce the crash in pure SQL without the arrow wrapper for the S3 bit :wink: :
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';
COPY (
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT
      sampling_event_identifier,
      scientific_name,
      SUM(observation_count) AS count
    FROM (
      SELECT sampling_event_identifier, scientific_name, observation_count
      FROM read_parquet("s3://ebird/observations/*")
    )
    GROUP BY sampling_event_identifier, scientific_name
  )
) TO 'output.parquet' (FORMAT PARQUET);
This is a public bucket, so if I use the https:// addressing and limit to something considerably less than all 1124 partitions in the dataset, it doesn't crash. As such, I can probably hack together the query by working over partitions in chunks manually, but that seems less than ideal...
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
Any news on out-of-core capability in this case, or the ability to leverage a temp dir to avoid OOM kills on these operations? (As you already know, the above operations still lead to the OOM killer even in pure SQL.)
@cboettig Not sure, but potentially this PR will address your issue? https://github.com/duckdb/duckdb/pull/8475
@Tmonster Thanks for flagging me! I just tested the examples above again having installed from github, but sadly duckdb still just consumes all available RAM and crashes
Maybe also something where @lnkuiper could have a look - can we make a pure SQL repro for this?
pure SQL repro here, https://github.com/duckdb/duckdb-r/issues/72
@cboettig do you think you could make the dataset publicly accessible again?
I think the pure SQL example above should still be hitting a public-read bucket.
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';
COPY (
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT
      sampling_event_identifier,
      scientific_name,
      SUM(observation_count) AS count
    FROM (
      SELECT sampling_event_identifier, scientific_name, observation_count
      FROM read_parquet("s3://ebird/observations/*")
    )
    GROUP BY sampling_event_identifier, scientific_name
  )
) TO 'output.parquet' (FORMAT PARQUET);
lemme know if that's not working for you?
Hi, I've been able to reproduce it. It seems to be a bug indeed. I'm working on narrowing down the exact issue, but a current workaround would be:
create table observations as select * from read_parquet("s3://ebird/observations/*");

create table answer as
  SELECT *, count > 0.0 AS species_detected
  FROM (
    SELECT
      sampling_event_identifier,
      scientific_name,
      SUM(observation_count) AS count
    FROM (
      SELECT sampling_event_identifier, scientific_name, observation_count
      FROM observations
    )
    GROUP BY sampling_event_identifier, scientific_name
  );

SET preserve_insertion_order=false;
copy answer to 'output.parquet' (FORMAT PARQUET);
If you have a persistent database or set a temporary directory for an in-memory database this should work. Worked on my laptop with 16GB.
Thanks much for this @Tmonster , it's great to be able to work around the issue this way.
I'm still keen on having a way for users to execute this remotely without having to create the local copy, but it's really nice for this to be unstuck. Please share if you've had any further luck in pinpointing the exact issue that causes this query to OOM
@Tmonster maybe I spoke too soon? This is still hitting an OOM error on my machine with 48 GB RAM. Maybe this is a secondary issue with how back-pressure is implemented, since I'm running inside a container capped at 48 GB on a host machine with 64 GB RAM. I am setting both a tempdir path PRAGMA and a max memory limit PRAGMA.
Running on Ubuntu 22.04 on amd64 arch. (Here's the R version, though note it's just passing the SQL as above, I don't think R is really involved in the issue)
library(duckdb)
con <- dbConnect(duckdb())
pragmas <- paste0(
  "PRAGMA memory_limit='32GB';
PRAGMA temp_directory='", tempdir(), "';")
dbExecute(conn = con, pragmas)

query <- "
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';
create table observations as select * from read_parquet('s3://ebird/observations/*');
create table answer as SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
    sampling_event_identifier,
    scientific_name,
    SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM observations
  )
  GROUP BY sampling_event_identifier, scientific_name
);
SET preserve_insertion_order=false;
copy answer to 'ebird.parquet' (FORMAT PARQUET);
"
dbSendQuery(con, query)
Hmmm, I'm not positive, but there may be a difference between setting up a tempdir and connecting to a persistent database file. Can you try replacing the second line with con <- dbConnect(duckdb(), dbdir='test.db')? If you do that you won't need to establish a temp directory.
good idea but no luck there, still consumes all available RAM (even on my larger machine with 120 GB) and crashes
Just gathering some more potentially useful info:
How many cores does the machine have, and how many does duckdb detect? (select current_setting('threads');)
Also something worth trying to further troubleshoot: can you reproduce the issue using the CLI, since you mentioned it is just raw SQL anyway?
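For reference, a small sketch of gathering that information from R rather than the CLI, using the same settings mentioned above:
library(duckdb)
con <- DBI::dbConnect(duckdb())
DBI::dbGetQuery(con, "SELECT current_setting('threads') AS threads")  # how many threads duckdb detected
DBI::dbExecute(con, "PRAGMA threads=1")  # force single-threaded execution for the test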
Thanks -- great ideas here. I did try running in the CLI, and setting threads to 1 with PRAGMA threads=1; (the test machines have 24 and 128 threads respectively); unfortunately this still crashes OOM for me. hmmmm! :thinking:
This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 30 days.
hey everyone,
I have the same problem of excessive memory use and memory limits that are apparently not respected.
In the code below I created a 450M-row, 5-column table by replicating iris 3 million times. It takes about 16 GB of RAM according to lobstr::obj_size().
I am able to dbWriteTable() the table from memory to duckdb, but the R process memory usage doubles to reach 37 GB according to htop's RES column. I have set memory_limit to 1GB and threads to 1.
After this, I restart the session, reconnect to the duckdb database and try to collect the data into RAM. Again, the memory usage reaches over 32 GB while reading, before dropping down to around 16 GB when it is done.
Is this "RAM usage double the data size" situation normal?
library(DBI)
library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"
con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path), config = list("memory_limit" = "1GB"))
#dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")

# run this once to create the duckdb table, then restart the session:
if (FALSE) {
  bigdata <- data.table::rbindlist(rlang::rep_along(1:3e6, list(iris)))
  dim(bigdata)              # 450M rows, 5 columns
  lobstr::obj_size(bigdata) # 16.20 GB in RAM
  dbWriteTable(con, "straight_from_memory", bigdata)
}

bigdata <- tbl(con, "straight_from_memory") %>% collect()
Hi Simon,
I think what you are seeing here is the effect of DuckDB materializing the data from the table, allocating the R data frame, and then copying the data into R.
DuckDB doesn't know how much space it should allocate until it has all of the data, so first all the data needs to be materialized in DuckDB's memory (~16 GB). Then DuckDB needs to allocate an R data frame to hold all the data (~also 16 GB) so that R can use it. Once DuckDB passes off the data, it lives only in R, which is why you see 16 GB of memory in use again.
It is interesting that DuckDB doesn't provide any feedback that the memory limit has been reached in this case. That is something I can look into.
Thanks! I think I understand the process you describe is happening currently. I am not sure what "feedback" means in this situation and what the expected behaviour is.
Is it 1) It is expected that DuckDB will need to allocate the data in both DuckDB's memory (16 GB) and R's memory (another 16 GB), and feedback in this context means telling the user that the memory_limit can't be respected?
Or 2) It is expected that DuckDB will respect the memory limit in DuckDB's memory (memory_limit = 1GB), so that the total RAM use should be 1 GB (duckdb) + 16 GB (R)? In this case, I'm not sure what feedback means.
Sorry, my mention of feedback wasn't clear.
To clarify, feedback from DuckDB would be some kind of notification to the user that DuckDB requires more memory than what is specified by the memory_limit.
Normally, if DuckDB attempts to use more memory than specified by the memory limit, an Out of memory error is thrown, so I am curious as to why we don't see something like that here.
We have an internal issue to track this.
Gotcha, thanks again.
Just to make sure: there is no way to read/write data from a duckdb database to R without using twice as much RAM as the full data size at some point, correct?
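One possible mitigation to experiment with (a sketch only, reusing the connection and table from the example above; whether it actually lowers the peak depends on whether the driver materializes the full result before the first fetch, which this thread doesn't settle) is to stream the result in batches with standard DBI calls instead of collecting everything at once:
res <- DBI::dbSendQuery(con, "SELECT * FROM straight_from_memory")
while (!DBI::dbHasCompleted(res)) {
  chunk <- DBI::dbFetch(res, n = 1e6)  # pull roughly 1M rows at a time
  # ... aggregate or write out each chunk here instead of keeping all rows in R ...
}
DBI::dbClearResult(res)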
@SimonCoulombe: Could this be because dbWriteTable() doesn't actually write the table but creates a view to the data that still lives in R?
@cboettig: Is this still an issue? Is it the same as @SimonCoulombe's?
I'm pretty sure the data is written to disk: the file is created and I can read it after restarting the R session.
I just wanted to apply a few thousand lines of SQL code to a CSV/fst/parquet file without rewriting the wrangling code in R, but I run out of RAM because the dataset's footprint in RAM is doubled. I ended up just processing the data in chunks.
@krlmlr still an issue, though it may be a different one from @SimonCoulombe's. I think my example above is a symptom of not all the aggregation methods being implemented out-of-core yet? As noted above, I believe my example still crashes out of RAM when executed as pure SQL without R involved.
Thanks!
@cboettig: If this can be replicated with the command-line client, an issue in the upstream repo https://github.com/duckdb/duckdb/ would be useful.
@SimonCoulombe: Is your issue still a problem with the latest dev version? You can also install from r-universe: https://duckdb.r-universe.dev/duckdb# . If yes, let's discuss this in a new issue.
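For reference, installing the dev build from that r-universe typically looks like this (standard r-universe usage, not a duckdb-specific instruction):
install.packages("duckdb", repos = c("https://duckdb.r-universe.dev", "https://cloud.r-project.org"))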
@krlmlr thanks, it's been a while since I checked this in the CLI client. Just tested now with the identical code I provided above, and it is working just fine in the CLI, so it is indeed now just an R issue at this point.
I'm not using DBI's dbWriteTable explicitly above though, so I don't know if this is or isn't the same issue as @SimonCoulombe is reporting.
Again, here's the code that works fine in the CLI client but crashes when called via R. In the CLI it uses just above 30 GB (RES use peaks around 35.7 GB; VIRT use is higher, but I understand that doesn't matter so much).
PRAGMA memory_limit='30GB';
PRAGMA temp_directory='/tmp/duckdb';
INSTALL httpfs;
LOAD httpfs;
SET s3_endpoint='minio.carlboettiger.info';
SET s3_url_style='path';

-- try view instead
create table observations as select * from read_parquet('s3://ebird/observations/*');
create table answer as SELECT *, count > 0.0 AS species_detected
FROM (
  SELECT
    sampling_event_identifier,
    scientific_name,
    SUM(observation_count) AS count
  FROM (
    SELECT sampling_event_identifier, scientific_name, observation_count
    FROM observations
  )
  GROUP BY sampling_event_identifier, scientific_name
);
SET preserve_insertion_order=false;
copy answer to 'ebird.parquet' (FORMAT PARQUET);
Thanks. @cboettig, @SimonCoulombe: Is this still a problem with #90, which contains duckdb 0.10.0?
On it -- let me check.
(Installing using remotes::install_github(repo="duckdb/duckdb-r", ref = remotes::github_pull(90)).)
I'm afraid it didn't change anything.
I have installed the version from pull request 90 using remotes::install_github(repo="duckdb/duckdb-r", ref = remotes::github_pull(90)) and I have the same behaviour as before (32 GB RAM use when writing a 16 GB file, and RAM use also reaching 32 GB momentarily when reading the same 16 GB file).
library(DBI)
library(dplyr)
library(dbplyr)
library(duckdb)

duckdb_path <- "/devroot/sandbox/tmp/duckdb.duckdb"
con <- dbConnect(duckdb::duckdb(dbdir = duckdb_path))
dbExecute(con, "PRAGMA threads=1; PRAGMA memory_limit='1GB';")

# run this once to create the duckdb file, then restart the session:
if (FALSE) {
  bigdata <- data.table::rbindlist(rlang::rep_along(1:3e6, list(iris)))
  dim(bigdata)              # 450M rows, 5 columns
  lobstr::obj_size(bigdata) # 16.20 GB in RAM
  dbWriteTable(con, "straight_from_memory", bigdata)
}

bigdata <- tbl(con, "straight_from_memory") %>% collect()
hey all, I think I'm going crazy.
I think this peaks at 32 GB RAM use (before going down to 16 GB when it is done reading):
bigdata <- tbl(con, "straight_from_memory") %>% collect()
but if I add to_arrow(), it peaks at only 16 GB use:
bigdata <- tbl(con, "straight_from_memory") %>% to_arrow() %>% collect()
The 32 GB vs 16 GB peak is what I found by checking the RES column in htop, but the peakRAM package appears to agree that there is a difference:
peakRAM(bigdata <- tbl(con, "straight_from_memory") %>% collect())
Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 bigdata<-tbl(con,"straight_from_memory")%>%collect() 31.518 15449.9 15456.8
peakRAM(bigdata <- tbl(con, "straight_from_memory") %>% to_arrow() %>% collect())
Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 bigdata<-tbl(con,"straight_from_memory")%>%to_arrow()%>%collect() 20.912 0.7 4.2
@SimonCoulombe: The peakRAM package only measures memory allocated in R, although I agree there should be a difference.
I've opened a new issue to investigate, because this seems to be a different issue from what @cboettig is observing.
@cboettig: From looking at your example, I see no reason why it shouldn't work with the current dev version, soon to hit CRAN. That version contains exactly the same core as the v0.10.0 release.
I have split your script into an importing stage and a processing stage; the importing stage is now at about 2 GB with the progress still at 0%. I wonder how much data this example contains. More important, though, is the question of whether we can downsize the example so that it still shows the problem without having to wait that long.
Thanks much @krlmlr , really appreciate this. Yes, this is a large example (easy to check using S3, e.g.:
> library(minioclient)
> mc_alias_set("ex", "minio.carlboettiger.info", "", "")
Added `ex` successfully.
> mc_du("ex/ebird/observations/")
152GiB 1124 objects ebird/observations
I'd love to find a smaller example that illustrates the problem. You could certainly run the command on any one or any subset of the 1124 parquet files (mc_ls("ex/ebird/observations")) listed there, but I'm not sure how to replicate the problem that way -- when I try a really small subset (e.g. just part-0.parquet, say) it works just fine in about a second, even if I set something like PRAGMA memory_limit='1GB';. All the really small examples work great; the problem is just that the large real-world examples crash in R. I'm not clever enough to see how to reproduce that behavior with a small example.
Thanks. The memory limit can be as small as 10 MB or a small multiple; that's already an awful lot of memory blocks that the database engine can shuffle around. True, some operations don't work at all if the memory limit is too aggressive, but perhaps there's a sweet spot?
If you still have a duckdb file with the data, could you try running the query with the most recent duckdb package from r-universe? Chances are it'll just work.
On a side note, I can't install the MinIO client on my macOS M1. Known issue?
@krlmlr sweet, I'll test now!
Thanks for mentioning the minioclient issue -- that sounds like a bug. It's just installing the binary from https://min.io/docs/minio/macos/index.html (with CRAN permission to install a binary file); if you have a chance, can you see if the official mac binary from there works?
Maybe I'm not detecting the arm architecture correctly? https://github.com/cboettig/minioclient/blob/main/R/install_mc.R
Thanks! Transient issue, works now.
I stand corrected -- a 10 MB memory limit doesn't seem to be a good simulation of reality; perhaps 125 MB per thread is more realistic?
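A rough sketch of that rule of thumb on an existing connection con (the 125 MB/thread figure is just the suggestion above, not a documented formula):
threads <- 8  # whatever thread count you actually want duckdb to use
DBI::dbExecute(con, sprintf("SET threads=%d;", threads))
DBI::dbExecute(con, sprintf("SET memory_limit='%dMB';", 125 * threads))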
Still testing with this. I'm not entirely clear on the difference between the options
SET memory_limit = '1GB';
SET max_memory = '1GB';
Are these just aliases for the same thing? Also, from https://duckdb.org/docs/configuration/pragmas#memory-limit , it sounds like these aren't enforced on certain aggregation operations:
The specified memory limit is only applied to the buffer manager. For most queries, the buffer manager handles the majority of the data processed. However, certain in-memory data structures such as vectors and query results are allocated outside of the buffer manager. Additionally, aggregate functions with complex state (e.g., list, mode, quantile, string_agg, and approx functions) use memory outside of the buffer manager. Therefore, the actual memory consumption can be higher than the specified memory limit.
though I need to look more closely to understand if/when I encounter any of those cases.
What happens?
duckdb crashes when trying to perform queries on a large (~150 GB) parquet database
To Reproduce
Apologies, I cannot reproduce this with a smaller dataset. I can reproduce the same errors using remote S3 access though, which might be a bit faster than downloading a full copy of the dataset, or you could probably download a copy from my server in a few days. Reproducible examples of both remote and local access are below. (The remote example goes via arrow because I cannot get direct access to S3 buckets via duckdb alone yet.)
Or, after downloading from the above public S3 bucket (https://minio.carlboettiger.info/ebird/Mar-2022/observations), try local parquet access:
With the memory limit PRAGMA in place, the example doesn't crash the R session but does throw an OOM error:
Environment (please complete the following information):
Before Submitting
Have you tried this on the latest master branch?
Python: pip install duckdb --upgrade --pre
R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
By the way,
install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
no longer works, as there no longer seems to be a master-builds download?