duckdb / duckdb-r

The duckdb R package
https://r.duckdb.org/
Other
126 stars 24 forks source link

Install extension when using webr #66

Open dicorynia opened 9 months ago

dicorynia commented 9 months ago

It seems that when installing a duckdb extension (like httpfs) using r-wasm/webr and coatless/quarto-webr duckdb should use HTTPS for the connection to work. However it currently uses plain HTTP http://extensions.duckdb.org/

See coatless/quarto-webr#139

Is it possible to force HTTPS? Should I report this problem elsewhere?

krlmlr commented 8 months ago

Thanks. Can this be reproduced using the Node.JS API?

https://duckdb.org/docs/api/nodejs/overview.html

krlmlr commented 8 months ago

I'm confusing things. Probably we want to reproduce with wasm, but the shell in https://duckdb.org/docs/api/wasm/overview doesn't show where the package is installed from.

@georgestagg: Would you know which component is actually responsible for picking the download location, or how to understand this?

georgestagg commented 8 months ago

I spent some time today reading the duckdb and duckdb-wasm source code to try and understand what is going on. It turns out that the situation is much more complicated than OP originally thought and there are actually two unrelated issues here:

  1. Loading the HTTPFS extension is not supported in duckdb-wasm (See https://duckdb.org/docs/api/wasm/extensions.html#httpfs)

  2. Loading duckdb extensions does not work under webR in general.

No HTTPFS in duckdb-wasm

Loading the HTTPFS extension is not supported in duckdb-wasm. You can see this failure also happening in the DuckDB shell at https://shell.duckdb.org:

DuckDB Web Shell
Database: v0.10.0
Package:  @duckdb/duckdb-wasm@1.28.1-dev132.0

Connected to a local transient in-memory database.
Enter .help for usage hints.

duckdb> INSTALL httpfs;

duckdb> LOAD httpfs;
IO Error: Extension https://extensions.duckdb.org/v0.10.0/wasm_eh/httpfs.duckdb_extension.wasm is not available

duckdb> 

Instead, in the DuckDB shell you can just use URLs without loading HTTPFS at all:

duckdb> CREATE VIEW prenoms AS
   ...>     SELECT * 
   ...>     FROM 'https://static.data.gouv.fr/resources/base-prenoms-insee-format-parquet/20231121-161435/prenoms-nat2022.parquet';
┌┐
└┘

duckdb> select * from prenoms limit 1;
┌──────┬────────────────┬────────┬────────┐
│ sexe ┆ preusuel       ┆ annais ┆ nombre │
╞══════╪════════════════╪════════╪════════╡
│    1 ┆ _PRENOMS_RARES ┆ 1900   ┆   1249 │
└──────┴────────────────┴────────┴────────┘

This works because duckdb-wasm or duckdb-wasm-shell starts up a DuckDB browser runtime in the JavaScript environment and makes it available under the name BROWSER_RUNTIME. The runtime includes bindings for performing some operations, such as making a HTTP request, when running in a restricted Wasm environment. See, for example, the code at: https://github.com/duckdb/duckdb-wasm/blob/main/packages/duckdb-wasm/src/bindings/runtime_browser.ts

The reason that this does not work under webR is because the duckdb R package does not set up such a runtime. It would be tricky, but it's possible that the R package could be extended to check if it is running under the "emscripten" OS, and then if so use webr::eval_js() to execute JavaScript code to similarly start up a BROWSER_RUNTIME in the webR worker thread.

My current understanding is that if this were set up, duckdb would use the bindings for downloading data from HTTP URLs transparently, without having to load HTTPFS extension (which doesn't currently exist for Wasm).

Loading duckdb extensions does not work under webR.

This is a related but distinct issue. For example, in the duckdb Wasm shell, the command INSTALL autocomplete invokes a request for: https://extensions.duckdb.org/v0.10.0/wasm_eh/autocomplete.duckdb_extension.wasm

But at https://webr.r-wasm.org/v0.2.2/ the script,

webr::install("duckdb")
library(duckdb)
cnx <- dbConnect(duckdb())
dbExecute(cnx, "INSTALL autocomplete")

invokes a request for: http://extensions.duckdb.org/v0.8.1/linux_i686_gcc4/autocomplete.duckdb_extension.gz. Not only is this the incorrect URL, but the socket connection is converted into a WebSocket connection by Emscripten and then fails because no infrastructure has been setup to handle that.

The duckdb source file src/main/extension/extension_load.cpp contains a preprocessor switch WASM_LOADABLE_EXTENSIONS. This enables using JS XHR requests to download extensions, avoiding the step where Emscripten converts the request into a WebSocket connection. So, first of all, that needs to be turned on when compiling the duckdb R package for webR.

I made a tweak to src/Makevars[.in] so that a new Make variable WASM_CPPFLAGS is included during the R package build. Then, I tweaked configure so that WASM_CPPFLAGS = -DWASM_LOADABLE_EXTENSIONS is appended to Makevars.duckdb during the configure step, if the OS reports itself as Emscripten (i.e. when we're compiling for Wasm).

This successfully causes the browser to make an XHR request, but even so extension downloading still does not work.


I kept digging and found in duckdb-wasm there is a file duckdb.patch. The patch does some work to exactly this part of the code. This seems a little strange to me, I do not know why the change is maintained in a patchfile here rather than upstreamed into the main duckdb repo. Then again, webR does the same thing with R, so I cannot judge.

I have to leave things now for the moment, but it looks like for extension downloading to work the duckdb R package would have to similarly apply such a patch to the vendored duckdb source, possibly at compile time and only when compiling for Wasm.

Additionally, the incorrect URLs would also need to be taken care of. Possibly setting up the BROWSER_RUNTIME would fix both of these problems, but I cannot be sure.

krlmlr commented 8 months ago

Wow, thank you for the very detailed analysis!

Did you by any chance compile duckdb from the current main branch? Installing extensions here won't work because they are tied to a specific commit of the duckdb core. Can you try with the v0.9.2-1 tag? There will be a version v0.10.0 soon, too.

georgestagg commented 8 months ago

Did you by any chance compile duckdb from the current main branch?

I tried with the current version in the webR package repo (v0.8.1-3, I think) and the latest version on main. The behaviour is the same, except the extension URL is different in main:

https://extensions.duckdb.org/2414840843/linux_i686_gcc4/autocomplete.duckdb_extension.wasm

This is still the wrong system architecture, though, even if the version were correct and the request had succeeded.

krlmlr commented 8 months ago

Thanks. Let's revisit this when duckdb 0.10.0 is on CRAN. The extension version must match the version of the duckdb core, and the R package is currently behind. Also, chances are that this works better out of the box.

krlmlr commented 7 months ago

duckdb 0.10.0 is on CRAN now, with duckdb 0.10.1 scheduled for next week, and much more time before the following release. Any action we should take now?

georgestagg commented 7 months ago

Any action we should take now?

I'm working on releasing webR 0.3.0 very soon, so it is probably best to wait until after duckdb 0.10.1. Once the dust settles from the webR release I'll try to find some time to continue the investigation above.

krlmlr commented 7 months ago

@Antonov548: Can you please try replicating George's findings in the "Loading duckdb extensions does not work under webR" section in https://github.com/duckdb/duckdb-r/issues/66#issuecomment-1959769901 ?

Antonov548 commented 6 months ago

Hello @georgestagg

Thanks for such detailed explanation, it's very usefull. As I understand in the webR all .Call's to C code converted to call's of WebAssembly module.

I was trying to find exact place where it's happening. I just wondering if you have implemented another solution for downloading the packages with C compiled code and execution of it. I guess it might help to resolve the problem with downloading of extensions and execution the logic inside of them.

May you point me to exact place where calls to WebAssembly implemented if there any?

seanbirchall commented 2 months ago

@dicorynia you can probably hack it for now. It's not quarto, but here is an R shinylive workaround that pulls in duckdb wasm and makes it queryable from R... I'm not the owner of the remote file queried in this, but followed a tutorial using it (@runsascoded) thanks for the file and code.

https://github.com/seanbirchall/shinylive-duckdb-wasm