duckdb / duckdb-wasm

WebAssembly version of DuckDB
https://shell.duckdb.org
MIT License
1.19k stars 125 forks source link

Getting "IO Error: GDAL Error (4): `<file>.xlsx' not recognized as a supported file format." when calling st_read on registered .xlsx file #1870

Open rpbouman opened 1 week ago

rpbouman commented 1 week ago

What happens?

Error: IO Error: GDAL Error (4):test.xlsx' not recognized as a supported file format.`

To Reproduce

I'm running this page locally:

<!doctype html>
<html>
  <head>
    <title>DuckDB WASM: Reading XSLX</title>
  </head>
  <body>
    <div id="content">
      <div id="initialization">
        <div id="status">Initializing, please wait...</div>
        <ul id="initActions">
        </ul>
      </div>
      <div style="display:none" id="ui">
        <label>Choose a file<label>: <input type="file" id="fileInput" accept=".xlsx, application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
      </div>
    </div>
    <script type="application/javascript">
      function statusUpdate(action) {
        var list = document.getElementById('initActions');
        list.innerHTML += `<li>${action}</li>`;
      }

      function alertError(message, error){
        var message = `${message}: ${error.message}\n\n${error.stack}`;
        console.error(message);
        alert(message);
      }

      function afterInit(){
        document.getElementById('initialization').style.display = 'none';
        document.getElementById('fileInput').addEventListener('change', async function(event){
          var files = event.currentTarget.files;
          if (files.length === 0){ 
            alert('No file selected!');
            return;
          }

          var file = files[0];
          var duckdb = window.duckdb;
          try {
            await duckdb.instance.registerFileHandle(
              file.name, 
              file, 
              duckdb.api.DuckDBDataProtocol.BROWSER_FILEREADER, 
              true
            );
            alert(`File registered!`);
          }
          catch (e){
            alertError('Error registering file', e);
            return;
          }

          try {
            var sql = `SELECT * FROM st_read('${file.name}')`;
            var result = await duckdb.connection.query(sql);
            debugger;
          }
          catch(e){
            alertError('Error executing query on file', e);
            return;
          }
        });
        document.getElementById('ui').style.display = 'block';
      }
    </script>
    <script type="module">
      import * as duckdb from 'https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.1-dev287.0/+esm';

      statusUpdate('module loaded');

      const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
      const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);
      statusUpdate('jsdelivr bundle selected');

      const worker_url = URL.createObjectURL(
        new Blob([`importScripts("${bundle.mainWorker}");`], {type: 'text/javascript'})
      );
      const worker = new Worker(worker_url);
      const logger = new duckdb.ConsoleLogger();
      const db = new duckdb.AsyncDuckDB(logger, worker);
      URL.revokeObjectURL(worker_url);

      await db.instantiate(bundle.mainModule, bundle.pthreadWorker);
      statusUpdate('DuckDB instance created');

      const connection = await db.connect();
      statusUpdate('DuckDB connection created');

      var versionResult = await connection.query(`SELECT version() as versioninfo`);
      var versionInfo = versionResult.get(0)['versioninfo'];
      statusUpdate(`DuckDB ${versionInfo} initialized.`);
      console.log(`DuckDB version: ${versionInfo}`) ;

      // expose duckdb as global
      window.duckdb = {
        api: duckdb,
        instance: db,
        connection: connection
      };

      statusUpdate('Installing spatial extension');
      await connection.query('INSTALL SPATIAL');
      statusUpdate('Loading spatial extension');
      await connection.query('LOAD SPATIAL');
      afterInit();
    </script>

  </body>
</html>

Sample xlsx file is attached. test.xlsx

Browser/Environment:

Chrome 129.0.6668.71, MS Edge 129.0.2792.65

Device:

Windows 11 Desktop

DuckDB-Wasm Version:

using https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.1-dev287.0/+esm

DuckDB-Wasm Deployment:

https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm@1.28.1-dev287.0/+esm

Full Name:

Roland Bouman

Affiliation:

EPAM Systems BV Netherlands

rpbouman commented 1 week ago

I should have included: I can open the file fine using duckdb cli:

DuckDb> SELECT * FROM st_read( 'C:\roland\projects\duckdb-wasm-snippets\3. reading xlsx\test.xlsx' );
┌────────────┬──────────┐
│   Fruit    │ Calories │
│  varchar   │  int32   │
├────────────┼──────────┤
│ Apple      │       52 │
│ Banana     │       89 │
│ Cantaloupe │       39 │
└────────────┴──────────┘
Run Time (s): real 0.045 user 0.000000 sys 0.015625
changes:   0   total_changes: 0
DuckDb>
carlopi commented 1 week ago

Question, can the file be opened when queried from an http endpoint?

Something like:

LOAD spatial;
SELECT * FROM st_read( 'http://127.0.0.1:8081/<filename>' )

while having a local http-server?

Question is basically whether this problem is with reading the format at all in Wasm (possibly due to different config in DuckDB-Wasm spatial) or this is a filesystem mismatch within duckdb-wasm. Thanks

carlopi commented 1 week ago

Actually this seems to work, at least on the shell, from a remote endpoint: https://shell.duckdb.org/#queries=v0,LOAD-spatial~,SUMMARIZE-FROM-st_read('https%3A%2F%2Fraw.githubusercontent.com%2Fduckdb%2Fduckdb%20rs%2Fmain%2Fcrates%2Fduckdb%2Fexamples%2FMovies_Social_metadata.xlsx')~ (it's another file, but should likely be independent)

rpbouman commented 1 week ago

Hi @carlopi - I only just noticed you had two replies.

I can confirm that I can read it with DuckDB WASM when I call st_read explicitly and pass in the http url.

Interestingly, the example with the http url currently fails in the CLI (no WASM) with DuckDB 1.1.0 GA and the nightly build v1.1.2-dev44:

image

carlopi commented 1 week ago

I think in the CLI explicitly loading httpfs is needed:

LOAD https;
LOAD spatial;
SUMMARIZE FROM st_read('https://raw.githubusercontent.com/duckdb/duckdb-rs/main/crates/duckdb/examples/Movies_Social_metadata.xlsx');

then works on my end.

carlopi commented 1 week ago

And @rpbouman, on your previous comment: I was trying to test (with very low effort) whether this was supported at all by spatial in wasm / find where things go wrong, and it appears it's connected to the File-based filesystem. I will need to dig there, log system calls, and find where stuff start to differ (and why / how to fix it).

rpbouman commented 1 week ago

I think in the CLI explicitly loading httpfs is needed: ... then works on my end.

Ah! Ok, I thought that was enabled by default. After installing it explicitly, it works for me too. THanks for clearing that up!