finos / perspective

A data visualization and analytics component, especially well-suited for large and/or streaming datasets.
https://perspective.finos.org/
Apache License 2.0

JSON serialization error when updating hosted table with NaN values #1985

Closed 0x26res closed 2 years ago

0x26res commented 2 years ago

Bug Report

Steps to Reproduce:

Run this self-contained Python script and go to http://localhost:8081

import logging
import threading

import tornado.ioloop
import tornado.web
from perspective import PerspectiveManager, PerspectiveTornadoHandler
import perspective
import pyarrow as pa
import numpy as np

INDEX = """
<!DOCTYPE html>
<html>
    <head>
        <meta
            name="viewport"
            content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no"
        />

        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer@1.6.5/dist/umd/perspective-viewer.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer-datagrid@1.6.5/dist/umd/perspective-viewer-datagrid.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer-d3fc@1.6.5/dist/umd/perspective-viewer-d3fc.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective@1.6.5/dist/umd/perspective.js"></script>

        <link
            rel="stylesheet"
            crossorigin="anonymous"
            href="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer@1.6.5/dist/umd/themes.css"

        />

        <style>
            perspective-viewer {
                position: absolute;
                top: 0;
                left: 0;
                right: 0;
                bottom: 0;
            }
        </style>
    </head>

    <body>
        <perspective-viewer id="viewer"></perspective-viewer>

        <script>
            window.addEventListener("DOMContentLoaded", async function () {
                const viewer = document.getElementById("viewer");
                const websocket = perspective.websocket(
                    "ws://localhost:8081/websocket"
                );
                const table = websocket.open_table("table_name");
                viewer.load(table);
            });
        </script>
    </body>
</html>
"""

class MainHandler(tornado.web.RequestHandler):
    _tables = None
    _default_table = None

    async def get(self, path: str) -> None:
        await self.finish(INDEX)

def table_to_bytes(table: pa.Table) -> bytes:
    with pa.BufferOutputStream() as sink:
        with pa.ipc.new_stream(sink, table.schema) as writer:
            for batch in table.to_batches():
                writer.write_batch(batch)
        return sink.getvalue().to_pybytes()

def perspective_thread(manager, table: perspective.Table, updater):
    psp_loop = tornado.ioloop.IOLoop()
    manager.set_loop_callback(psp_loop.add_callback)
    manager.host_table("table_name", table)
    callback = tornado.ioloop.PeriodicCallback(callback=updater, callback_time=1000)
    callback.start()
    psp_loop.start()

def bug_here():
    arrow_table = pa.table(
        [
            pa.array([1, 3], pa.float64())
        ],
        names=["column1"]
    )
    perspective_table = perspective.Table(table_to_bytes(arrow_table))
    manager = PerspectiveManager()

    def updater():
        update = pa.table(
            [
                pa.array([np.nan], pa.float64())
            ],
            names=["column1"]
        )
        perspective_table.update(table_to_bytes(update))

    thread = threading.Thread(target=perspective_thread, args=(manager, perspective_table, updater), daemon=True)
    thread.start()

    app = tornado.web.Application(
        [
            (
                r"/websocket",
                PerspectiveTornadoHandler,
                {"manager": manager, "check_origin": True},
            ),
            (
                r"/(.*)",
                MainHandler
            )
        ]
    )
    app.listen(8081)
    loop = tornado.ioloop.IOLoop.current()
    loop.start()

if __name__ == "__main__":
    logging.info("Hosting in http://localhost:8081")
    bug_here()

For context, this is mainly borrowed from https://github.com/finos/perspective/blob/master/examples/python-tornado-streaming/index.html

Expected Result:

The table should update and append a row with a NaN value on every cycle.

Actual Result:

No update happens, and I see this error in Python:

WARNING:root:JSON serialization error: Cannot serialize `NaN`, `Infinity` or `-Infinity` to JSON.

And this error in the web browser console:

Uncaught (in promise) JSON serialization error: Cannot serialize `NaN`, `Infinity` or `-Infinity` to JSON.

Note: the error only happens when trying to display the table in the browser.

Environment:

Additional Context:

My understanding is that the data is sent to the browser using Arrow IPC, which should be able to pass NaN values. I don't understand why JSON comes into the picture, or at which level.

The workaround I've found so far is to replace NaN doubles with missing values in Arrow, but it's hard to systematise.

texodus commented 2 years ago
const websocket = perspective.websocket(
    "ws://localhost:8081/websocket"
);
const table = websocket.open_table("table_name");
viewer.load(table);

This code tells the viewer frontend component to use the Table hosted in Python directly, i.e. without instantiating the engine on the client side at all. Without a client-side engine, nothing in the browser can read Arrow data. The UI fetches data as JSON/JavaScript objects, e.g. when you scroll the viewport and new rows need to be rendered. When the engine lives in Python, those rows must additionally be serialized as stringified JSON across the WebSocket, and Python cannot serialize NaN to valid JSON without a special (IIRC global?) handler.

You can avoid this and get Arrow encoding on the wire by passing the virtual server-side table to a client-side table constructor, which decodes the data using the client-side engine, as in the "Data Binding" section of the docs:
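
The failure can be reproduced with the standard library alone. A minimal sketch, assuming the server encodes each message as strict JSON (the equivalent of `json.dumps(..., allow_nan=False)`). The `message` payload and the `to_strict_json` helper are hypothetical and do not reflect Perspective's actual wire format:

```python
import json
import math

# Hypothetical payload standing in for a chunk of rows sent to the viewer.
message = {"id": 1, "data": {"column1": [1.0, 3.0, math.nan]}}


def to_strict_json(payload):
    """Serialize to standards-compliant JSON, or return None if it holds NaN/Infinity."""
    try:
        return json.dumps(payload, allow_nan=False)
    except ValueError:
        # Strict JSON has no literal for NaN, Infinity or -Infinity.
        return None


strict = to_strict_json(message)

# Python's default is lenient and emits the non-standard token NaN,
# which a strict parser such as the browser's JSON.parse rejects.
lenient = json.dumps(message)
```

With null in place of NaN, the same payload serializes without error, which is why replacing NaN with missing values sidesteps the problem.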

const websocket = perspective.websocket(
    "ws://localhost:8081/websocket"
);
const worker = perspective.worker();
const server_table = await websocket.open_table("table_name");

// Get a view with no params
const server_view = await server_table.view();

// Construct a table on the client side that replicates this view - it will 
// read from the server with `to_arrow()`
const table = worker.table(server_view);

// Load client
viewer.load(table);

This could be made more developer friendly for sure, and there may be a way to use arrow via a separate wasm decoder without the engine itself in the future (wasm doesn't make dynamic module loading painless atm).

However - while this should "work", Perspective in general does not handle NaN "correctly". We made a decision early on to try to replace these with None/null in the host language, so while the above will not crash Python, it may not return what you expect if you're explicitly calculating NaN results.

0x26res commented 2 years ago

@texodus first of all thanks for this very detailed answer, it's very helpful.

I gave it a try, and it solved the issue.

But with this setup I've noticed a small change in behaviour: the table in the UI doesn't take into account the index set on the server, so whenever a record updates it gets appended instead. Does this mean that with this setup I need to specify the index column in the frontend as well?

One small difference is that I had to specify the index in the frontend.

Here's an updated reproducible example (with an index):

import logging
import threading

import tornado.ioloop
import tornado.web
from perspective import PerspectiveManager, PerspectiveTornadoHandler
import perspective
import pyarrow as pa
import numpy as np

INDEX = """
<!DOCTYPE html>
<html>
    <head>
        <meta
            name="viewport"
            content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no"
        />

        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer@1.6.5/dist/umd/perspective-viewer.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer-datagrid@1.6.5/dist/umd/perspective-viewer-datagrid.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer-d3fc@1.6.5/dist/umd/perspective-viewer-d3fc.js"></script>
        <script src="https://cdn.jsdelivr.net/npm/@finos/perspective@1.6.5/dist/umd/perspective.js"></script>

        <link
            rel="stylesheet"
            crossorigin="anonymous"
            href="https://cdn.jsdelivr.net/npm/@finos/perspective-viewer@1.6.5/dist/umd/themes.css"

        />

        <style>
            perspective-viewer {
                position: absolute;
                top: 0;
                left: 0;
                right: 0;
                bottom: 0;
            }
        </style>
    </head>

    <body>
        <perspective-viewer id="viewer"></perspective-viewer>

        <script>
            window.addEventListener("DOMContentLoaded", async function () {
                const viewer = document.getElementById("viewer");
                const websocket = perspective.websocket(
                    "ws://localhost:8081/websocket"
                );
                const worker = perspective.worker();
                const server_table = await websocket.open_table("table_name");

                // Get a view with no params
                const server_view = await server_table.view();

                // Construct a table on the client side that replicates this view - it will
                // read from the server with `to_arrow()`
                const table = worker.table(server_view, { index: "key" });

                // Load client
                viewer.load(table);
            });
        </script>
    </body>
</html>
"""

class MainHandler(tornado.web.RequestHandler):
    _tables = None
    _default_table = None

    async def get(self, path: str) -> None:
        await self.finish(INDEX)

def table_to_bytes(table: pa.Table) -> bytes:
    with pa.BufferOutputStream() as sink:
        with pa.ipc.new_stream(sink, table.schema) as writer:
            for batch in table.to_batches():
                writer.write_batch(batch)
        return sink.getvalue().to_pybytes()

def perspective_thread(manager, table: perspective.Table, updater):
    psp_loop = tornado.ioloop.IOLoop()
    manager.set_loop_callback(psp_loop.add_callback)
    manager.host_table("table_name", table)
    callback = tornado.ioloop.PeriodicCallback(callback=updater, callback_time=1000)
    callback.start()
    psp_loop.start()

def bug_here():
    arrow_table = pa.table(
        [
            pa.array(["a", "b"], pa.string()),
            pa.array([1, 3], pa.float64())
        ],
        names=["key", "value"]
    )
    perspective_table = perspective.Table(table_to_bytes(arrow_table), index="key")
    manager = PerspectiveManager()

    def updater():
        update = pa.table(
            [
                pa.array(["c"], pa.string()),
                pa.array([np.nan], pa.float64())
            ],
            names=["key", "value"]
        )
        perspective_table.update(table_to_bytes(update))

    thread = threading.Thread(target=perspective_thread, args=(manager, perspective_table, updater), daemon=True)
    thread.start()

    app = tornado.web.Application(
        [
            (
                r"/websocket",
                PerspectiveTornadoHandler,
                {"manager": manager, "check_origin": True},
            ),
            (
                r"/(.*)",
                MainHandler
            )
        ]
    )
    app.listen(8081)
    loop = tornado.ioloop.IOLoop.current()
    loop.start()

if __name__ == "__main__":
    logging.info("Hosting in http://localhost:8081")
    bug_here()

PS1: there's a small typo in your answer, const table = worker.table(view); should be const table = worker.table(server_view);

PS2: it's probably worth mentioning in this doc https://perspective.finos.org/docs/server#javascript-client-1 that in this mode the transport layer is JSON instead of Arrow.

PS3: I'll probably stick to your suggestion of replacing NaN with missing values in Arrow.

texodus commented 2 years ago

You need to supply the index to the client-side table as well, something like this:

const table = worker.table(view, {index: "My Column Index"});

This should probably be inherited automatically, but there are scenarios where you want to set these differently on client and server.
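
The difference between an indexed and an unindexed table can be illustrated with a toy model in plain Python. This is not Perspective's implementation, and `apply_update` is a hypothetical helper. It only shows why the client-side table appended rows until it was given the same index as the server:

```python
def apply_update(rows, update, index=None):
    """Toy model of a table update: upsert by `index` if given, else append."""
    if index is None:
        # No index: every incoming row becomes a new row.
        return rows + update
    # With an index: rows sharing a key are merged, new keys are added.
    merged = {row[index]: row for row in rows}
    for row in update:
        merged[row[index]] = {**merged.get(row[index], {}), **row}
    return list(merged.values())


server_rows = [{"key": "a", "value": 1.0}, {"key": "b", "value": 3.0}]
update = [{"key": "b", "value": 4.0}]

unindexed = apply_update(server_rows, update)             # "b" now appears twice
indexed = apply_update(server_rows, update, index="key")  # "b" is replaced in place
```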