deephaven / deephaven-core

Bug in BarrageMessageProducer.SnapshotControl.usePreviousValues causes us to always use previous #4586

Closed: niloc132 closed this issue 1 year ago

niloc132 commented 1 year ago
heduler-Concurrent-2 | i.d.s.s.SessionService    | Internal Error '12dfe072-4725-44f9-8dff-422240326a49' io.deephaven.UncheckedDeephavenException: Failed to execute function concurrently despite consistent state
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:1320)
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:1157)
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.constructBackplaneSnapshotInPositionSpace(ConstructSnapshot.java:708)
    at io.deephaven.server.barrage.BarrageMessageProducer.getSnapshot(BarrageMessageProducer.java:2215)
    at io.deephaven.server.barrage.BarrageMessageProducer.updateSubscriptionsSnapshotAndPropagate(BarrageMessageProducer.java:1323)
    at io.deephaven.server.barrage.BarrageMessageProducer$UpdatePropagationJob.run(BarrageMessageProducer.java:999)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at io.deephaven.server.runner.scheduler.SchedulerModule$ThreadFactory.lambda$newThread$0(SchedulerModule.java:78)
    at org.jpy.PyLib.callAndReturnObject(Native Method)
    at org.jpy.PyObject.call(PyObject.java:449)
    at io.deephaven.server.console.python.DebuggingInitializer.lambda$createInitializer$0(DebuggingInitializer.java:30)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.IllegalStateException: fillChunk precondition broken, invalid range. requested=[300, 319] (or subset), available=[0, 279]
    at io.deephaven.engine.table.impl.sources.ring.AbstractRingChunkSource.fillChunk(AbstractRingChunkSource.java:311)
    at io.deephaven.engine.table.impl.sources.ring.RingColumnSource.fillPrevChunk(RingColumnSource.java:190)
    at io.deephaven.engine.table.impl.sources.LongAsTimeSource.fillPrevChunk(LongAsTimeSource.java:100)
    at io.deephaven.engine.table.impl.sources.UnionColumnSource$FillContext.fillPrevChunkAppend(UnionColumnSource.java:385)
    at io.deephaven.engine.table.impl.sources.UnionColumnSource.fillPrevChunkFromMultipleSources(UnionColumnSource.java:534)
    at io.deephaven.engine.table.impl.sources.UnionColumnSource.fillPrevChunk(UnionColumnSource.java:520)
    at io.deephaven.engine.table.impl.sources.RedirectedColumnSource$FillContext.doOrderedFillAndPermute(RedirectedColumnSource.java:782)
    at io.deephaven.engine.table.impl.sources.RedirectedColumnSource.doFillChunk(RedirectedColumnSource.java:525)
    at io.deephaven.engine.table.impl.sources.RedirectedColumnSource.fillPrevChunk(RedirectedColumnSource.java:504)
    at io.deephaven.engine.table.impl.sources.RedirectedColumnSource$FillContext.doOrderedFillAscending(RedirectedColumnSource.java:805)
    at io.deephaven.engine.table.impl.sources.RedirectedColumnSource.doFillChunk(RedirectedColumnSource.java:520)
    at io.deephaven.engine.table.impl.sources.RedirectedColumnSource.fillPrevChunk(RedirectedColumnSource.java:504)
    at io.deephaven.engine.table.impl.sources.UnboxedTimeBackedColumnSource.fillPrevChunk(UnboxedTimeBackedColumnSource.java:98)
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.getSnapshotDataAsChunkList(ConstructSnapshot.java:1622)
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.serializeAllTable(ConstructSnapshot.java:1514)
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.lambda$constructBackplaneSnapshotInPositionSpace$2(ConstructSnapshot.java:703)
    at io.deephaven.engine.table.impl.remote.ConstructSnapshot.callDataSnapshotFunction(ConstructSnapshot.java:1242)
    ... 15 more

I don't have an exact reproducer for this, but here is the rough idea, capturing the structure of a ring table fed from a blink table:

from deephaven.table import Table
from deephaven.table_factory import ring_table, time_table
from typing import Union, List

def blink_tail_by(blink_table: Table, num_rows: int, by: Union[str, List[str]]) -> Table:
    """
    Transform blink_table to ring_table that keeps num_rows by <group>
    Workaround as suggested in https://github.com/deephaven/deephaven-core/issues/4309

    """
    return (
        blink_table.without_attributes(["BlinkTable"])
        .partitioned_agg_by(aggs=[], by=by, preserve_empty=True)
        .transform(lambda t: ring_table(t, num_rows))
        .merge()
    )

upstream = (
    time_table("PT1s", blink_table=True)
    .update([
        "I=nanosToSeconds(epochNanos(Timestamp))",
        "J=I%100",
        "K=(int) (I/100)",
        "Data=`1,2,3,4`.split(`,`)",
    ])
    .ungroup()
)

downstream = blink_tail_by(upstream, 100, "K")

In the actual application, the upstream table is more accurately described as a table publisher (i.e. a blink table) whose contents come from two sorted tables that are merged into one table and then published.
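
For reference, a minimal sketch of that shape using deephaven's table_publisher (the schema and contents here are invented for illustration; the report does not show the real ones):

from deephaven import dtypes, empty_table, merge
from deephaven.stream.table_publisher import table_publisher

# the publisher's output table is a blink table with the declared columns
blink_source, publisher = table_publisher(
    "Upstream", {"K": dtypes.int32, "I": dtypes.long, "Data": dtypes.string}
)

# two sorted tables, merged into one table and published as a single batch
left = empty_table(5).update(["K=(int)i", "I=(long)i", "Data=`left`"]).sort("I")
right = empty_table(5).update(["K=(int)i", "I=(long)(i+5)", "Data=`right`"]).sort("I")
publisher.add(merge([left, right]))

downstream = blink_tail_by(blink_source, 100, "K")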

devinrsmith commented 1 year ago

I was able to reproduce this even without the partition_by. It always stemmed from a viewport against the end of a table combined with creating a sort on that table. After a lot of observation, I noted that every single invalid range with respect to previous was valid with respect to current. This led me to question how usePrev is being set, and ultimately to io.deephaven.server.barrage.BarrageMessageProducer.SnapshotControl#usePreviousValues, where we find the bug.

A recent refactoring of this logic in #3983 introduced the bug: it renamed a variable so that it was no longer shadowed, but didn't update the other references to the new name, so those references silently bound to the wrong variable. This caused usePrevious to always be true for this specific code path. Most of the time this misuse of usePrevious would be OK, since the previous ring was up to date with the current ring; during exceptional times though (when the previous ring != the current ring), it could result in errors like the stack trace above. In particular, viewport row sets at the very end of the ring table are more likely to be invalid with respect to previous; when the user is at the beginning or middle of a table, it is much more likely they will just receive stale data without error.
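
The failure mode of that rename is easy to show in miniature. A hypothetical Python sketch of the same bug class (names invented; the real code is Java, in BarrageMessageProducer.SnapshotControl):

force_use_prev = True  # outer name that the parameter used to shadow

def use_previous_values_before(force_use_prev: bool) -> bool:
    # the parameter shadows the outer name, so this reads the argument
    return force_use_prev

def use_previous_values_after(use_prev: bool) -> bool:
    # the refactor renamed the parameter but missed this reference, which
    # now silently binds to the outer name and is always True
    return force_use_prev

print(use_previous_values_before(False))  # False, as intended
print(use_previous_values_after(False))   # True: usePrev is stuck on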

Ultimately, the web UI sort was just the catalyst that caused a new barrage producer snapshot; it had no direct bearing on the cause of this issue (nor did the specific table operations in the query, save for the strictness check for ring tables that originally brought this to our attention).