CSV update fails on schema-defined table but succeeds on creation

finos / perspective

A data visualization and analytics component, especially well-suited for large and/or streaming datasets.

https://perspective.finos.org/

Apache License 2.0

7.72k stars 1.05k forks source link

CSV update fails on schema-defined table but succeeds on creation #2524

Closed kryaksy closed 5 months ago

kryaksy commented 5 months ago

Bug Report

When updating a table created with a schema using a CSV file, an error is thrown, but the same error is not thrown when creating the table with the same CSV.

Steps to Reproduce:

const csv = `field\n2147483648`;

const tableFromCsv = await worker.table(csv);
// No error

const tableFromSchema = await worker.table({ field: "integer" });
await tableFromSchema.update(csv);
// Update failed: Error: Abort(): Invalid: In CSV column #0: Row #2: CSV conversion error to int32: invalid value '2147483648'

Expected Result:

Consistent behavior between updating a schema-defined table and creating a table with the same CSV.

Actual Result:

While one of them succeeds, the other one throws error.

Environment:

perspective@2.7.2

timkpaine commented 5 months ago

Your first schema isn't going to match your second schema, the number you picked exceeds int32

kryaksy commented 5 months ago

Schemas are the same:

const csv = `field\n2147483648`;
const tableFromCsv = await worker.table(csv);
console.log(await tableFromCsv.schema());
// {field: 'integer'}

As you see, even the value exceeds int32, schema response is integer.

timkpaine commented 5 months ago

You're in JS and int is int32. CSV is not JavaScript, so it can use e.g. int64.

timkpaine commented 5 months ago

I don't see any issue here, csv or arrow or python will allow for large integers whereas JS does not. So constructing something from a csv or arrow or python vs from JS will have minor differences as a result of the language and format differences.

ref: https://github.com/finos/perspective/issues/1346

texodus commented 4 months ago

Perspective currently only officially supports 32bit signed integer for "integer" type columns, because JavaScript only had native support for this type (at least, when Perspective was originally published). If you want numbers larger than that, you'll need to use "float" typed column (64bit float).

While I'd prefer this to be consistent behavior between platforms, and Perspective may soon support BigInt types as they have wide platform support today, it won't change the fact that "integer" type will still be 32bit. Ergo in your case, the fix is not to accept this value in JavaScript, but rather to reject it in Python.

kryaksy commented 4 months ago

@texodus Thanks for the explanation. Based on the what you said, it would be correct to expect this field to be recognized as a 'float' when I create a table with a csv containing a 64-bit value, right?

// Csv containing 64-bit value
const csv = `field\n2147483648`;

const tableFromCsv = await worker.table(csv);
console.log(await tableFromCsv.schema());
// Expected {field: 'float'}
// Received {field: 'integer'}

This is the issue from my perspective. Am I missing something?

timkpaine commented 4 months ago

No, you're mixing CSV and javascript. If you're ingesting as csv, the internal type will be int64. But this doesn't exist in javascript, so when you query FROM JAVASCRIPT it gets truncated to int32 (both types represented in schema as "integer").

If you predefine your schema as float and then ingest as csv, then from javascript the stuff will come out as float.

texodus commented 4 months ago

Perspective's inference support is a best-guess meant for user convenience. We implemented it in a way that has minimal/negligible impact on performance, which would not be the case if we validated every cell (and in general it is not possible to always infer user intent for ambiguous columns).

I'm not against improving this support to handle more cases, but inference support is not meant to be complete. In this case (and others inevitably), you will need to provide a schema.

kryaksy commented 4 months ago

I clearly grasp the limitations and considerations you've detailed. Thank you both for the thorough explanation. As referenced in this discussion, we're developing a feature necessitating CSV validation by Perspective. Following that, I implemented your advice to use the table creation phase for validation. Yet, based on these discussions, it appears an additional validation step outside of Perspective might be required.