deephaven / deephaven-core

Deephaven Community Core

Int with bitWidth=64 not supported in parquet reader? #4788

Open jonathandune opened 1 year ago

jonathandune commented 1 year ago

Description

The parquet reader in io.deephaven.parquet.table.ParquetSchemaReader doesn't seem to handle certain integer types, in this case an unsigned 64-bit integer (IntLogicalType with isSigned=false, bitWidth=64).

Steps to reproduce

  1. Download the example parquet file
  2. Execute source = read('/tmp/file.parquet')
  3. Logs will show:

ERROR r-Scheduler-Serial-1 | .c.ConsoleServiceGrpcImpl | Error running script: java.lang.RuntimeException: Error in Python interpreter:
Type: <class 'deephaven.dherror.DHError'>
Value: failed to read parquet data. : RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
Traceback (most recent call last):
  File "/opt/deephaven/venv/lib/python3.10/site-packages/deephaven/parquet.py", line 121, in read
    return Table(j_table=_JParquetTools.readTable(path))
RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
    at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$2(ParquetSchemaReader.java:274)
    at java.base/java.util.Optional.orElseThrow(Optional.java:403)
    at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:268)
    at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:842)
    at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:551)
    at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:80)

Expected results

I expected the parquet file to be read successfully.

Versions

rcaudy commented 1 year ago

Looks like this is unsigned, 64-bit. Supporting it would require reading into a Java BigInteger or some third-party type, both unappealing options for a high performance library. The alternative of reading into a signed long seems just plain incorrect.

rcaudy commented 1 year ago

That said, I think compromised support is better than no support. @jonathandune do you have a preference for your use case? BigInteger is probably the most natural answer here.
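
For illustration, a minimal plain-Java sketch (class and variable names made up, not Deephaven code) of how the same raw bits come out under the two readings:

import java.math.BigInteger;

public class UnsignedBitsExample {
    public static void main(String[] args) {
        // Raw bits of the unsigned value 2^64 - 1, as the file would hand them to us
        long rawBits = 0xFFFFFFFFFFFFFFFFL;

        // Interpreted as a signed long, the value is plain wrong
        System.out.println(rawBits); // -1

        // Reconstructing the true unsigned value as a BigInteger
        BigInteger unsigned = new BigInteger(Long.toUnsignedString(rawBits));
        System.out.println(unsigned); // 18446744073709551615
    }
}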

rcaudy commented 1 year ago

https://guava.dev/releases/20.0/api/docs/com/google/common/primitives/UnsignedLong.html is a 3rd party option.

jcferretti commented 1 year ago

@jonathandune Can you point us explicitly to the specific example parquet file you seem to be referencing in the description? Thanks!

jcferretti commented 1 year ago

I'll give some context here in case it helps. Our engine is implemented in Java, which makes it hard for us to support unsigned types in general. It is possible, but it is not something we currently support, and it would require considerable effort. We are not alone here btw; eg, this is a common theme for Java-based data engines: https://issues.apache.org/jira/browse/SPARK-34816.

Depending on your use case, a short-term option is to load the data into a different type that supports the full range (eg, BigInteger or BigDecimal). We may be able to help with that, depending on your actual use case (can you provide an example file?). Also, if your data uses unsigned as a matter of semantics but you know from your use case that you never hit the upper range of unsigned (ie, the unsigned values that won't fit in the signed type of the same bit width), then a much simpler alternative is to read the data as signed and just operate on it as signed. YMMV here depending on use case.
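
If you are unsure whether the upper range is actually in use, a rough sanity check after reading the column as signed could look like this (a plain-Java sketch; the class and method names are made up):

public class UnsignedRangeCheck {
    // True if any value would be misread when treated as signed, i.e. its real unsigned
    // value is >= 2^63; such values show up as negative in the signed interpretation.
    public static boolean usesUpperUnsignedRange(long[] columnValues) {
        for (long v : columnValues) {
            if (v < 0) {
                return true;
            }
        }
        return false;
    }
}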

If you do use the upper range, a different alternative to BigInteger or BigDecimal would be to still load the data as signed but consistently use the Guava library's UnsignedLong (linked by Ryan in an earlier comment) to wrap all operations (eg, a column operation adding two values would need to go through calls to functions in that library) and all presentation (eg, you would need to convert to String using that library). This is a lot more "heavy lifting" and a tax on everything you do with a column of this type, so this path is not advisable unless you really need the upper range and can't live with just the positive range of the signed type.
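
As a sketch of what that wrapping looks like (plain Java on top of values already read as signed longs; the class name and values are made up, not a Deephaven API):

import com.google.common.primitives.UnsignedLong;

public class GuavaUnsignedLongExample {
    public static void main(String[] args) {
        // Two column values read as signed longs; fromLongBits reinterprets the raw bits as unsigned
        UnsignedLong a = UnsignedLong.fromLongBits(0xFFFFFFFFFFFFFFFEL); // 2^64 - 2
        UnsignedLong b = UnsignedLong.fromLongBits(1L);

        // Arithmetic has to go through the wrapper...
        UnsignedLong sum = a.plus(b); // 2^64 - 1

        // ...and so does presentation
        System.out.println(sum); // prints 18446744073709551615
    }
}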

jonathandune commented 1 year ago

Oops, sorry, I forgot to share the URL of the parquet file!! https://filebin.net/2fer0fw8fvp8041c

jonathandune commented 1 year ago

The context of the use case is blockchain data in parquet files. Large unsigned numbers are common in the blockchain space (256-bit integers are the norm for some standard datasets).

jcferretti commented 1 year ago

Maybe a crazy idea: if those wide integers (eg, 256-bit as in your example above) are used as IDs and you don't actually do math on them, you may be better off loading them in DH as a byte array, or, even more manageable, a hex string (or, even shorter, a base64 string). I think as a (sane, uniquely defined for this purpose) hex or base64 string, both equality and inequality/range operations would work, which should suffice for ID-like data. :man_shrugging:

To do this for the 64-bit case, you would first load the data as signed, and then use Java code that is careful to treat the column data "as unsigned" to compute a new table with the new column.
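
For the 64-bit case, a rough sketch of that conversion (plain Java; the class and method names are made up): fixed-width, lower-case hex keeps both equality and lexicographic range comparisons consistent with the unsigned numeric order.

public class UnsignedHexId {
    // Zero-padded 16-digit hex of the raw bits; %x formats a long's bits as unsigned,
    // so the resulting strings sort in the same order as the unsigned values.
    public static String toHexId(long rawBits) {
        return String.format("%016x", rawBits);
    }

    public static void main(String[] args) {
        System.out.println(toHexId(255L));                // 00000000000000ff
        System.out.println(toHexId(0xFFFFFFFFFFFFFFFFL)); // ffffffffffffffff
    }
}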

But what I think is interesting here is: since you already have to deal with 256-bit integers, and parquet doesn't support those natively, maybe you are already handling them in some other way. Whatever you are doing there would potentially make sense for 64-bit integers too, for consistency.

malhotrashivam commented 1 year ago

Hi @jonathandune, tagging you here so we don't lose track of the conversation. Do you have a preference on what type the unsigned 64-bit ints should map to? Ryan and Cristian have suggested some approaches above. We can provide further clarification if you need any.

jonathandune commented 1 year ago

There are cases where arithmetic will be required on those types, so I think going with something standard like BigInteger might make sense? The parquet file I've provided was generated using https://github.com/paradigmxyz/cryo; there are some details on the types used there, and it seems multiple fields would be impacted.

malhotrashivam commented 1 year ago

Hi @jonathandune, thanks for the reply. I will check internally and share the next steps with you soon.

devinrsmith commented 4 months ago

I think BigInteger should be the default for this case; ideally, I'd like the ability to configure the desired behavior on a column-by-column basis. I could imagine some users preferring to keep the values as long and, if needed, work with special math like com.google.common.primitives.UnsignedLongs.
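
For that keep-as-long option, a sketch of what the special math looks like (plain Java; the values are made up; the JDK also has equivalents such as Long.compareUnsigned and Long.toUnsignedString):

import com.google.common.primitives.UnsignedLongs;

public class UnsignedLongsExample {
    public static void main(String[] args) {
        long a = 0xFFFFFFFFFFFFFFFFL; // really 2^64 - 1, stored in a plain long
        long b = 2L;

        // Plain operators would treat a as -1; the unsigned-aware helpers do not
        System.out.println(UnsignedLongs.compare(a, b)); // positive: a > b as unsigned
        System.out.println(UnsignedLongs.divide(a, b));  // 9223372036854775807
        System.out.println(UnsignedLongs.toString(a));   // 18446744073709551615
    }
}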

devinrsmith commented 4 months ago

Here's a public dataset that demonstrates the same issue:

from deephaven import parquet
from deephaven.experimental import s3

from datetime import timedelta

amazon_reviews = parquet.read(
    "s3://datasets-documentation/amazon_reviews/",
    special_instructions=s3.S3Instructions(
        "eu-west-3",
        anonymous_access=True,
        read_ahead_count=8,
        fragment_size=65536,
        read_timeout=timedelta(seconds=10),
    ),
)