jonathandune opened this issue 1 year ago
Looks like this is unsigned, 64-bit. Supporting it would require reading into a Java BigInteger or some third-party type, both unappealing options for a high-performance library. The alternative of reading into a signed long seems just plain incorrect.
That said, I think compromised support is better than no support. @jonathandune do you have a preference for your use case? BigInteger is probably the most natural answer here.
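For concreteness, here's a minimal sketch (plain JDK; the class and variable names are just illustrative) of why a signed long read misrepresents the upper half of the unsigned range, and how BigInteger recovers the intended value:

import java.math.BigInteger;

public class UnsignedReadSketch {
    public static void main(String[] args) {
        // Raw bits of the unsigned value 2^64 - 1, as they would land in a signed long.
        long bits = -1L;

        // Interpreted as signed, the value is wrong; BigInteger recovers it (Java 8+).
        BigInteger unsigned = new BigInteger(Long.toUnsignedString(bits));
        System.out.println(bits);     // -1
        System.out.println(unsigned); // 18446744073709551615
    }
}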
https://guava.dev/releases/20.0/api/docs/com/google/common/primitives/UnsignedLong.html is a third-party option.
@jonathandune Can you point us to the specific example parquet file you're referencing in the description? Thanks!
I'll give some context here in case it helps. Our engine is implemented in Java, which makes it hard for us to support unsigned types in general. It is possible, but it is not something we currently support, and it would require considerable effort. We are not alone here, by the way; this is a common theme for Java-based data engines: https://issues.apache.org/jira/browse/SPARK-34816.
Depending on your use case, a short-term option is to load the data into a different type that supports the full range (e.g., BigInteger or BigDecimal). We may be able to help with that, depending on your actual use case (can you provide an example file?). I also want to mention: if your data uses unsigned as a matter of semantics, but you happen to know from your use case that you are not using the upper range of unsigned (i.e., the unsigned values that won't fit in the signed type of the same bit width), then a much simpler alternative is to read the data as signed and just operate on it as signed. YMMV here depending on the use case.
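To illustrate that check, a rough sketch (the helper and the long[] column here are hypothetical, not engine API): if every value read as signed is non-negative, the signed and unsigned interpretations agree.

public class RangeCheckSketch {
    // 'values' is a column already read as signed 64-bit longs.
    static boolean usesUpperUnsignedRange(long[] values) {
        for (long v : values) {
            // A negative signed value means the top bit is set, i.e. the
            // original unsigned value exceeds Long.MAX_VALUE.
            if (v < 0) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(usesUpperUnsignedRange(new long[] {1L, 42L}));  // false
        System.out.println(usesUpperUnsignedRange(new long[] {1L, -1L})); // true
    }
}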
If you do use the upper range, a different alternative to BigInteger or BigDecimal would be to still load as signed but consistently use the Guava library's UnsignedLong (linked by Ryan in an earlier comment) to wrap all operations (e.g., a column operation adding two values would need to be wrapped through calls to functions in that library) and presentation (e.g., you would need to convert to String for display using that library). This is a lot more heavy lifting and a tax on everything you do with a column of this type, so this path is not advisable unless you really need the upper range and can't live with just the positive range of the signed type.
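As a rough sketch of what that wrapping looks like with Guava (assuming the two values were already loaded as signed longs):

import com.google.common.primitives.UnsignedLong;

public class WrappedMathSketch {
    public static void main(String[] args) {
        // Two column values loaded as signed longs, but semantically unsigned.
        long aBits = 0x8000000000000000L; // 2^63 as unsigned; negative as signed
        long bBits = 5L;

        // Every arithmetic operation has to go through the wrapper...
        UnsignedLong sum = UnsignedLong.fromLongBits(aBits).plus(UnsignedLong.fromLongBits(bBits));

        // ...and so does presentation.
        System.out.println(sum); // 9223372036854775813
    }
}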
Oops, sorry, I forgot to share the URL of the parquet file! https://filebin.net/2fer0fw8fvp8041c
The context of the use case is blockchain data in parquet files. Large unsigned numbers are common in the blockchain space (256-bit integers are the norm for some standard datasets).
Maybe a crazy idea: if those long-bit-width integers (e.g., 256-bit, as in your example above) are used as IDs and you don't actually do math on them, you may be better off loading them in DH as a byte array, or, even more manageable, a hex string (or, even shorter, a base64 string). As a hex or base64 string (with a sane encoding, uniquely defined for this purpose), I think both equality and inequality/range operations would work, which should suffice for ID-like data. :man_shrugging:
To do this in the 64-bit case, you would first load the data as signed, and then use Java code that is careful to look at the column data "as unsigned" to compute a new table with the new column.
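A sketch of that conversion for the 64-bit case (the helper name is just illustrative); the fixed width is what keeps string comparison consistent with unsigned numeric order:

public class HexIdSketch {
    // 'bits' is the column value as read into a signed long.
    static String toUnsignedHex(long bits) {
        // %016x formats the raw two's-complement bits as unsigned hex, zero-padded
        // to 16 digits so lexicographic order matches unsigned numeric order.
        return String.format("%016x", bits);
    }

    public static void main(String[] args) {
        System.out.println(toUnsignedHex(255L)); // 00000000000000ff
        System.out.println(toUnsignedHex(-1L));  // ffffffffffffffff (2^64 - 1)
    }
}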
But what I think is interesting here: since you already have to deal with 256-bit integers, and parquet doesn't support those, you are presumably already handling them in some other way. Whatever you are doing there would potentially make sense for 64-bit integers too, for consistency.
Hi @jonathandune, tagging you here so we don't lose track of the conversation. Do you have a preference on what type the unsigned 64-bit ints should map to? Ryan and Cristian have suggested some approaches above. We can provide further clarification if you need any.
There are cases where arithmetic will be required on those types, so I think going with something standard like BigInteger might make sense? The parquet file I've provided was generated using https://github.com/paradigmxyz/cryo; there are some details on the types used there, and it seems multiple fields would be impacted.
Hi @jonathandune, thanks for the reply. I will check internally and share the next steps with you soon.
I think BigInteger should be the default for this case; ideally, I'd like the ability to configure the desired behavior on a column-by-column basis. I could imagine some users preferring to keep the data as long and, if needed, work with special math like com.google.common.primitives.UnsignedLongs.
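For reference, a small sketch of what that special math looks like on raw signed longs with Guava's UnsignedLongs:

import com.google.common.primitives.UnsignedLongs;

public class UnsignedLongsSketch {
    public static void main(String[] args) {
        long a = -1L; // the bits of 2^64 - 1
        long b = 10L;

        // Unsigned comparison, division, and rendering on plain signed longs.
        System.out.println(UnsignedLongs.compare(a, b) > 0); // true: 2^64 - 1 > 10
        System.out.println(UnsignedLongs.divide(a, b));      // 1844674407370955161
        System.out.println(UnsignedLongs.toString(a));       // 18446744073709551615
    }
}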
Here's a public dataset that demonstrates the same issue:
from deephaven import parquet
from deephaven.experimental import s3
from datetime import timedelta

# Public dataset that reproduces the same unsigned 64-bit read failure.
amazon_reviews = parquet.read(
    "s3://datasets-documentation/amazon_reviews/",
    special_instructions=s3.S3Instructions(
        "eu-west-3",
        anonymous_access=True,
        read_ahead_count=8,
        fragment_size=65536,
        read_timeout=timedelta(seconds=10),
    ),
)
Description
The parquet reader in io.deephaven.parquet.table.ParquetSchemaReader doesn't handle unsigned 64-bit integer types (IntLogicalType with isSigned=false, bitWidth=64).
Steps to reproduce
ERROR r-Scheduler-Serial-1 | .c.ConsoleServiceGrpcImpl | Error running script: java.lang.RuntimeException: Error in Python interpreter:
Type: <class 'deephaven.dherror.DHError'>
Value: failed to read parquet data. : RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
Traceback (most recent call last):
  File "/opt/deephaven/venv/lib/python3.10/site-packages/deephaven/parquet.py", line 121, in read
    return Table(j_table=_JParquetTools.readTable(path))
RuntimeError: io.deephaven.UncheckedDeephavenException: Unable to read column [gas_used]: IntLogicalType, isSigned=false, bitWidth=64 not supported
    at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$2(ParquetSchemaReader.java:274)
    at java.base/java.util.Optional.orElseThrow(Optional.java:403)
    at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:268)
    at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:842)
    at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:551)
    at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:80)
Expected results
I expected the parquet file to be read.
Versions