apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

[C++] Store decimal values as strings instead of floats in the JSON output of `orc-contents` #1866

Closed hdorio closed 6 months ago

hdorio commented 6 months ago

Currently, the JSON output generated by the orc-contents command line utility writes decimal values as JSON numbers, which most JSON parsers read as double-precision floating-point values. This can lead to precision loss and inaccuracies, especially when dealing with financial data.

echo "1.1299999999999991" > test.csv

csv-import "struct<amount:decimal(38,18)>" test.csv test.orc
orc-contents test.orc > test.json

cat test.json | jq .amount
node -e "console.log(JSON.parse(fs.readFileSync('test.json', 'utf8'))['amount']);"
ruby -e "require 'json'; puts JSON.parse(File.read('test.json'))['amount'];"

test.csv: 1.1299999999999991
test.json (test.orc as JSON): {"amount": 1.129999999999999100}

# outputs
[2024-03-30 15:55:24] Start importing Orc file...
[2024-03-30 15:55:24] Finish importing Orc file.
[2024-03-30 15:55:24] Total writer elasped time: 0.000281s.
[2024-03-30 15:55:24] Total writer CPU time: 0.000277s.
1.129999999999999 # Jq
1.129999999999999 # NodeJS
1.129999999999999 # Ruby

Note that the trailing 1 is truncated; the correct output should be 1.1299999999999991.

Would it be acceptable to modify Decimal128ColumnPrinter (and Decimal64ColumnPrinter) to return a string, for example {"amount": "1.129999999999999100"}?
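For illustration, here is a minimal Python sketch of the same effect, using only the standard library. It is not part of ORC; the two JSON literals simply mirror the current number form and the proposed string form. It also shows that a consumer can keep the digits of the number form by asking the parser for `Decimal`, which is an option in Python but not in every JSON library.

```python
import json
from decimal import Decimal

# Current orc-contents style: the decimal is written as a JSON number.
number_json = '{"amount": 1.129999999999999100}'
# Proposed style: the decimal is written as a JSON string.
string_json = '{"amount": "1.129999999999999100"}'

exact = Decimal("1.129999999999999100")

# Default parsing turns the JSON number into a binary double.
as_float = json.loads(number_json)["amount"]
# The string form reaches the consumer with every digit intact.
as_string = json.loads(string_json)["amount"]
# Python can also parse JSON numbers straight into Decimal.
as_decimal = json.loads(number_json, parse_float=Decimal)["amount"]

print(Decimal(as_float) == exact)   # False: the double is only an approximation
print(Decimal(as_string) == exact)  # True
print(as_decimal == exact)          # True
```

The string form does not depend on parser configuration, which is why it is the safer default for consumers such as jq, NodeJS, or Ruby.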

wgtmac commented 6 months ago

Thanks for reporting the issue! Yes, that sounds reasonable to me. Would you like to work on it?

hdorio commented 6 months ago

Thanks for the offer! However, since I'm not familiar with C++, I think it would be best if someone with that expertise takes care of it.

ffacs commented 6 months ago

Let me add an option for the column printer to print decimals as strings.
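As a rough sketch of what printing a decimal as a string involves, the Python snippet below formats an unscaled integer plus a scale without ever going through a float. This is not the ORC C++ code; `format_decimal` is a hypothetical helper, and the assumption that the value is held as an unscaled integer with a non-negative scale is made for illustration only.

```python
from decimal import Decimal


def format_decimal(unscaled: int, scale: int) -> str:
    """Render unscaled * 10**-scale as text using integer digits only."""
    if scale < 0:
        raise ValueError("this sketch assumes a non-negative scale")
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    if scale == 0:
        return sign + digits
    # Left-pad so at least one digit remains before the decimal point.
    digits = digits.rjust(scale + 1, "0")
    return f"{sign}{digits[:-scale]}.{digits[-scale:]}"


# Value from the report, rescaled to the 18 fractional digits of decimal(38,18).
unscaled = int(Decimal("1.1299999999999991").scaleb(18))
print(format_decimal(unscaled, 18))

print(format_decimal(-5, 2))   # -0.05
print(format_decimal(100, 2))  # 1.00
```

Emitting the result inside double quotes in the JSON output gives the string form requested above.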

dongjoon-hyun commented 6 months ago

May I ask why we don't use orc-tools (Java tool) instead?

$ orc-tools version
ORC 2.0.0

$ orc-tools --help
ORC Java Tools

usage: java -jar orc-tools-*.jar [--help] [--define X=Y] <command> <args>

Commands:
   convert - convert CSV and JSON files to ORC
   count - recursively find *.orc and print the number of rows
   data - print the data from the ORC file
   json-schema - scan JSON files to determine their schema
   key - print information about the keys
   meta - print the metadata about the ORC file
   scan - scan the ORC file
   sizes - list size on disk of each column
   version - print the version of this ORC tool

To get more help, provide -h to the command

$ echo "1.1299999999999991" > test.csv

$ orc-tools convert --schema "struct<amount:decimal(38,18)>" test.csv -o test.orc

$ orc-tools data test.orc
[main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Processing data file test.orc [length: 300]
{"amount":"1.1299999999999991"}
________________________________________________________________________________________________________________________

dongjoon-hyun commented 6 months ago

IIUC, this is resolved completely via the following, isn't it?

dongjoon-hyun commented 6 months ago

I marked this as 2.1.0 and closed. Feel free to reopen this if we need to do more.