datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra®, DataStax Astra, and DataStax Enterprise (DSE).

Parsing vector data from JSON fails for "floats" with too many digits (aka doubles) #484

Open · hemidactylus opened this issue 1 year ago

hemidactylus commented 1 year ago

When ingesting VECTOR<FLOAT,n> data from a JSON file, dsbulk (v1.11) fails for "floats" that are represented with too many digits: they end up being parsed as doubles, which then seems to cause an unrecoverable error.

Notes:

  1. JSON files produced by dsbulk itself are OK, i.e. their floats are proper floats (a low number of digits).
  2. But for folks loading datasets generated elsewhere (e.g. with Python, which lacks a clear float/double distinction), this limitation might get in the way (see the sketch below).
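
For illustration, a minimal plain-Java sketch (not dsbulk code): a value like 6.646329843 simply carries more precision than a 32-bit float can hold, which is what a strict conversion trips over.

public class FloatPrecision {
    public static void main(String[] args) {
        double d = 6.646329843;      // the kind of value a generic JSON writer (e.g. Python) emits
        float f = (float) d;         // narrowing to 32-bit float discards the extra precision
        System.out.println(f);       // prints a rounded value, not 6.646329843
        System.out.println(d == f);  // false: the number does not survive the float round trip
    }
}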

Minimal reproducible case

create table mini_table (id text primary key, embedding vector<float, 2>);
java -jar dsbulk-1.11.0.jar load -k $KEYSPACE -t mini_table -u "token" -p $TOKEN -b $BUNDLEZIP --dsbulk.connector.json.mode SINGLE_DOCUMENT --connector.json.url GOOD_OR_BAD.json -c json
$> cat good.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.64632,
   4.49715
  ]
 }
]

$> cat bad.json 
[
 {
  "id": "my_row",
  "embedding": [
   6.646329843,
   4.4971533213
  ]
 }
]
absurdfarce commented 3 months ago

I couldn't reproduce this, at least not with JSON inputs.

$ cat ../vector_test_data_json_tooprecise/one.json 
{
    "i":1,
    "j":[6.646329843, 4.4971533213, 58]
}
$ bin/dsbulk load -url "./../vector_test_data_json_tooprecise" -k test -t bar -c json
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/LOAD_20240626-210637-895657
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      1 |     16 |  4.62 |  5.93 |   5.93 |    1.00
Operation LOAD_20240626-210637-895657 completed with 1 errors in less than one second.
$ cat logs/LOAD_20240626-210637-895657/mapping-errors.log 
Resource: file:/work/git/dsbulk/dist_test/vector_test_data_json_tooprecise/one.json
Position: 1
Source: {"i":1,"j":[6.646329843,4.4971533213,58]}
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field j to variable j; conversion from Java type com.fasterxml.jackson.databind.JsonNode to CQL type Vector(FLOAT, 3) failed for raw value: [6.646329843,4.4971533213,58].
        at com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException.encodeFailed(InvalidMappingException.java:90)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindColumn(DefaultRecordMapper.java:182)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindStatement(DefaultRecordMapper.java:158)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.map(DefaultRecordMapper.java:127)
        at java.lang.Thread.run(Thread.java:750) [19 skipped]
Caused by: java.lang.ArithmeticException: Cannot convert 6.646329843 from BigDecimal to Float
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.conversionFailed(CodecUtils.java:610)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.toFloatValueExact(CodecUtils.java:537)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.convertNumber(CodecUtils.java:333)
        at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.narrowNumber(CodecUtils.java:191)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToNumberCodec.narrowNumber(JsonNodeToNumberCodec.java:84)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:78)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:34)
        at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToVectorCodec.lambda$externalToInternal$0(JsonNodeToVectorCodec.java:50)
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
        at java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:901)

This is consistent with the code: the JSON-to-vector codec already leverages dsbulk's converting codecs when reading values from the input, and those codecs already perform overflow checks.
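
Here is a rough sketch of that delegation shape (simplified, not the actual dsbulk source; the class and method names below are stand-ins for the JsonNodeToVectorCodec / CodecUtils.toFloatValueExact calls visible in the trace): the vector codec hands each array element to a per-element codec, so the same exactness check applies to every component.

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class VectorCodecSketch {

    // Stand-in for the element codec's exact narrowing (cf. CodecUtils.toFloatValueExact in the trace).
    static float toFloatValueExact(BigDecimal value) {
        float f = value.floatValue();
        if (new BigDecimal(Float.toString(f)).compareTo(value) != 0) {
            throw new ArithmeticException("Cannot convert " + value + " from BigDecimal to Float");
        }
        return f;
    }

    // Stand-in for the vector codec: delegate every element, so each one gets the same check.
    static List<Float> externalToInternal(List<BigDecimal> jsonArrayElements) {
        List<Float> result = new ArrayList<>();
        for (BigDecimal element : jsonArrayElements) {
            result.add(toFloatValueExact(element));
        }
        return result;
    }

    public static void main(String[] args) {
        // Throws ArithmeticException on the first over-precise element, mirroring the log above.
        externalToInternal(List.of(new BigDecimal("6.646329843"), new BigDecimal("4.4971533213")));
    }
}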

It was a different story on the string side, however. There we were re-using CqlVector.from() to handle strings, which obviously doesn't allow for the insertion of additional (possibly more rigorous) policies. To support something more rigorous, a version of this logic was moved into the dsbulk codecs. This solves the problem, and it also makes more sense logically: dsbulk should be in charge of the formats it's willing to accept rather than relying on CqlVector to define that for it.
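
Purely as an illustration of that point (hypothetical code, not what was actually committed to the dsbulk codecs): once the string parsing lives in a dsbulk codec rather than in CqlVector.from(), the codec itself can decide the policy for over-precise elements, e.g. reject them or round them.

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class StrictVectorStringParser {

    // Parse a vector literal like "[6.64632, 4.49715]" with an explicit precision policy.
    static List<Float> parse(String vectorLiteral, boolean allowRounding) {
        String body = vectorLiteral.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<Float> elements = new ArrayList<>();
        for (String token : body.split(",")) {
            BigDecimal value = new BigDecimal(token.trim());
            float f = value.floatValue();
            boolean exact = new BigDecimal(Float.toString(f)).compareTo(value) == 0;
            if (!exact && !allowRounding) {
                throw new ArithmeticException("Cannot convert " + value + " to Float without loss");
            }
            elements.add(f);
        }
        return elements;
    }

    public static void main(String[] args) {
        System.out.println(parse("[6.64632, 4.49715]", false));         // accepted: survives the float round trip
        System.out.println(parse("[6.646329843, 4.4971533213]", true)); // accepted only because rounding is allowed
    }
}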