hemidactylus opened this issue 1 year ago
I couldn't reproduce this, at least not with JSON inputs.
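For completeness, the target table in the run below was just a toy schema along these lines (reconstructed here to match the `-k test -t bar` arguments; it was not part of the original report):

```cql
-- Reconstructed toy schema matching the -k test -t bar arguments below.
CREATE TABLE test.bar (
    i int PRIMARY KEY,
    j vector<float, 3>
);
```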
```
$ cat ../vector_test_data_json_tooprecise/one.json
{
"i":1,
"j":[6.646329843, 4.4971533213, 58]
}
$ bin/dsbulk load -url "./../vector_test_data_json_tooprecise" -k test -t bar -c json
Operation directory: /work/git/dsbulk/dist_test/dsbulk-1.11.0/logs/LOAD_20240626-210637-895657
At least 1 record does not match the provided schema.mapping or schema.query. Please check that the connector configuration and the schema configuration are correct.
total | failed | rows/s | p50ms | p99ms | p999ms | batches
    3 |      1 |     16 |  4.62 |  5.93 |   5.93 |    1.00
Operation LOAD_20240626-210637-895657 completed with 1 errors in less than one second.
$ cat logs/LOAD_20240626-210637-895657/mapping-errors.log
Resource: file:/work/git/dsbulk/dist_test/vector_test_data_json_tooprecise/one.json
Position: 1
Source: {"i":1,"j":[6.646329843,4.4971533213,58]}
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field j to variable j; conversion from Java type com.fasterxml.jackson.databind.JsonNode to CQL type Vector(FLOAT, 3) failed for raw value: [6.646329843,4.4971533213,58].
    at com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException.encodeFailed(InvalidMappingException.java:90)
    at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindColumn(DefaultRecordMapper.java:182)
    at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindStatement(DefaultRecordMapper.java:158)
    at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.map(DefaultRecordMapper.java:127)
    at java.lang.Thread.run(Thread.java:750) [19 skipped]
Caused by: java.lang.ArithmeticException: Cannot convert 6.646329843 from BigDecimal to Float
    at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.conversionFailed(CodecUtils.java:610)
    at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.toFloatValueExact(CodecUtils.java:537)
    at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.convertNumber(CodecUtils.java:333)
    at com.datastax.oss.dsbulk.codecs.api.util.CodecUtils.narrowNumber(CodecUtils.java:191)
    at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToNumberCodec.narrowNumber(JsonNodeToNumberCodec.java:84)
    at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:78)
    at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToFloatCodec.externalToInternal(JsonNodeToFloatCodec.java:34)
    at com.datastax.oss.dsbulk.codecs.text.json.JsonNodeToVectorCodec.lambda$externalToInternal$0(JsonNodeToVectorCodec.java:50)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.ArrayList$Itr.forEachRemaining(ArrayList.java:901)
```
This is consistent with the code: the JSON-to-vector codec already delegates to dsbulk's converting codecs when reading elements from the input, and those codecs already perform overflow checks.
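For illustration, the check behind the `ArithmeticException` above boils down to an exact-narrowing test along these lines (a sketch only, not the actual `CodecUtils.toFloatValueExact` code):

```java
import java.math.BigDecimal;

public class ExactNarrowingSketch {
    // Reject narrowings that lose information: narrow to float, then compare
    // the round-tripped value against the original BigDecimal.
    static float toFloatValueExact(BigDecimal value) {
        float narrowed = value.floatValue();
        if (Float.isInfinite(narrowed)
                || new BigDecimal(String.valueOf(narrowed)).compareTo(value) != 0) {
            throw new ArithmeticException(
                    "Cannot convert " + value + " from BigDecimal to Float");
        }
        return narrowed;
    }

    public static void main(String[] args) {
        System.out.println(toFloatValueExact(new BigDecimal("58")));          // ok: 58.0
        System.out.println(toFloatValueExact(new BigDecimal("6.646329843"))); // throws
    }
}
```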
It was a different story on the string side, however. There we were re-using CqlVector.from() to handle strings, which obviously doesn't allow for the insertion of additional (possibly more rigorous) policies. To support something more rigorous, a version of this logic was moved into the dsbulk codecs. This solves the problem, but it also makes more sense logically; dsbulk should be in charge of the formats it's willing to accept rather than relying on CqlVector to define that for it.
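Sketching the idea with invented names (this is not the actual dsbulk code): tokenize the bracketed list ourselves and push every element through a converting codec, so the strict narrowing policy applies to string inputs too.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class VectorParseSketch {
    // Hypothetical sketch: split "[a, b, c]" into tokens and convert each one
    // via a pluggable element codec, instead of delegating to CqlVector.from().
    static <T> List<T> parseVector(String raw, Function<String, T> elementCodec) {
        String body = raw.trim();
        if (body.startsWith("[") && body.endsWith("]")) {
            body = body.substring(1, body.length() - 1);
        }
        List<T> out = new ArrayList<>();
        if (body.isEmpty()) {
            return out;
        }
        for (String token : body.split(",")) {
            out.add(elementCodec.apply(token.trim())); // may throw ArithmeticException
        }
        return out;
    }

    public static void main(String[] args) {
        // In practice the element codec would be the exact-narrowing converter
        // sketched above; here a plain narrowing keeps the example minimal.
        System.out.println(parseVector("[1.5, 2.25, 58]",
                s -> new BigDecimal(s).floatValue()));
    }
}
```

With the element conversion pluggable like this, string inputs get the same exact-narrowing treatment as the JSON path above.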
When ingesting `VECTOR<FLOAT,n>` data from JSON, dsbulk (v1.11) fails for "floats" that are represented with too many digits: they end up being parsed as doubles, which then seems to cause unrecoverable problems.

Notes:
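One way to see why the digit count matters: JSON has no float type, so parsers like Jackson (used by dsbulk's JSON connector) hand back fractional numbers as doubles (or BigDecimal, as in the trace above), and dsbulk must then narrow them. A quick sketch:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class JsonNumberDemo {
    public static void main(String[] args) throws Exception {
        // A vanilla ObjectMapper parses fractional JSON numbers as doubles;
        // narrowing to float afterwards drops the trailing digits.
        JsonNode node = new ObjectMapper().readTree("[6.646329843, 4.4971533213, 58]");
        System.out.println(node.get(0).isDouble());   // true
        System.out.println(node.get(0).floatValue()); // narrowed; precision lost
    }
}
```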
Minimal reproducible case