datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra(R), DataStax Astra and DataStax Enterprise (DSE)
Apache License 2.0

dsbulk compat with vector type #474

Closed. absurdfarce closed this issue 1 year ago

absurdfarce commented 1 year ago

JAVA-3060 introduced support for the vector type to the Java driver. It'd be nice if we could extend this support to dsbulk as well.

Observed errors when trying to load vector data with the current impl + driver upgrade to get CqlVector support:

Source: 1,"[8, 2.3, 58]"\u000a
com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException: Could not map field j to variable j; conversion from Java type java.lang.String to CQL type CqlVector(FLOAT, 3) failed for raw value: [8, 2.3, 58].
        at com.datastax.oss.dsbulk.workflow.commons.schema.InvalidMappingException.encodeFailed(InvalidMappingException.java:90)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindColumn(DefaultRecordMapper.java:182)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindStatement(DefaultRecordMapper.java:158)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.map(DefaultRecordMapper.java:127)
        at java.lang.Thread.run(Thread.java:750) [19 skipped]
Caused by: java.lang.IllegalArgumentException: A CQL blob string must start with "0x"
        at com.datastax.oss.protocol.internal.util.Bytes.fromHexString(Bytes.java:123)
        at com.datastax.oss.driver.internal.core.type.codec.CustomCodec.parse(CustomCodec.java:83)
        at com.datastax.oss.driver.internal.core.type.codec.CustomCodec.parse(CustomCodec.java:29)
        at com.datastax.oss.dsbulk.codecs.text.string.StringToUnknownTypeCodec.externalToInternal(StringToUnknownTypeCodec.java:32)
        at com.datastax.oss.dsbulk.codecs.text.string.StringToUnknownTypeCodec.externalToInternal(StringToUnknownTypeCodec.java:21)
        at com.datastax.oss.dsbulk.codecs.api.ConvertingCodec.encode(ConvertingCodec.java:70)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindColumn(DefaultRecordMapper.java:180)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.bindStatement(DefaultRecordMapper.java:158)
        at com.datastax.oss.dsbulk.workflow.commons.schema.DefaultRecordMapper.map(DefaultRecordMapper.java:127)
        at java.lang.Thread.run(Thread.java:750) [19 skipped]

There may be other issues after we get this one resolved but the underlying issue in the above (a failure to figure out a codec to use for the CqlVector type) isn't super-surprising given the impl.
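To make the gap concrete: the trace shows the raw text falling through to StringToUnknownTypeCodec, which hands it to a parser that expects a "0x"-prefixed blob string. What's missing is a conversion from bracketed text to a fixed-size list of floats. Below is a minimal sketch of that conversion in plain Java. The class name `VectorTextSketch` and its `parse` method are hypothetical, made up for illustration; this is not dsbulk's or the driver's code, and it assumes a simple comma-separated `[a, b, c]` text format like the one in the failing record.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of dsbulk or the Java driver.
final class VectorTextSketch {

  private VectorTextSketch() {}

  /**
   * Parses text such as "[8, 2.3, 58]" into a list of floats and checks that
   * the number of elements matches the expected vector dimension.
   */
  static List<Float> parse(String raw, int dimensions) {
    if (raw == null) {
      throw new IllegalArgumentException("Vector text must not be null");
    }
    String text = raw.trim();
    if (text.length() < 2
        || text.charAt(0) != '['
        || text.charAt(text.length() - 1) != ']') {
      throw new IllegalArgumentException("Vector text must be enclosed in [ ]: " + raw);
    }
    String body = text.substring(1, text.length() - 1).trim();
    List<Float> values = new ArrayList<>();
    if (!body.isEmpty()) {
      // Limit -1 keeps trailing empty elements so "[1, 2,]" is rejected below.
      for (String element : body.split(",", -1)) {
        // Throws NumberFormatException (an IllegalArgumentException) on bad input.
        values.add(Float.parseFloat(element.trim()));
      }
    }
    if (values.size() != dimensions) {
      throw new IllegalArgumentException(
          "Expected " + dimensions + " dimensions but got " + values.size() + ": " + raw);
    }
    return Collections.unmodifiableList(values);
  }
}
```

With this sketch, `VectorTextSketch.parse("[8, 2.3, 58]", 3)` yields the three floats from the failing record, while a two-element input for the same dimension is rejected with an IllegalArgumentException. A real fix would still need to plug a conversion like this into dsbulk's codec lookup for the vector type, which is the part the trace suggests is missing.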


annieden commented 1 year ago

@absurdfarce @msmygit Are you looking to resolve this dsbulk compatibility issue to coincide with the GA release of Vector Search on July 17? And is this fix something that needs to be documented in the customer-facing dsbulk documentation?

msmygit commented 1 year ago

@annieden actually this is a new feature addition to DSBulk to support Vector data types.

Having said that, we definitely should document the new capability (i.e. support for vector data types) in the release notes. Maybe we could add some examples to the loading examples page, or update the blog post to include vector data type examples. I'd leave it up to you to decide where. In my view, this doesn't have to wait until next month.

absurdfarce commented 1 year ago

Pretty much what @msmygit said. The new feature should definitely be mentioned in release notes and/or a changelog, that kind of thing. We're not introducing any new flags or anything so there shouldn't be much in the way of new syntax to learn... with that in mind I'm not sure how much we need to add to the existing dsbulk documentation.

annieden commented 1 year ago

Thanks to both of you for your answers!