datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra®, DataStax Astra, and DataStax Enterprise (DSE).

Small modification to concatenate two fields #455

Open acscott opened 2 years ago

acscott commented 2 years ago

Thank you for a really powerful, fast product. I want to make a small modification for our purposes: concatenating two fields in the CSV output.

Say I wanted to concatenate the first and second fields instead of delimiting them with a comma.

I was thinking we could make this happen around this line: https://github.com/datastax/dsbulk/blob/fb1350127315c3a14b4862017c7435b35e2124a0/connectors/csv/src/main/java/com/datastax/oss/dsbulk/connectors/csv/CSVConnector.java#L362

But it doesn't look like you can modify anything there, only read.

Any hints?


adutra commented 1 year ago

Hi @acscott thanks for reaching out.

So far DSBulk has avoided transforming input data; in other words, it's an ETL tool without the T :-)

There have been requests to introduce the ability to transform data on the fly. However, the code you pointed at would not be the right place to do that: it lives inside a connector, which in this case is responsible solely for reading the input file and emitting records.

The right place to do that would be the core of DSBulk's engine, where we could imagine a transformer function Record -> Record that transforms the contents of each individual record before it is persisted to the database.

The function body could be provided in a scripting language and compiled to Java bytecode on the fly. Most likely we'd need to sandbox the execution context, since it must execute extremely fast and have no side effects such as disk or network I/O. We'd also need to come up with a clean way to initialize any persistent state the function requires.
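To sketch that idea in plain Java (a minimal illustration only: the map-based record model, the field names, and the `ConcatTransformer` class are assumptions, not DSBulk's actual Record API):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

// Minimal sketch of a Record -> Record transformer. A record is
// modeled here as an ordered map of field name -> value; DSBulk's
// real Record interface is different.
public class ConcatTransformer implements UnaryOperator<Map<String, String>> {

  @Override
  public Map<String, String> apply(Map<String, String> record) {
    Map<String, String> out = new LinkedHashMap<>(record);
    // Merge the hypothetical fields "first" and "second" into one
    // field, dropping the originals. Assumes both fields are present.
    String merged = out.remove("first") + out.remove("second");
    out.put("first_second", merged);
    return out;
  }
}
```

The engine would apply such a function to every record emitted by the connector, between reading and writing.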

This would certainly be a nice addition to DSBulk, but I don't think the team has the bandwidth to implement it today, unfortunately.

The general guidance we give our users is to instead modify the input data to match your tables before loading. This is generally easy to achieve with command-line tools such as awk or sed.
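For example, here is a naive awk command that merges the first two fields of a CSV (a sketch assuming no quoted or embedded commas; the file names are placeholders):

```sh
awk -F',' 'BEGIN { OFS = "," }
{
  # Concatenate fields 1 and 2, then re-emit the remaining fields.
  out = $1 $2
  for (i = 3; i <= NF; i++) out = out OFS $i
  print out
}' input.csv > output.csv
```

For CSVs with quoted fields, a CSV-aware tool would be safer than a plain field split.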