datastax / dsbulk

DataStax Bulk Loader (DSBulk) is an open-source, Apache-licensed, unified tool for loading into and unloading from Apache Cassandra(R), DataStax Astra, and DataStax Enterprise (DSE).
Apache License 2.0

DSBulk 1.11.0 doesn't apply partition level TTL when migrating data with '-ttl true' #493

Closed. paayers closed this issue 5 months ago.

paayers commented 5 months ago

A customer recently hit this issue: they reused a process from a previous version of DSBulk and got different results. When unloading a table with '-ttl true' and then loading it into another cluster with '-ttl true', we weren't seeing a partition-level TTL. Instead the individual columns expired, and the PK could still be selected after the TTL elapsed, so we had to come back and insert with --schema.queryTtl to get rid of the PKs. I'm not sure whether this is expected behavior for DSBulk, so I'll outline the test case below.
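
For context, a TTL/timestamp-preserving unload of roughly the following shape produces a file with the writetime(...)/ttl(...) columns shown below; the command and output path here are illustrative, not the customer's exact invocation:

dsbulk unload -delim '|' -timestamp true -ttl true -k staging -t vehicles_by_name_location -url /home/automaton/kroger/unload_out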

Given the following header and row, it's easy to see what happened:

location_id|name|id|created_at|writetime(created_at)|ttl(created_at)|deleted|writetime(deleted)|ttl(deleted)|end_pick_run_time|writetime(end_pick_run_time)|ttl(end_pick_run_time)|needed_by_datetime|writetime(needed_by_datetime)|ttl(needed_by_datetime)|order_number|writetime(order_number)|ttl(order_number)|pickup_date|writetime(pickup_date)|ttl(pickup_date)|position|writetime(position)|ttl(position)|restricted_item_status|writetime(restricted_item_status)|ttl(restricted_item_status)|start_pick_run_time|writetime(start_pick_run_time)|ttl(start_pick_run_time)|status|writetime(status)|ttl(status)|subtype|writetime(subtype)|ttl(subtype)|type|writetime(type)|ttl(type)|uid|writetime(uid)|ttl(uid)|updated_at|writetime(updated_at)|ttl(updated_at)|version_key|writetime(version_key)|ttl(version_key)
01234567|11111|c111df1c-1c1a-41e1-b110-f111d111111b|2024-05-28T18:03:12.712Z|2024-05-28T18:03:12.716355Z|11111|0|2024-05-28T18:03:12.716355Z|11111|2024-05-28T20:26:20.947Z|2024-05-28T20:26:20.947181Z|11111|2024-05-28T17:00:00Z|2024-05-28T18:03:12.716355Z|11111||||2024-05-28|2024-05-28T18:03:12.716355Z|11111|||||||2024-05-28T20:12:45.986Z|2024-05-28T20:12:45.98613Z|11111|PICKED|2024-05-28T20:26:20.947181Z|11111||||TROLLEY|2024-05-28T18:03:12.716355Z|11111|AB11111|2024-05-28T20:26:20.124392Z|11111|2024-05-28T20:26:20.947Z|2024-05-28T20:26:20.947181Z|11111|1|2024-05-28T20:26:20.947181Z|11111

The DSBulk load is:

dsbulk load -delim '|' --schema.allowMissingFields true --connector.csv.maxCharsPerColumn -1 -cl LOCAL_QUORUM -timestamp true -ttl true -k staging -t vehicles_by_name_location -url /home/automaton/kroger/ibmtest.csv

After flushing, the sstabledump output looks like the following, with cell-level TTLs:

"partition" : {
      "key" : [ "01234567" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 219,
        "clustering" : [ "11111", "c111df1c-1c1a-41e1-b110-f111d111111b" ],
        "liveness_info" : { "tstamp" : "2024-06-14T15:12:35.771483Z" },
        "cells" : [
          { "name" : "created_at", "value" : "2024-05-28 18:03:12.712Z", "tstamp" : "2024-05-28T18:03:12.716355Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "deleted", "value" : false, "tstamp" : "2024-05-28T18:03:12.716355Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "end_pick_run_time", "value" : "2024-05-28 20:26:20.947Z", "tstamp" : "2024-05-28T20:26:20.947181Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "needed_by_datetime", "value" : "2024-05-28 17:00:00.000Z", "tstamp" : "2024-05-28T18:03:12.716355Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "pickup_date", "value" : "2024-05-28", "tstamp" : "2024-05-28T18:03:12.716355Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "start_pick_run_time", "value" : "2024-05-28 20:12:45.986Z", "tstamp" : "2024-05-28T20:12:45.986130Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "status", "value" : "PICKED", "tstamp" : "2024-05-28T20:26:20.947181Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "type", "value" : "TROLLEY", "tstamp" : "2024-05-28T18:03:12.716355Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "uid", "value" : "AB11111", "tstamp" : "2024-05-28T20:26:20.124392Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "updated_at", "value" : "2024-05-28 20:26:20.947Z", "tstamp" : "2024-05-28T20:26:20.947181Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false },
          { "name" : "version_key", "value" : 1, "tstamp" : "2024-05-28T20:26:20.947181Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:17:46Z", "expired" : false }

If I pull the parentheses out of the header and use the following file and load command instead, I get the expected result, where the 'expires_at' info is at the partition level.
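
One way to do that header rewrite is a small first-line-only edit like the following (assuming GNU sed, run against a copy of the unloaded file; it turns 'ttl(created_at)' into 'ttl_created_at' and so on):

sed -i '1s/(/_/g; 1s/)//g' /home/automaton/kroger/mytest.csv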

file:

location_id|name|id|created_at|writetime_created_at|ttl_created_at|deleted|writetime_deleted|ttl_deleted|end_pick_run_time|writetime_end_pick_run_time|ttl_end_pick_run_time|needed_by_datetime|writetime_needed_by_datetime|ttl_needed_by_datetime|order_number|writetime_order_number|ttl_order_number|pickup_date|writetime_pickup_date|ttl_pickup_date|position|writetime_position|ttl_position|restricted_item_status|writetime_restricted_item_status|ttl_restricted_item_status|start_pick_run_time|writetime_start_pick_run_time|ttl_start_pick_run_time|status|writetime_status|ttl_status|subtype|writetime_subtype|ttl_subtype|type|writetime_type|ttl_type|uid|writetime_uid|ttl_uid|updated_at|writetime_updated_at|ttl_updated_at|version_key|writetime_version_key|ttl_version_key
01234567|11111|c111df1c-1c1a-41e1-b110-f111d111111b|2024-05-28T18:03:12.712Z|2024-05-28T18:03:12.716355Z|11111|0|2024-05-28T18:03:12.716355Z|11111|2024-05-28T20:26:20.947Z|2024-05-28T20:26:20.947181Z|11111|2024-05-28T17:00:00Z|2024-05-28T18:03:12.716355Z|11111||||2024-05-28|2024-05-28T18:03:12.716355Z|11111|||||||2024-05-28T20:12:45.986Z|2024-05-28T20:12:45.98613Z|11111|PICKED|2024-05-28T20:26:20.947181Z|11111||||TROLLEY|2024-05-28T18:03:12.716355Z|11111|AB11111|2024-05-28T20:26:20.124392Z|11111|2024-05-28T20:26:20.947Z|2024-05-28T20:26:20.947181Z|11111|1|2024-05-28T20:26:20.947181Z|11111

DSBulk load:

dsbulk load -delim '|' -url /home/automaton/kroger/mytest.csv -header true -query "BEGIN BATCH INSERT INTO staging.vehicles_by_name_location (location_id, name, id, created_at, deleted, end_pick_run_time, needed_by_datetime, order_number, pickup_date, position, restricted_item_status, start_pick_run_time, status, subtype, type, uid, updated_at, version_key) VALUES (:location_id, :name, :id, :created_at, :deleted, :end_pick_run_time, :needed_by_datetime, :order_number, :pickup_date, :position, :restricted_item_status, :start_pick_run_time, :status, :subtype, :type, :uid, :updated_at, :version_key) USING TIMESTAMP :writetime_created_at AND TTL :ttl_created_at; APPLY BATCH;" --batch.mode DISABLED

sstabledump:

"partition" : {
      "key" : [ "01234567" ],
      "position" : 0
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 158,
        "clustering" : [ "11111", "c111df1c-1c1a-41e1-b110-f111d111111b" ],
        "liveness_info" : { "tstamp" : "2024-05-28T18:03:12.716355Z", "ttl" : 11111, "expires_at" : "2024-06-14T18:23:36Z", "expired" : false },
        "cells" : [
          { "name" : "created_at", "value" : "2024-05-28 18:03:12.712Z" },
          { "name" : "deleted", "value" : false },
          { "name" : "end_pick_run_time", "value" : "2024-05-28 20:26:20.947Z" },
          { "name" : "needed_by_datetime", "value" : "2024-05-28 17:00:00.000Z" },
          { "name" : "pickup_date", "value" : "2024-05-28" },
          { "name" : "start_pick_run_time", "value" : "2024-05-28 20:12:45.986Z" },
          { "name" : "status", "value" : "PICKED" },
          { "name" : "type", "value" : "TROLLEY" },
          { "name" : "uid", "value" : "AB11111" },
          { "name" : "updated_at", "value" : "2024-05-28 20:26:20.947Z" },
          { "name" : "version_key", "value" : 1 }

If I run that batch after an initial insert was done with the '-ttl true' flag, then after a compaction we lose the partition-level TTL: after expiry we can still select the PK, and all the other columns show null. I was under the impression that using the -ttl flag when unloading/loading an entire table, or even a single partition, would apply the partition-level TTL.
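
For anyone reproducing this, a check along these lines shows the difference; the key value comes from the example data and I'm assuming location_id is text. With only cell-level TTLs the row still comes back after expiry, with the non-PK columns (and their TTL() values) null, whereas with a row-level TTL the whole row disappears:

SELECT location_id, name, id, status, TTL(status)
FROM staging.vehicles_by_name_location
WHERE location_id = '01234567';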
Please let me know if this is a bug with the latest version or if something changed and this is now expected behavior.

absurdfarce commented 5 months ago

Based on a quick analysis, I believe this behaviour is consistent with the docs for 1.11.0, @paayers. The current docs indicate that -ttl sets the cell-level time-to-live value, which is exactly what you see in your first example. I'll also note that the mapping convention defined there (i.e. something of the form "ttl(colname)") is exactly what is used in the data file supplied to the first example. Finally, I note that this flag is ignored if -query is supplied, which (presumably) is why you aren't seeing cell-level values in your second example.

In the second example you're explicitly using "USING TIMESTAMP ... AND TTL" in the CQL for that query. Apparently those CQL clauses set a row-level TTL rather than a cell-specific value, which makes sense.

I'm unaware of any change in 1.11.0 which would make this happen, but I'll double-check to make sure I didn't miss anything there. That said, if the user is asking whether this behaviour changed between an older version and the current one, it would be useful to know which older version they observed the differing behaviour on.

absurdfarce commented 5 months ago

I think I have a more detailed explanation for you, @paayers, which might help clarify things for your customer.

In your first example above, the customer explicitly specifies writetime and TTL values for at least some columns via fields in their CSV file. Let's take "version_key" as a simple example; note that the CSV header in that first example (what I'm calling the "failing" example) includes this string:

|version_key|writetime(version_key)|ttl(version_key)

These correspond to the per-column writetime and TTL values discussed in the dsbulk settings docs. After some digging it finally became clear to me that, when dsbulk is presented with column-specific writetime or TTL values, it actually modifies the CQL it uses to insert values: it creates a distinct INSERT statement for each column, with that column's writetime and/or TTL supplied via the usual "USING" keyword. For reference, the relevant section of the code is here.
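
To make that concrete, the generated statements have roughly the following shape: one INSERT per regular column, each carrying that column's writetime/TTL in its USING clause. Only two columns are sketched here, and the bind-marker names are illustrative rather than the exact ones dsbulk builds:

INSERT INTO staging.vehicles_by_name_location (location_id, name, id, version_key)
  VALUES (:location_id, :name, :id, :version_key)
  USING TIMESTAMP :writetime_version_key AND TTL :ttl_version_key;

INSERT INTO staging.vehicles_by_name_location (location_id, name, id, status)
  VALUES (:location_id, :name, :id, :status)
  USING TIMESTAMP :writetime_status AND TTL :ttl_status;

The observable result matches the first sstabledump above: each cell carries its own ttl/expires_at, while the single INSERT ... USING TIMESTAMP ... AND TTL in the second example puts the TTL on the row's liveness_info instead.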

A quick look at git history suggests that this code came in with this commit and was first included in the 1.8.0 release. I'm not sure which older version the customer is comparing 1.11.0 to, but if it's older than 1.8.0 then there likely is a change in behaviour from what they expect.

paayers commented 5 months ago

Perfect, thanks for the explanation. We should be able to modify the inserts accordingly to resolve this one.

absurdfarce commented 5 months ago

Sounds good, thanks @paayers! I'm going to close this issue out for now. We can re-open it if new information suggests there's an actual behavioural bug here.