datastax / dsbulk-migrator


Migrator can "resurrect" a deleted row if the row was exported with a `null` value. #12

Open sync-by-unito[bot] opened 1 year ago

sync-by-unito[bot] commented 1 year ago

This may relate to #23, but I encountered a case where the migrator resurrected a deleted row.

The customer has already deleted the exported CSVs, so the following is a hypothesis, but I was able to confirm the behavior:

The migrator command-line options used:

java -Xmx2048m -jar /migration/migrator/schema-migrator-1.0.0-SNAPSHOT-embedded-driver.jar migrate-live \
     --export-host=<host> \
     --export-username=<u> --export-password=<p> \
     --import-username=<u> --import-password=<p> \
     --dsbulk-cmd=/migration/dsbulk-1.7.0/bin/dsbulk \
     --dsbulk-log-dir=/opt/migration/logs \
     --data-dir=/opt/migration/out \
     --export-dsbulk-option \
     --connector.csv.maxCharsPerColumn=-1 \
     --export-dsbulk-option \
     --executor.maxPerSecond=2500 \
     --export-dsbulk-option \
     --schema.splits=12 \
     --export-dsbulk-option \
     --driver.basic.request.page-size=10 \
     --import-dsbulk-option \
     --connector.csv.maxCharsPerColumn=-1 \
     --import-bundle <bundle> \
     --keyspaces ks

Steps to reproduce

  1. Insert a row with a null value:
INSERT INTO ks.test (
   pk,
   cl,
   col1,
   col2
) VALUES (
  4e985514-bfb1-36ab-bec3-dbcb7e12fd53,
  'SGD',
  94.96,
  null
);

(The table was created with: CREATE TABLE ks.test (pk uuid, cl text, col1 decimal, col2 text, PRIMARY KEY (pk, cl));)

  2. Use the migrator-generated dsbulk command to export the table; you will get the following line in the CSV:
4e985514-bfb1-36ab-bec3-dbcb7e12fd53,SGD,94.96,2021-04-06T03:55:49.839432Z,,,,,

The trailing empty fields indicate that the col2 column is null and therefore has neither a writetime nor a TTL.

  3. With this row captured in the CSV, delete the row you just inserted:
DELETE FROM test WHERE pk=4e985514-bfb1-36ab-bec3-dbcb7e12fd53 AND cl='SGD';
  4. SELECT * FROM test should not return the row, because we deleted it.
  5. Import the CSV with the migrator-generated dsbulk command.
  6. Run SELECT * FROM test again; the row reappears with all regular columns null, i.e. only the primary key columns have values.

I think the reason this happens is that col2 is null, and the migrator imports with INSERT INTO ks.test (pk, cl, col2) VALUES (:pk, :cl, :col2) USING TIMESTAMP :col2_writetime AND TTL :col2_ttl;. The col2 column will still be inserted, but because there is no col2_writetime value in the CSV, the write ends up with a timestamp higher than the DELETE's, which brings the row back.
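To make the timestamp interaction concrete, here is a minimal sketch of the sequence with hypothetical timestamps T0 < T1 < T2 (it assumes that an empty :col2_writetime means the server ends up assigning the current timestamp, as discussed in the comments below):

-- T0: original write (writetime of the row is T0)
INSERT INTO ks.test (pk, cl, col1, col2)
VALUES (4e985514-bfb1-36ab-bec3-dbcb7e12fd53, 'SGD', 94.96, null);

-- T1 > T0: the delete creates a tombstone with timestamp T1
DELETE FROM ks.test WHERE pk=4e985514-bfb1-36ab-bec3-dbcb7e12fd53 AND cl='SGD';

-- T2 > T1: the import runs the generated INSERT (which carries
-- USING TIMESTAMP :col2_writetime AND TTL :col2_ttl), but the CSV
-- value for col2_writetime is empty, so the write is applied at T2
-- instead of T0. Because T2 > T1, the insert outlives the tombstone
-- and SELECT * FROM ks.test shows the primary key again with col2 = null.
INSERT INTO ks.test (pk, cl, col2)
VALUES (4e985514-bfb1-36ab-bec3-dbcb7e12fd53, 'SGD', null);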

┆Issue is synchronized with this Jira Task by Unito ┆Components: Schema Migrator ┆Priority: Major

sync-by-unito[bot] commented 1 year ago

➤ Alice Lottini commented:

As per Biyou's suggestion, these options can be used to avoid importing tombstones:

./bin/dsbulk load -h <host> -header false -nullStrings "NULL" -cl EACH_QUORUM -delim "," -url "<csv>" -k <keyspace> -t <table>

--connector.csv.nullValue ""
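For reference, a minimal sketch of how these options could be forwarded through the migrator itself, assuming the same --import-dsbulk-option pass-through used in the command at the top of this issue (placeholders and values are illustrative only):

java -Xmx2048m -jar /migration/migrator/schema-migrator-1.0.0-SNAPSHOT-embedded-driver.jar migrate-live \
     --export-host=<host> \
     --export-username=<u> --export-password=<p> \
     --import-username=<u> --import-password=<p> \
     --dsbulk-cmd=/migration/dsbulk-1.7.0/bin/dsbulk \
     --import-dsbulk-option \
     --codec.nullStrings="NULL" \
     --import-dsbulk-option \
     --connector.csv.nullValue="" \
     --import-bundle <bundle> \
     --keyspaces ks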

sync-by-unito[bot] commented 1 year ago

➤ Yuki Morishita commented:

I tested with the nullStrings and nullValue options, but it didn't work. I think the problem is that the writetime for the null value is empty in the CSV, so when loading, the timestamp is either not set or the current timestamp is used, and the null value is inserted with a more recent timestamp than the DELETE.

sync-by-unito[bot] commented 1 year ago

➤ Wei Deng commented:

Yuki Morishita is this still an issue?

sync-by-unito[bot] commented 1 year ago

➤ Yuki Morishita commented:

Wei Deng I believe so; this is a fundamental problem with importing deleted records from CSV.

It can happen even in dsbulk 1.8+ with the preserve timestamp option.

> I think the reason this happens is that col2 is null, and the migrator imports with INSERT INTO ks.test (pk, cl, col2) VALUES (:pk, :cl, :col2) USING TIMESTAMP :col2_writetime AND TTL :col2_ttl;. The col2 column will still be inserted, but because there is no col2_writetime value in the CSV, the write ends up with a timestamp higher than the DELETE's, which brings the row back.