[CQLReplicator on Glue] cassandra to AWS keyspaces replication discovery tile error - Column 'xxxxxxx' does not exist

frozensky commented 9 months ago

Describe the bug: Rerunning cqlreplicator to continue replication fails in the discovery tile, which reports that a column does not exist.

To Reproduce Steps to reproduce the behavior:

Run command './cqlreplicator --state run --tiles 40 --writetime-column modificationtime --landing-zone s3://cqlrep-prd-1 --region us-west-1 --src-keyspace quark --src-table personalcontentslists --trg-keyspace quark --trg-table personalcontentslists --override-rows-per-worker 2000000 --inc-traffic'

Table schema:


cqlsh> desc table quark.personalcontentslists;

CREATE TABLE quark.personalcontentslists (
    accountid text,
    uxrowid text,
    elementcount int,
    modelid text,
    modificationtime timestamp,
    sortedcontents text,
    sortedcontents_bucket_1 text,
    sortedcontents_bucket_2 text,
    sortedcontents_bucket_3 text,
    sortedcontents_bucket_4 text,
    sortedcontents_bucket_5 text,
    PRIMARY KEY (accountid, uxrowid)
) WITH CLUSTERING ORDER BY (uxrowid ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND nodesync = {'enabled': 'true'}
    AND speculative_retry = '99PERCENTILE';



See error:
Error Category: QUERY_ERROR; Failed Line Number: 712; Spark Error Class: MISSING_COLUMN; Exception in User Class: org.apache.spark.sql.AnalysisException : Column 'accountid' does not exist. Did you mean one of the following? [];

2024-02-08 05:00:34,913 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): InvocationTargetException java.lang.reflect.InvocationTargetException
--
2024-02-08 05:00:34,914 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(98)): Exception in User Class
org.apache.spark.sql.AnalysisException: Column 'accountid' does not exist. Did you mean one of the following? [];
'RepartitionByExpression ['accountid, 'uxrowid], 38
+- LogicalRDD false

    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:54) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$7(CheckAnalysis.scala:199) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$7$adapted(CheckAnalysis.scala:192) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:399) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$6(CheckAnalysis.scala:192) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$6$adapted(CheckAnalysis.scala:192) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]
    at scala.collection.immutable.Stream.foreach(Stream.scala:533) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:192) ~[spark-catalyst_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1]

**Screenshots**
attach screenshots for review

![Screenshot 2024-02-07 at 11 34 55 PM](https://github.com/aws-samples/cql-replicator/assets/4239718/02f024c9-de7b-428b-9e79-886f9ad0c55b)

![Screenshot 2024-02-07 at 11 39 39 PM](https://github.com/aws-samples/cql-replicator/assets/4239718/7a5edd3b-0880-4e0d-9fdb-09d821bd71b8)

![Screenshot 2024-02-07 at 11 39 05 PM](https://github.com/aws-samples/cql-replicator/assets/4239718/5e8b8098-0c1c-41be-88ff-95944c50ede4)

![Screenshot 2024-02-07 at 11 39 14 PM](https://github.com/aws-samples/cql-replicator/assets/4239718/d90db907-3367-4c00-bd63-34375142fdeb)

Ref: https://us-west-1.console.aws.amazon.com/cloudwatch/home?region=us-west-1#logsV2:log-groups/log-group/$252Faws-glue$252Fjobs$252Ferror/logevents/jr_3e43a86ccc3e462006abcc3c2cee51d09f0cac9702e544f7a7013ddc65d91256_attempt_2

nwheeler81 commented 9 months ago

@frozensky I was not able to reproduce it with the following command:

./cqlreplicator --state run --tiles 2 --landing-zone s3://cql-replicator-1234567890-us-east-1 --writetime-column modificationtime --region us-east-1 --src-keyspace quark1 --src-table personalcontentslists --trg-keyspace quark2 --trg-table personalcontentslists

Source: quark1.personalcontentslists
Target: quark2.personalcontentslists

Options to validate: sparkSession.read may be reading an empty S3 prefix, which would leave the resulting schema without the expected columns.
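One way to check this hypothesis is to compare the DataFrame's schema against the primary-key columns before repartitioning. The empty suggestion list (`[]`) in the error is consistent with a DataFrame that has no columns at all. The following is a minimal sketch, not CQLReplicator's actual code: the object name `EmptySchemaCheck`, the helper `missingKeys`, and the use of `spark.emptyDataFrame` as a stand-in for the problematic read are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object; not part of CQLReplicator.
object EmptySchemaCheck {

  // Returns the key columns that are absent from the DataFrame's schema.
  def missingKeys(df: DataFrame, keys: Seq[String]): Seq[String] =
    keys.filterNot(k => df.columns.contains(k))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-schema-check")
      .getOrCreate()

    // Assumption: stand-in for a read that yielded no rows and no columns.
    val df = spark.emptyDataFrame

    // Primary-key columns of quark.personalcontentslists.
    val keys = Seq("accountid", "uxrowid")

    val missing = missingKeys(df, keys)
    if (missing.nonEmpty) {
      // Report the real problem instead of letting repartition fail in analysis.
      println(s"Key columns missing from schema: ${missing.mkString(", ")}")
    } else {
      // Same shape as the failing plan: repartition by the key columns.
      val repartitioned = df.repartition(38, keys.map(k => col(k)): _*)
      println(s"Partitions: ${repartitioned.rdd.getNumPartitions}")
    }

    spark.stop()
  }
}
```

A check like this turns the analysis-time failure into a message that names the missing columns, which makes it easier to tell an empty read apart from a genuine schema mismatch.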

Options to try:

- Disable vectorized reading: sparkSession.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
- Use --safe-mode-disabled together with --state run to enable the MEMORY_AND_DISK_SER caching strategy
- Set inferSchema to false in keysDiscoveryProcess
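For anyone trying these options outside the Glue job, the sketch below shows roughly what they correspond to in plain Spark. It is an illustration under stated assumptions, not the code in keysDiscoveryProcess: the thread does not show how that function reads its data, so supplying an explicit schema via `read.schema(...)` is a swapped-in way of avoiding schema inference, Parquet is assumed only because the first option mentions the Parquet reader, and the empty local temp directory is a hypothetical stand-in for an empty S3 prefix.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.storage.StorageLevel

// Hypothetical demo object; not part of CQLReplicator.
object DiscoveryReadOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("discovery-read-options")
      .getOrCreate()

    // First option: disable the vectorized Parquet reader.
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    // Explicit schema for the primary-key columns, so nothing is inferred.
    val keySchema = StructType(Seq(
      StructField("accountid", StringType),
      StructField("uxrowid", StringType)
    ))

    // Assumption: an empty local directory standing in for an empty S3 prefix.
    val emptyPrefix = Files.createTempDirectory("empty-prefix").toString

    // Read with the explicit schema and cache with the storage level
    // named in the second option.
    val keys = spark.read
      .schema(keySchema)
      .parquet(emptyPrefix)
      .persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Inspect which columns the DataFrame actually carries.
    println(keys.columns.mkString(", "))

    keys.unpersist()
    spark.stop()
  }
}
```

The intent of the explicit schema is that the key columns are declared up front rather than derived from whatever files happen to be present under the prefix.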

nwheeler81 commented 9 months ago

The issue: the primary keys on S3 were in an inconsistent state.

aws-samples / cql-replicator

[CQLReplicator on Glue] cassandra to AWS keyspaces replication discovery tile error - Column 'xxxxxxx' does not exist #112