carcuevas opened 4 years ago
Also tried with EMR_RELEASE='emr-5.27.0', with the same result.
One more detail: could it be because certain files have special characters in their keys, for example:
logos/files/000/000/007/original/photo%282%29.JPG
I will try to remove those files from the list — maybe that's the cause...
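For context (an assumption, not stated in the thread): S3 inventory reports can be configured to URL-encode object keys, so a key like photo%282%29.JPG in the inventory may actually be photo(2).JPG in the bucket. A minimal sketch of decoding such a key with Python's standard library:

```python
# Sketch (assumption): if the inventory was written with URL-encoded keys,
# the raw key must be decoded before it matches the real object name in S3.
from urllib.parse import unquote

raw_key = "logos/files/000/000/007/original/photo%282%29.JPG"
decoded = unquote(raw_key)  # %28 -> '(' and %29 -> ')'
print(decoded)  # logos/files/000/000/007/original/photo(2).JPG
```

If the keys are fed to a copy step without decoding, the copy would target a non-existent object, which would match the failures described below.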
I think you're right about the special characters. We're looking at this -- thanks for opening the issue! 👍
Thanks for looking at it ;-) BTW, I tried this in copy_objects.py:
```python
def copy_objects(spark, inventory_table, inventory_date, partitions, copy_acls):
    query = """
        SELECT bucket, key
        FROM {}
        WHERE dt = '{}'
        AND key like '%=%%' ESCAPE '='
        AND (replication_status = '""'
        OR replication_status = '"FAILED"')
    """.format(inventory_table, inventory_date)
```
but no luck with the ESCAPE '=' clause. I got the error below, so I guess I am missing something in the syntax...
Query:

```sql
SELECT bucket, key
FROM default.crr_private_files
WHERE dt = '2019-11-06-00-00'
AND key like '%=%%' ESCAPE '='
AND (replication_status = '""'
OR replication_status = '"FAILED"')
```
```
Traceback (most recent call last):
  File "/home/hadoop/copy_objects.py", line 185, in <module>
    copied_objects = copy_objects(spark, inventory_table, inventory_date, partitions, acls)
  File "/home/hadoop/copy_objects.py", line 143, in copy_objects
    crr_failed = spark.sql(query)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
pyspark.sql.utils.ParseException: u'\nmismatched input \'ESCAPE\' expecting <EOF>(line 5, pos 28)\n\n== SQL ==\n\n SELECT bucket, key\n FROM default.crr_private_files\n WHERE dt = \'2019-11-06-00-00\'\n AND key like \'%=%%\' ESCAPE \'=\'\n----------------------------^^^\n AND (replication_status = \'""\'\n OR replication_status = \'"FAILED"\')\n \n'
```
Somehow I cannot make it work. I guess there is some problem with the query parser — maybe it also doesn't like the % symbols — so I cannot exclude the files with bad characters in their names...
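As an aside (not stated in the thread): the first ParseException is expected on EMR 5.27, which ships Spark 2.4 — as far as I know, the LIKE ... ESCAPE clause only arrived in Spark 3.0. A Spark 2.x-compatible way to express the same filter is instr(key, '%') = 0. A minimal pure-Python sketch of the intended predicate, using made-up sample rows rather than the real inventory:

```python
# Sketch: exclude keys containing a literal '%' and keep rows whose replication
# is empty or FAILED -- the same predicate the ESCAPE query tries to express.
# (In Spark SQL this could be written as: WHERE instr(key, '%') = 0 AND ...)
rows = [
    {"bucket": "b", "key": "a/photo%282%29.JPG", "replication_status": '"FAILED"'},
    {"bucket": "b", "key": "a/plain.jpg", "replication_status": '""'},
    {"bucket": "b", "key": "a/done.jpg", "replication_status": '"COMPLETED"'},
]

def needs_copy(row):
    # keys with a literal '%' are skipped; only empty/FAILED statuses are retried
    return "%" not in row["key"] and row["replication_status"] in ('""', '"FAILED"')

selected = [r["key"] for r in rows if needs_copy(r)]
print(selected)  # ['a/plain.jpg']
```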
```sql
SELECT bucket, key
FROM default.crr_private_files
WHERE dt = '2019-11-06-00-00'
AND (replace (key, '%', '@') NOT LIKE '%@%
AND (replication_status = '""'
OR replication_status = '"FAILED"'))
```
```
Traceback (most recent call last):
  File "/home/hadoop/copy_objects.py", line 185, in <module>
    copied_objects = copy_objects(spark, inventory_table, inventory_date, partitions, acls)
  File "/home/hadoop/copy_objects.py", line 143, in copy_objects
    crr_failed = spark.sql(query)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
pyspark.sql.utils.ParseException: u'\nmismatched input \'FROM\' expecting <EOF>(line 3, pos 8)\n\n== SQL ==\n\n SELECT bucket, key\n FROM default.crr_private_files\n--------^^^\n WHERE dt = \'2019-11-06-00-00\'\n AND (replace (key, \'%\', \'@\') NOT LIKE \'%@%\n AND (replication_status = \'""\'\n OR replication_status = \'"FAILED"\'))\n \n'
```
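The second ParseException looks like a simpler problem: the literal '%@% is never closed, and the parentheses fold the status conditions inside the NOT LIKE group. A sketch of the same query with the quote closed and the predicates regrouped (table and column names taken from the thread, not verified):

```python
# Sketch: same replace-based idea as above, but with the string literal
# '%@%' properly terminated and the AND conditions at the same level.
query = """
    SELECT bucket, key
    FROM default.crr_private_files
    WHERE dt = '2019-11-06-00-00'
    AND replace(key, '%', '@') NOT LIKE '%@%'
    AND (replication_status = '""'
    OR replication_status = '"FAILED"')
"""
# Sanity check: every single-quoted literal is closed (even quote count).
balanced = query.count("'") % 2 == 0
print(balanced)  # True
```

One caveat with this replace-based trick: keys that already contain a literal '@' would also be excluded.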
@carcuevas Sorry for the delay. We haven't been able to reproduce this, unfortunately. Can you try with this branch: `head-fix`? We think the object may no longer have existed, i.e., it was in the inventory but got deleted since. The branch catches any errors and writes `CopyInPlace` to `FALSE`. Then you can filter the results to see the objects that failed.
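If the head-fix branch indeed writes a CopyInPlace flag as described, filtering out the failures could look like this sketch (column name from the comment above, sample data invented):

```python
# Sketch: keep only the objects the branch marked as failed (CopyInPlace = FALSE),
# so they can be inspected or retried separately.
results = [
    {"key": "a/ok.jpg", "CopyInPlace": True},
    {"key": "a/photo%282%29.JPG", "CopyInPlace": False},
]
failed = [r["key"] for r in results if not r["CopyInPlace"]]
print(failed)  # ['a/photo%282%29.JPG']
```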
Hi @msambol, tomorrow morning I am going to give it a try :) Thanks very much for trying to solve this! I'll let you know how it goes...
thanks
@carcuevas I beg your pardon, but were you able to test it?
@carcuevas In the end, how did you solve this problem?
Hi,
I tried the method described in: https://aws.amazon.com/blogs/big-data/trigger-cross-region-replication-of-pre-existing-objects-using-amazon-s3-inventory-amazon-emr-and-amazon-athena/
Once the EMR cluster is created and the query is run against the Athena DB, it does not start copying the files, as can be seen below:
After that there are no more logs about it. The config I used in the script is:
The Athena DB and tables are OK (I gave different names here)... but it seems that maybe some additional library needs to be installed?
Thanks very much