awslabs / amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Apache License 2.0

Unable to retrieve object: only valid on seekable files #401

Open abhra-gupta-trakstar opened 7 months ago

abhra-gupta-trakstar commented 7 months ago

Hi @matteofigus, since the recent upgrade to v0.66 our Deletion Jobs have been failing with a FORGET_PARTIALLY_FAILED error. Looking into the logs, the ObjectUpdateFailed error is "Unable to retrieve object: only valid on seekable files".

Do you have any leads on what could cause this error? We are using the fix in backend/ecs_tasks/delete_files/parquet_handler.py as mentioned here. Any advice?

matteofigus commented 7 months ago

Hi, this seems to be related to a corrupted Parquet file. Have you managed to trace which S3 object specifically is failing, and tried opening it to verify that it's OK?
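One cheap way to do a first check on the suspect object is sketched below. It is only a structural sanity check, not what the solution itself runs: it relies on the Parquet layout of a leading `PAR1` marker, and a 4-byte little-endian footer length followed by a trailing `PAR1` marker at the end of the file. The function name `looks_like_parquet` is hypothetical, and the sketch assumes an unencrypted file that has already been downloaded locally or loaded into memory.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"  # marker at both ends of an unencrypted Parquet file


def looks_like_parquet(stream):
    """Return (ok, reason) from a cheap structural check of a binary stream.

    Hypothetical helper: it inspects only the magic bytes and the footer
    length. It does not decode the metadata or any row group, so a file
    that passes can still be unreadable.
    """
    # Reading the footer requires jumping to the end of the stream.
    if not stream.seekable():
        return False, "stream is not seekable"
    size = stream.seek(0, io.SEEK_END)
    # Minimum layout: 4-byte magic + 4-byte footer length + 4-byte magic.
    if size < 12:
        return False, "too small to be a Parquet file"
    stream.seek(0)
    if stream.read(4) != PARQUET_MAGIC:
        return False, "missing leading PAR1 magic"
    stream.seek(-8, io.SEEK_END)
    tail = stream.read(8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    if tail[4:] != PARQUET_MAGIC:
        return False, "missing trailing PAR1 magic"
    # The footer must fit between the two magic markers.
    if footer_len == 0 or footer_len > size - 12:
        return False, "footer length is inconsistent with the file size"
    return True, "ok"
```

For a downloaded copy of the object, something like `with open(path, "rb") as f: print(looks_like_parquet(f))` would run the check, where `path` is whatever local path the object was saved to. A passing result only rules out truncation or a wrong file type; fully opening the file with a Parquet reader is still the stronger test.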

abhra-gupta-trakstar commented 7 months ago

We were able to trace the failure back to the Parquet file in S3, and the file doesn't look corrupted. (Two screenshots attached, taken 2024-02-27.)

matteofigus commented 7 months ago

What version were you using before updating to v0.66? Just to confirm, you didn't notice this behaviour before updating, is that right?

abhra-gupta-trakstar commented 7 months ago

We were using v0.64 before updating to v0.66. Correct, I can confirm we started noticing this behaviour after the update. There were no such incidents on v0.64; all deletion jobs were running successfully.

matteofigus commented 7 months ago

That's strange, because the only work we've done since 0.64 was improving performance on JSON. I see from the filename that this issue relates to an object that was recently created. Can you confirm you didn't change anything in the ingestion mechanism, for example by using different versions of pandas or similar libraries to produce the Parquet objects?

abhra-gupta-trakstar commented 7 months ago

Hi @matteofigus, I went back and confirmed with the team that we made no ingestion changes recently. The only change was upgrading s3f2 to 0.66.