awslabs / mountpoint-s3

A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
Apache License 2.0

Incomplete file cp to s3 mount #1133

Open snowch opened 1 week ago

snowch commented 1 week ago

Mountpoint for Amazon S3 version

1.10.0

AWS Region

n/a

Describe the running environment

Running on local S3 (Vast Data)

Mountpoint options

mount-s3 \
    --log-directory ~/s3.log \
    --debug-crt \
    --region VAST \
    --endpoint-url $AWS_ENDPOINT_URL \
    --allow-delete \
    --uid $(id -u jovyan) \
    --gid $(id -g jovyan) \
    --file-mode 0664 \
    --dir-mode 0775 \
    "$S3A_BUCKET" ${HOME}/s3

What happened?

Files copied to the S3 mount have a different checksum than the source file:

wget -c --quiet https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2024-01.parquet

rm ../s3/nyc-data/yellow_tripdata_2024-01.parquet
cp yellow_tripdata_2024-01.parquet ../s3/nyc-data/
shasum yellow_tripdata_2024-01.parquet
shasum ../s3/nyc-data/yellow_tripdata_2024-01.parquet

The output:

a73b714de7672a752a58de34826e08fae203e91b  yellow_tripdata_2024-01.parquet
2cf39315d5c8e2cab85703477c0169ac9785fcb9  ../s3/nyc-data/yellow_tripdata_2024-01.parquet
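
To tell a truncated copy apart from a corrupted one, it can help to compare file sizes and find the first differing byte, not only the digests. A minimal sketch, assuming `wc`, `cmp`, and `shasum` are on the PATH; `compare_copy` is a hypothetical helper name and not part of Mountpoint:

```shell
# compare_copy SRC DST: hypothetical helper, not part of Mountpoint.
# Prints both sizes and SHA-1 digests, then reports whether the files differ.
compare_copy() {
    src=$1
    dst=$2
    # Byte counts: a smaller destination suggests a truncated (incomplete) copy.
    echo "size   src=$(wc -c < "$src" | tr -d ' ') dst=$(wc -c < "$dst" | tr -d ' ')"
    # Same digest as the shasum calls above (SHA-1 is the shasum default).
    echo "sha1   src=$(shasum "$src" | cut -d' ' -f1)"
    echo "sha1   dst=$(shasum "$dst" | cut -d' ' -f1)"
    # cmp exits 0 when the files are identical and 1 when they differ; on a
    # difference it reports the first differing byte or EOF on the shorter file.
    if cmp "$src" "$dst"; then
        echo "result identical"
    else
        echo "result different"
        return 1
    fi
}
```

For the files in this report, it would be called as `compare_copy yellow_tripdata_2024-01.parquet ../s3/nyc-data/yellow_tripdata_2024-01.parquet`.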

Relevant log output

https://gist.github.com/snowch/e2fc06bd420d92f060ee6e985ea3ed73

monthonk commented 1 week ago

Hi, thanks for reporting the issue. Are the checksums always different when uploading with Mountpoint, or only on some occasions?

I also noticed that you are using third-party storage. Do you know whether it supports additional checksums?

Mountpoint computes checksums for your data by default and sends them along with the data so that data integrity can be verified on the server side. However, POSIX file operations like read and write do not offer a built-in integrity mechanism, so data integrity can still be lost in transit between your application and Mountpoint. More details are in the SEMANTICS doc.
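
Since the path between the application and Mountpoint is not covered by those checksums, one option is an application-level read-back check after the copy. A minimal sketch, assuming the destination is given as a full file path and is readable once `cp` returns; `verified_cp` and `sha256_of` are hypothetical helper names, not Mountpoint features:

```shell
# sha256_of FILE: print the SHA-256 hex digest using whichever tool is installed.
sha256_of() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d' ' -f1
    else
        shasum -a 256 "$1" | cut -d' ' -f1
    fi
}

# verified_cp SRC DST: hypothetical helper, not a Mountpoint feature.
# Copies SRC to DST, reads DST back, and fails if the digests differ.
verified_cp() {
    src=$1
    dst=$2
    cp "$src" "$dst" || return 1
    want=$(sha256_of "$src")
    got=$(sha256_of "$dst")
    # An empty digest means hashing failed, so treat it as a mismatch.
    if [ -z "$want" ] || [ "$want" != "$got" ]; then
        echo "checksum mismatch: $src=$want $dst=$got" >&2
        return 1
    fi
}
```

The check only confirms that the bytes read back match the bytes written. It does not show where a mismatch was introduced.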

snowch commented 1 week ago

The checksums are always different. Thanks for sharing the semantics information. I've used s3cmd for my use case for now.

monthonk commented 6 days ago

Thanks for confirming. The logs you provided only cover events up to the point where the MultipartUpload completes. Could you also share the relevant logs from the read operation?

snowch commented 6 days ago

Hopefully this has everything: https://gist.github.com/snowch/1b401dcb5fc4320ee33fce60c4bc28c0