arenadata / gpdb

Arenadata DB
https://docs.arenadata.io/en/ADB/current/introduction/intro.html
Apache License 2.0
40 stars 22 forks source link

Fix missing XML body in S3 CompleteMultipartUpload command #1092

Closed whitehawk closed 2 days ago

whitehawk commented 1 week ago

Fix missing XML body in S3 CompleteMultipartUpload command

Problem: Attempt to insert data into a writable external table with data file located on S3 storage ended with warning "InvalidRequest : You must specify at least one part". Though GPDB reported that data was inserted, no file had been created in the S3 bucket. So, actually the insert operation failed.

Cause: Data insertion into S3 external table is done via S3 Multipart upload. This approach requires the following commands:

  1. CreateMultipartUpload to start the process;
  2. one or more UploadPart to pass the data itself;
  3. CompleteMultipartUpload to finish the process.

CompleteMultipartUpload is a POST method request, that should have XML payload with a list of all parts. But the http analysis showed that actual POST request for CompleteMultipartUpload didn't contain the payload. As a consequence, the S3 storage replied with an error.

The reason for missing payload is the following: according to the CURL documentation, if data to POST is provided using the CURLOPT_READFUNCTION and CURLOPT_READDATA options (which is our case), "you must set the size of the data with the CURLOPT_POSTFIELDSIZE or CURLOPT_POSTFIELDSIZE_LARGE options". While in the 'S3RESTfulService::post()' the size of the data was set via CURLOPT_INFILESIZE_LARGE CURL option.

Fix: CURL option CURLOPT_INFILESIZE_LARGE is replaced with CURLOPT_POSTFIELDSIZE_LARGE.

Tests are not presented, as there is no existing test infrastructure to make an auto test with S3 storage.

BenderArenadata commented 1 week ago

Allure report https://allure.adsw.io/launch/83744

BenderArenadata commented 6 days ago

Allure report https://allure.adsw.io/launch/83840

RekGRpth commented 5 days ago

according to the CURL documentation, if data to POST is provided using the CURLOPT_READFUNCTION and CURLOPT_READDATA options (which is our case), "you must set the size of the data with the CURLOPT_POSTFIELDSIZE or CURLOPT_POSTFIELDSIZE_LARGE options".

The documentation doesn't quite have the same wording:

You can set the total size of the data you are sending by using CURLOPT_INFILESIZE_LARGE or CURLOPT_POSTFIELDSIZE_LARGE, depending on the type of transfer. For some transfer types it may be required and it allows for better error checking.

whitehawk commented 5 days ago

according to the CURL documentation, if data to POST is provided using the CURLOPT_READFUNCTION and CURLOPT_READDATA options (which is our case), "you must set the size of the data with the CURLOPT_POSTFIELDSIZE or CURLOPT_POSTFIELDSIZE_LARGE options".

The documentation doesn't quite have the same wording:

You can set the total size of the data you are sending by using CURLOPT_INFILESIZE_LARGE or CURLOPT_POSTFIELDSIZE_LARGE, depending on the type of transfer. For some transfer types it may be required and it allows for better error checking.

Refer to CURLOPT_POST:

"When providing data with a callback, you must transmit it using chunked transfer-encoding or you must set the size of the data with the CURLOPT_POSTFIELDSIZE or CURLOPT_POSTFIELDSIZE_LARGE options."

That is our case, in the S3RESTfulService::post() , CURLOPT_POST is set to 1.

RekGRpth commented 5 days ago

Attempt to insert data into a writable external table with data file located on S3 storage ended with warning "InvalidRequest : You must specify at least one part". Though GPDB reported that data was inserted, no file had been created in the S3 bucket. So, actually the insert operation failed.

Why not also make it so that in such cases an error is issued instead of a warning?

whitehawk commented 5 days ago

Attempt to insert data into a writable external table with data file located on S3 storage ended with warning "InvalidRequest : You must specify at least one part". Though GPDB reported that data was inserted, no file had been created in the S3 bucket. So, actually the insert operation failed.

Why not also make it so that in such cases an error is issued instead of a warning?

It would be a good update, but I think it is better to do it separately. Let's open a new ticket for it and fix with independent commit.

RekGRpth commented 5 days ago

Attempt to insert data into a writable external table with data file located on S3 storage ended with warning "InvalidRequest : You must specify at least one part". Though GPDB reported that data was inserted, no file had been created in the S3 bucket. So, actually the insert operation failed.

Why not also make it so that in such cases an error is issued instead of a warning?

It would be a good update, but I think it is better to do it separately. Let's open a new ticket for it and fix with independent commit.

Could you open new ticket for this?

whitehawk commented 5 days ago

Attempt to insert data into a writable external table with data file located on S3 storage ended with warning "InvalidRequest : You must specify at least one part". Though GPDB reported that data was inserted, no file had been created in the S3 bucket. So, actually the insert operation failed.

Why not also make it so that in such cases an error is issued instead of a warning?

It would be a good update, but I think it is better to do it separately. Let's open a new ticket for it and fix with independent commit.

Could you open new ticket for this?

ADBDEV-6562

whitehawk commented 3 days ago

Was this always bugged and no one ever bothered to check or curl stopped accepting ..INFILESIZE for POST at some point? It's weird that this line stayed there for 8 years.

Hard to say now. I see that this behavior was documented in the CURL at least from 2014 (since the commit). I may guess that the S3 storage, used for testing at the time this code was presented, was tolerable for the missing XML body (as the header aleady contains the upload ID, thus the S3 storage theoretically has enough information to complete the upload even without the payload, though it violates the protocol spec).

BenderArenadata commented 3 days ago

Allure report https://allure.adsw.io/launch/84109