IBM / ibm-cos-sdk-python-core

Python library not downloading full file with download_fileobj #17

Closed gaskinner84 closed 1 year ago

gaskinner84 commented 2 years ago

Detailed description

  1. When downloading a file from COS with the official Python library ibm-cos-sdk, the file is sometimes not fully downloaded when using the client.download_fileobj function.
  2. When the call is replaced with client.download_file, the same file that was giving issues downloads completely.

Additional information

Is this a known issue? Is there a downside to using client.download_file compared to client.download_fileobj? The documentation for download_fileobj says 'This is a managed transfer which will perform a multipart download in multiple threads if necessary.' Does this mean download_file does not support this? See: https://ibm.github.io/ibm-cos-sdk-python/reference/services/s3.html#S3.Client.download_fileobj

The customer has attached the files produced by each method, and we can see that their MD5 checksums do not match. IBM's COS logs show that this object was requested (GET) at 263,136 bytes. From a COS perspective, we do not see any difference between the two requests. Both requests are REST.GET.OBJECTS with the same request_length and object_length.

Can the SDK team help this customer find out why the file differs between the two download methods?

download_file_code_blocks (1).txt 6aa76260-26c4-48dd-aca9-1f21eb9edc3a (2).txt 6aa76260-26c4-48dd-aca9-1f21eb9edc3a.txt

6aa76260-26c4-48dd-aca9-1f21eb9edc3a.txt - download_file (correct file) - MD5: 10540b995152204b5d66a0b25ef7427a
6aa76260-26c4-48dd-aca9-1f21eb9edc3a (2).txt - download_fileobj - MD5: 3fe345ab339f6f3c29a16455f3866f2f

Regards,

Gary S. ACS-Storage Support Lead IBM Cloud Support

gaskinner84 commented 2 years ago

Also, the file extensions have been changed from .stl to .txt so that the files could be uploaded here.

amukherjee28 commented 2 years ago

hi @gaskinner84

I tried out the scenario described. I downloaded a 765.2 MB file from one of my COS buckets using both APIs:

# Create client connection
cos_cli = ibm_boto3.client("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT,
    ibm_auth_endpoint=COS_AUTH_ENDPOINT
)

#using download_fileobj
with open("log5.txt", 'wb') as data:
    cos_cli.download_fileobj(new_bucket_name, new_text_file_name, data)

#using download_file
cos_cli.download_file(new_bucket_name, new_text_file_name, 'log6.txt')

After the downloads complete, I see that both files were downloaded completely.

-rw-r--r--  1 arnabmukherjee  staff  802374840 Sep  6 01:03 log5.txt
-rw-r--r--  1 arnabmukherjee  staff  802374840 Sep  6 01:08 log6.txt

Let me look into this issue a bit more and I will get back with more updates.

Meanwhile, let me know whether the example shown above is similar to what is being attempted.

Thanks

gaskinner84 commented 2 years ago

Thanks for your response, Arnab.

Can you give me the IBM COS bucket name you used for your attempt? I can verify if the requests look the same.

I believe the client is using:

    with NamedTemporaryFile(suffix='.stl') as f:
        try:
            cos_client.download_fileobj(bucket_name, key, f)
            end_time = time()

            logger.info("Reading file {}, file has a size of {}MB".format(key, stat(f.name).st_size / 1000000.0))

            save_new_metric("file_size", stat(f.name).st_size / 1000000.0, "MB", "Case file size ({})".format(key))
            save_new_metric("time_download_case_file", end_time - start_time, "", "Download case file time")

            return pymesh.load_mesh(f.name)

        except ibm_botocore.exceptions.BotoCoreError as e:
            logger.error("ibm_botocore.exceptions.BotoCoreError: {}".format(e))
            raise COSDownloadError(file_name=file_name)
        except IOError as e:
            logger.error("IOError: {}".format(e))
            raise STLParseError(file_name=file_name)
        except Exception as e:
            logger.error("Exception: {}".format(e))
            raise STLParseError(file_name=file_name)
amukherjee28 commented 2 years ago

The bucket used in the example was clibucket1. It has a file named logFile, which I downloaded. I created the COS instance with my personal ID, and the bucket is in that instance.

gaskinner84 commented 2 years ago

Thank you for that. Is there any way to debug why the client is getting different file sizes with each method?

gaskinner84 commented 2 years ago

Also,

Can you try a smaller file and look at the results? The customer file was 263,136 bytes long.

I will also ask the customer to try a larger file to see if the results are the same.

Regards,

Gary S. ACS-Storage Support Lead IBM Cloud Support

amukherjee28 commented 2 years ago

Hi Gary,

I tried with a smaller file as well and got a similar result. I suspect something in the customer's environment may be causing the issue.

Also, could you tell me which version of ibm-cos-sdk and which Python version are being used to run the application?

Please let me know if you would like to set up a call to check how things are being handled on the customer's end.

Thanks.

arnabm28 commented 2 years ago

Hi,

Regarding the above issue, we investigated further, and there are three things I want to point out.

We tried to reproduce the application code as closely as possible, based on the code snippet provided by the customer. The object was downloaded using both APIs, and the MD5 signatures before and after the download were checked; they all match. Here is the code snippet we tried.

from tempfile import NamedTemporaryFile
import ibm_boto3
from ibm_botocore.client import Config
from ibm_botocore.exceptions import ClientError

COS_ENDPOINT = "<ENDPOINT>" 
COS_API_KEY_ID = "<API KEY>" 
COS_STORAGE_CLASS = "<storage class>"
COS_INSTANCE_CRN = "<CRN>" 
COS_AUTH_ENDPOINT = "<TOKEN>"

# Create client connection
cos_cli = ibm_boto3.client("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_ENDPOINT,
    ibm_auth_endpoint=COS_AUTH_ENDPOINT
)

new_bucket_name = "clibucket1"
new_text_file_name = "custLog.txt"

#using download_fileobj
with NamedTemporaryFile(suffix='.stl',delete=False) as f:
    print(f.name)
    cos_cli.download_fileobj(new_bucket_name, new_text_file_name, f)

with NamedTemporaryFile(suffix='.stl',delete=False) as f:    
    print(f.name)
    cos_cli.download_file(new_bucket_name, new_text_file_name, f.name)

Results after the application run

python3 issue_17_1.py 
/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpkw3bnk3i.stl
/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpwtt2_nzf.stl

MD5 /var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpkw3bnk3i.stl /var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpwtt2_nzf.stl
MD5 (/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpkw3bnk3i.stl) = 10540b995152204b5d66a0b25ef7427a
MD5 (/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpwtt2_nzf.stl) = 10540b995152204b5d66a0b25ef7427a

The MD5 signatures match exactly after the object was downloaded.

This brings us to a few points to check in the customer's environment.

  1. Are the files/objects being downloaded the same size? Can this be reverified? Also, is the same object being downloaded in both cases?

  2. Does the save_new_metric method modify the file in any way? That could change the file content, hence the question.

  3. Please reverify the MD5 signatures after download. We see no difference at all.
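
One way to run the first and third checks is to compare the local file with what the service reports for the object. The following is a minimal sketch, not part of the SDK: the helper names local_size_and_md5 and compare_with_object are hypothetical, and it assumes the client exposes head_object with the usual S3 response fields ContentLength and ETag. The ETag only equals the MD5 of the content for objects uploaded in a single part, so a mismatch there is not conclusive on its own.

```python
import hashlib
import os


def local_size_and_md5(path, chunk_size=1024 * 1024):
    """Return (size_in_bytes, md5_hex) for a closed file on disk."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return os.stat(path).st_size, digest.hexdigest()


def compare_with_object(client, bucket, key, path):
    """Compare a local file with the size and ETag reported by head_object.

    `client` is assumed to be an S3-compatible client, for example the one
    returned by ibm_boto3.client("s3", ...).
    """
    head = client.head_object(Bucket=bucket, Key=key)
    remote_size = head["ContentLength"]
    remote_etag = head["ETag"].strip('"')  # the ETag is returned quoted
    size, md5_hex = local_size_and_md5(path)
    return {
        "local_size": size,
        "remote_size": remote_size,
        "size_matches": size == remote_size,
        # Only meaningful for objects uploaded in a single part.
        "etag_matches": md5_hex == remote_etag,
    }
```

The comparison should be run only after the downloaded file has been closed (or flushed), so that the size on disk is final.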

Thanks.

bschichtel commented 2 years ago

@arnabm28

Please see the customer's update and share your subsequent analysis. I've also relayed the same to the COS team as well.

Response from customer


I quickly ran the test code you provided, with some extra logging to provide the information needed to find the new COS logs. The 'save_new_metric' function does not read or write the downloaded file.

Output (filesize in bytes, md5, bucket, filename, function, time in utc):

python test.py
/tmp/tmprybkm8v8.stl
262144 3fe345ab339f6f3c29a16455f3866f2f dev-helios-case cases/4248/6aa76260-26c4-48dd-aca9-1f21eb9edc3a.stl download_fileobj 2022-09-14 17:44:40.260185+00:00
/tmp/tmpy74y8_1l.stl
263136 10540b995152204b5d66a0b25ef7427a dev-helios-case cases/4248/6aa76260-26c4-48dd-aca9-1f21eb9edc3a.stl download_file 2022-09-14 17:45:40.496874+00:00

Code:


from tempfile import NamedTemporaryFile
from os import stat
import ibm_boto3
from ibm_botocore.client import Config
from ibm_botocore.exceptions import ClientError
import hashlib
from datetime import datetime,timezone
from time import sleep

COS_ENDPOINT = ""
COS_API_KEY_ID = ""
COS_INSTANCE_CRN = ""
COS_AUTH_ENDPOINT = ""

# Create client connection
cos_cli = ibm_boto3.client("s3",
    ibm_api_key_id=COS_API_KEY_ID,
    ibm_service_instance_id=COS_INSTANCE_CRN,
    config=Config(signature_version="oauth"),
    endpoint_url=COS_AUTH_ENDPOINT
)

new_bucket_name = "dev-helios-case"
new_text_file_name = "cases/4248/6aa76260-26c4-48dd-aca9-1f21eb9edc3a.stl"

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

#using download_fileobj
with NamedTemporaryFile(suffix='.stl',delete=False) as f:
    print(f.name)
    cos_cli.download_fileobj(new_bucket_name, new_text_file_name, f)
    print(stat(f.name).st_size, md5(f.name), new_bucket_name, new_text_file_name, "download_fileobj", datetime.now(timezone.utc))

sleep(30)

#using download_file
with NamedTemporaryFile(suffix='.stl',delete=False) as f:
    print(f.name)
    cos_cli.download_file(new_bucket_name, new_text_file_name, f.name)
    print(stat(f.name).st_size, md5(f.name), new_bucket_name, new_text_file_name, "download_file", datetime.now(timezone.utc))

I used sleep(60) rather than the sleep(30) pasted in the code section to get that output, which is why the two files were downloaded one minute apart.
arnabm28 commented 2 years ago

Hi @bschichtel

I see what the issue is here.

The issue is in the implementation of this method:

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

When the MD5 is calculated with this method, which uses Python's hashlib module, the returned MD5 values differ.

But when the same downloaded files are compared using the MD5 command-line tool, it returns the same value for both.

###VALUES RETURNED FROM THE CODE
python3 custFile_issue17.py 
/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpkqt9m4tr.stl
262144 3fe345ab339f6f3c29a16455f3866f2f clibucket1 custLog.txt download_fileobj 2022-09-19 20:55:08.413940+00:00
/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpzgau88aq.stl
263136 10540b995152204b5d66a0b25ef7427a clibucket1 custLog.txt download_file 2022-09-19 20:55:42.409479+00:00

###VALUES RETURNED FROM CLI MD5
arnabmukherjee@Arnabs-MBP.Dlink:~/work/COS-lab-hands-on/python-sdk$>MD5 /var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpkqt9m4tr.stl /var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpzgau88aq.stl
MD5 (/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpkqt9m4tr.stl) = 10540b995152204b5d66a0b25ef7427a
MD5 (/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpzgau88aq.stl) = 10540b995152204b5d66a0b25ef7427a

Overall, no difference is seen in the MD5 signatures.

Coming to the difference in file size:

It's a similar thing. Python returns a different file size when stat(f.name).st_size is used.

However, checking the file sizes from the shell prompt returns the same size for both.

-rw-------  1 arnabmukherjee  staff  263136 Sep 20 02:25 tmpkqt9m4tr.stl
-rw-r--r--  1 arnabmukherjee  staff  263136 Sep 20 02:25 tmpzgau88aq.stl

The values are interpreted differently when using the Python functions, but the downloaded files are the same.

Can you check the values in the customer's environment and compare them in the same way I have shown?

bschichtel commented 2 years ago

Hello, we've received the following update. Knowing that the issue is related to Python's "hashlib.md5", do you have any suggestion for working around the problem, possibly a checksum routine other than md5, such as "hashlib.sha256"?


The SDK team has already identified that the issue is related to the md5sum implementation in the customer code: https://github.com/IBM/ibm-cos-sdk-python-core/issues/17#issuecomment-1251545446

This is the function that the customer provided that does the md5sum hashing in their app:

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

They are using the Python function "hashlib.md5()", which is not part of the COS Python SDK. Please suggest that the customer reach out to the team that supports the hashlib library: https://docs.python.org/3/library/hashlib.html

Also please post this question in the github ticket that was filed by your team with the COS SDK team. Perhaps they can suggest something else: https://github.com/IBM/ibm-cos-sdk-python-core/issues/17
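
As a side note on the sha256 question: hashing a file in 4096-byte chunks gives the same digest as hashing its bytes in one call, for md5 and for sha256 alike, so switching algorithms would not by itself change the result if the bytes read from disk are incomplete. A minimal sketch to check this locally follows; file_digest is a hypothetical helper name, not part of the SDK or of the customer's code.

```python
import hashlib


def file_digest(path, algorithm="md5", chunk_size=4096):
    """Hash a file in fixed-size chunks with any algorithm hashlib supports."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def one_shot_digest(path, algorithm="md5"):
    """Hash the whole file in a single call, for comparison."""
    with open(path, "rb") as fh:
        return hashlib.new(algorithm, fh.read()).hexdigest()
```

For any file that is closed and complete on disk, file_digest(path, "sha256") and one_shot_digest(path, "sha256") should return the same string, and the same holds for "md5".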

arnabm28 commented 2 years ago

@bschichtel

As the issue in its current state has no direct impact on the SDK, can we close it? Working out how to compute the MD5 checksum of a file is a separate matter.

Thanks.

arnabm28 commented 2 years ago

Also, I was able to solve the issue with the code here. There is no issue with the md5 implementation either; only a small correction is needed in the way the test code is written. After each download, the file's write buffer has not yet been flushed when the file is read back through its name, so an incorrect size and MD5 checksum are shown. Closing the file after each download, before calling stat() and md5(), solves the issue.

Please pass on this new code to the customer and the issue should be resolved.

def md5(fname):
    hash_md5 = hashlib.md5()
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
    #return hashlib.md5(open(fname,'rb').read()).hexdigest()

#using download_fileobj
with NamedTemporaryFile(suffix='.stl',delete=False) as f1:
    print(f1.name)
    cos_cli.download_fileobj(new_bucket_name, new_text_file_name, f1)
f1.close()
print(stat(f1.name).st_size, md5(f1.name), new_bucket_name, new_text_file_name, "download_fileobj", datetime.now(timezone.utc))

sleep(5)

with NamedTemporaryFile(suffix='.stl',delete=False) as f2:
    print(f2.name)
    cos_cli.download_file(new_bucket_name, new_text_file_name, f2.name)
f2.close()
print(stat(f2.name).st_size, md5(f2.name), new_bucket_name, new_text_file_name, "download_file", datetime.now(timezone.utc))

Here's the output that I see

python3 custFile_issue17.py 
/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmp10lxanl2.stl
263136 10540b995152204b5d66a0b25ef7427a clibucket1 custLog.txt download_fileobj 2022-09-30 07:18:28.834874+00:00
/var/folders/x1/hjvhxfwd1nqc8my3_hhk84_w0000gn/T/tmpga4gspcc.stl
263136 10540b995152204b5d66a0b25ef7427a clibucket1 custLog.txt download_file 2022-09-30 07:18:40.926206+00:00
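
The effect can be reproduced without COS at all. The sketch below is an illustration under simple assumptions (a Unix-like system and Python's default buffered file objects): it writes a small payload to a NamedTemporaryFile and checks the on-disk size before and after flush().

```python
import os
from tempfile import NamedTemporaryFile

payload = b"x" * 1000  # smaller than Python's default write buffer

with NamedTemporaryFile(suffix=".stl") as f:
    f.write(payload)
    # The bytes are still in the file object's buffer, not yet on disk.
    size_before_flush = os.stat(f.name).st_size
    f.flush()
    # After flush() the on-disk size reflects everything written.
    size_after_flush = os.stat(f.name).st_size

print(size_before_flush, size_after_flush)
```

The first number is expected to be smaller than the second. This is consistent with the customer's output: 262144 bytes is exactly 256 KiB, which fits with whole chunks having reached disk while the final 992 bytes were still buffered when stat() and md5() ran.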
bschichtel commented 2 years ago

@arnabm28

We relayed your response to our customer, who in turn came back with the following. Feel free to archive this issue, and thanks for the assistance.

===========

I am, however, not sure that this is a reliable solution. According to https://docs.python.org/3/library/tempfile.html:

On TemporaryFile: Return a file-like object that can be used as a temporary storage area. The file is created securely, using the same rules as mkstemp(). It will be destroyed as soon as it is closed (including an implicit close when the object is garbage collected). Under Unix, the directory entry for the file is either not created at all or is removed immediately after the file is created. Other platforms do not support this; your code should not rely on a temporary file created using this function having or not having a visible name in the file system.

On NamedTemporaryFile: This function operates exactly as TemporaryFile() does, except that the file is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked).

Anyhow, I think I have enough information to create a workaround. Thank you for the assistance.
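
One possible shape for such a workaround, given that a NamedTemporaryFile created without delete=False disappears as soon as it is closed, is to keep the file open and call flush() before anything reads it by name. This is a sketch under assumptions, not the customer's actual fix: download_and_measure is a hypothetical helper, the client is assumed to expose download_fileobj(bucket, key, fileobj) as used earlier in this thread, and reopening an open temporary file by name is assumed to work (it does on Unix-like systems).

```python
import hashlib
import os
from tempfile import NamedTemporaryFile


def download_and_measure(client, bucket, key, suffix=".stl"):
    """Download into a temp file and return (size, md5_hex) read via its name.

    The temporary file is removed automatically when the with block exits.
    """
    with NamedTemporaryFile(suffix=suffix) as f:
        client.download_fileobj(bucket, key, f)
        # Push buffered bytes to disk before anything reads f.name.
        f.flush()
        size = os.stat(f.name).st_size
        with open(f.name, "rb") as fh:
            md5_hex = hashlib.md5(fh.read()).hexdigest()
        return size, md5_hex
```

In the application code, the same f.flush() call would go right after download_fileobj and before stat(f.name) and pymesh.load_mesh(f.name).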

arnabm28 commented 2 years ago

@bschichtel You can close on this issue. Thanks