Azure / azure-storage-python

Microsoft Azure Storage Library for Python
https://azure-storage.readthedocs.io
MIT License

MD5 validation using Storage SDK is not happening while databricks filesystem API reports MD5 mismatch #627

Closed pchinta closed 4 years ago

pchinta commented 4 years ago

Which service(blob, file, queue) does this issue concern?

Blob

Which version of the SDK was used? Please provide the output of pip freeze.

azure-storage-blob 2.1.0

What problem was encountered?

get_blob_to_bytes(container, missing_filename, validate_content=True).content does NOT report any MD5 mismatch, while the Azure Databricks filesystem API (%fs head, or with open(file) as f: print(f.read())) reports an MD5 mismatch error and the read fails.

Have you found a mitigation/solution?

No

Note: for table service, please post the issue here instead: https://github.com/Azure/azure-cosmosdb-python.

zezha-msft commented 4 years ago

Hi @pchinta, thanks for reaching out.

Could you please provide a bit more detail so that we can try to repro this issue? How big was the file? Have you validated the stored MD5 on your own?

pchinta commented 4 years ago

Hi,

The blob is stored in a storage account owned by Blaize (copied on this thread). The blob is about 4 MB and is an Outlook mail. Since it contains PII we do not have a copy of the blob, so we have not validated its MD5 ourselves. Attached is the error reported for the same blob by the Databricks filesystem API.

Caused by: com.microsoft.azure.storage.StorageException: Blob data corrupted (integrity check failed), Expected value is c1b80a33c85d2b1cb092d92fc817324d, retrieved wbgKM8hdKxywktkvyBcyTQ== at com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:466) at com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.read(DataInputStream.java:149)

@Blaize Berry (Blaize.Berry@walmart.com), could you please help the Azure Storage SDK team understand how the MD5 is computed so that they can set up a repro?

Best Regards, Purna Chandra Rao Chinta, Support Escalation Engineer, Microsoft Azure Rapid Response Team


at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737) at com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:466) at com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.read(DataInputStream.java:149) at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:855) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.readFully(DataInputStream.java:195) at java.io.DataInputStream.readFully(DataInputStream.java:169) at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$head$1.apply(DBUtilsCore.scala:200) at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$head$1.apply(DBUtilsCore.scala:190) at com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81) at com.databricks.backend.daemon.dbutils.FSUtils$.head(DBUtilsCore.scala:190) at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.head(DbfsUtilsImpl.scala:53) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-940836796107301:1) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw$$iw$$iw.(command-940836796107301:44) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw$$iw.(command-940836796107301:46) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw.(command-940836796107301:48) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw.(command-940836796107301:50) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw.(command-940836796107301:52) at line75fe9781da5b4e23ae11050a735562e331.$read.(command-940836796107301:54) at line75fe9781da5b4e23ae11050a735562e331.$read$.(command-940836796107301:58) at line75fe9781da5b4e23ae11050a735562e331.$read$.(command-940836796107301) at line75fe9781da5b4e23ae11050a735562e331.$eval$.$print$lzycompute(:7) at line75fe9781da5b4e23ae11050a735562e331.$eval$.$print(:6) at line75fe9781da5b4e23ae11050a735562e331.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644) at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572) at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215) at 
com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:679) at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:632) at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:368) at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:345) at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48) at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48) at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:345) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644) at scala.util.Try$.apply(Try.scala:192) at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639) at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485) at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597) at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390) at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337) at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219) at java.lang.Thread.run(Thread.java:748) Caused by: com.microsoft.azure.storage.StorageException: Blob data corrupted (integrity check failed), Expected value is c1b80a33c85d2b1cb092d92fc817324d, retrieved wbgKM8hdKxywktkvyBcyTQ== at com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:466) at com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.read(DataInputStream.java:149) at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:855) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.readFully(DataInputStream.java:195) at java.io.DataInputStream.readFully(DataInputStream.java:169) at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$head$1.apply(DBUtilsCore.scala:200) at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$head$1.apply(DBUtilsCore.scala:190) at 
com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81) at com.databricks.backend.daemon.dbutils.FSUtils$.head(DBUtilsCore.scala:190) at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.head(DbfsUtilsImpl.scala:53) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-940836796107301:1) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw$$iw$$iw.(command-940836796107301:44) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw$$iw.(command-940836796107301:46) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw$$iw.(command-940836796107301:48) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw$$iw.(command-940836796107301:50) at line75fe9781da5b4e23ae11050a735562e331.$read$$iw.(command-940836796107301:52) at line75fe9781da5b4e23ae11050a735562e331.$read.(command-940836796107301:54) at line75fe9781da5b4e23ae11050a735562e331.$read$.(command-940836796107301:58) at line75fe9781da5b4e23ae11050a735562e331.$read$.(command-940836796107301) at line75fe9781da5b4e23ae11050a735562e331.$eval$.$print$lzycompute(:7) at line75fe9781da5b4e23ae11050a735562e331.$eval$.$print(:6) at line75fe9781da5b4e23ae11050a735562e331.$eval.$print() at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793) at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645) at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644) at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31) at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19) at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576) at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572) at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:679) at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:632) at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:197) at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:368) at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$8.apply(DriverLocal.scala:345) at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:48) at 
com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:271) at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:48) at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:345) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644) at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644) at scala.util.Try$.apply(Try.scala:192) at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639) at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485) at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597) at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390) at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337) at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219) at java.lang.Thread.run(Thread.java:748)

pchinta commented 4 years ago

We are not explicitly computing the MD5 of the blob. I believe that functionality is provided out of the box via the Python SDK when the validate_content flag is set to True. I believe that there is some subtlety to the computation in that an MD5 is only evaluated for the first 4 MB with that flag set to true for performance purposes (see https://github.com/Azure/azure-storage-python/blob/master/azure-storage-blob/azure/storage/blob/baseblobservice.py#L1923 ), but the particular blob that we’re encountering hash mismatch errors with in Databricks mounted storage is less than 1 MB in size so this shouldn’t be a factor. I can’t comment with 100% certainty on how Databricks computes the MD5 with their mounted storage setup, but from the logs I’ve seen it looks like they’re using the Java SDK for Azure Storage under the hood.
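For reference, a minimal sketch of the Python call being discussed (assumptions: azure-storage-blob 2.1.0, and the account, container, and blob names below are placeholders). With validate_content=True the SDK validates a transactional MD5 on each downloaded chunk rather than comparing against the blob-level Content-MD5 property:

import os

from azure.storage.blob import BlockBlobService

# Placeholder account/credentials; in practice these come from configuration.
service = BlockBlobService(account_name="mystorageaccount",
                           account_key=os.environ["ACCOUNT_KEY"])

# validate_content=True makes the SDK request a transactional MD5 for each
# downloaded chunk and raise if a chunk fails the check; it does not check
# anything against the blob's stored Content-MD5 property.
blob = service.get_blob_to_bytes("mycontainer",        # placeholder container
                                 "mail/message.msg",   # placeholder blob name
                                 validate_content=True)

print(len(blob.content))
print(blob.properties.content_settings.content_md5)   # stored Content-MD5, if set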

Best, Blaize Berry, Staff Software Engineer (Machine Learning)


pchinta commented 4 years ago

Hi,

We can compute the MD5 of the blob in the storage account and check whether the MD5 stored on the blob is correct:

https://github.com/giventocode/azure-blob-md5

For Windows:

  1. Install Go (as described on the linked page).
  2. Create a folder for the tool, open a command prompt (as administrator), browse to the folder, and run the commands below:

go get github.com/giventocode/azure-blob-md5

go build -o bmd5.exe github.com/giventocode/azure-blob-md5

  3. Create environment variables ACCOUNT_NAME and ACCOUNT_KEY with the values for the storage account in scope of the issue.
  4. Open another command prompt (as administrator), since the environment variables created above will not be visible in the prompt opened in step 2.
  5. Browse to the folder containing bmd5.exe (from step 2) and run the command below, substituting the blob and container names:

bmd5 -b blob -c container

  6. Compare the MD5 the tool reports against (a) the MD5 reported in the error and (b) the Content-MD5 property of the blob (shown in the Azure portal). If the tool's value matches the stored Content-MD5, then the MD5 computed by the Databricks filesystem API is at fault. A Python alternative for the same check is sketched below.
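If a Go toolchain is not handy, a rough Python equivalent of the check above (a sketch, assuming azure-storage-blob 2.1.0 and placeholder account, container, and blob names) is to download the blob, hash it locally, and compare the result with both the value in the error and the stored Content-MD5 property, which the service keeps base64-encoded:

import base64
import hashlib
import os

from azure.storage.blob import BlockBlobService

# Placeholder account/credentials.
service = BlockBlobService(account_name="mystorageaccount",
                           account_key=os.environ["ACCOUNT_KEY"])

blob = service.get_blob_to_bytes("mycontainer", "myblob")  # placeholder names

digest = hashlib.md5(blob.content).digest()
print("hex digest:   ", digest.hex())                       # compare with the value in the error message
print("base64 digest:", base64.b64encode(digest).decode())  # compare with Content-MD5 shown in the portal
print("stored value: ", blob.properties.content_settings.content_md5)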

Best Regards, Purna Chandra Rao Chinta

pchinta commented 4 years ago

Hi Purna,

I have computed the MD5 of the blob in the storage account and got c1b80a33c85d2b1cb092d92fc817324d, which is consistent with the ContentMD5 blob property.

Best, Blaize Berry, Staff Software Engineer (Machine Learning)


pchinta commented 4 years ago

Also, when I try to retrieve the blob with the Databricks fs utilities I get the error below. Please note the checksum values that I’ve bolded in the stack trace. As you said, it appears that the Databricks FSUtils is computing the checksum incorrectly:

ExecutionError: An error occurred while calling z:com.databricks.backend.daemon.dbutils.FSUtils.head. : java.io.IOException at com.microsoft.azure.storage.core.Utility.initIOException(Utility.java:737) at com.microsoft.azure.storage.blob.BlobInputStream.readInternal(BlobInputStream.java:466) at com.microsoft.azure.storage.blob.BlobInputStream.read(BlobInputStream.java:420) at java.io.BufferedInputStream.read1(BufferedInputStream.java:284) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.read(DataInputStream.java:149) at shaded.databricks.org.apache.hadoop.fs.azure.NativeAzureFileSystem$NativeAzureFsInputStream.read(NativeAzureFileSystem.java:855) at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at java.io.BufferedInputStream.read(BufferedInputStream.java:345) at java.io.DataInputStream.readFully(DataInputStream.java:195) at java.io.DataInputStream.readFully(DataInputStream.java:169) at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$head$1.apply(DBUtilsCore.scala:200) at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$head$1.apply(DBUtilsCore.scala:190) at com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81) at com.databricks.backend.daemon.dbutils.FSUtils$.head(DBUtilsCore.scala:190) at com.databricks.backend.daemon.dbutils.FSUtils.head(DBUtilsCore.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380) at py4j.Gateway.invoke(Gateway.java:295) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:251) at java.lang.Thread.run(Thread.java:748) Caused by: com.microsoft.azure.storage.StorageException: Blob data corrupted (integrity check failed), Expected value is c1b80a33c85d2b1cb092d92fc817324d, retrieved wbgKM8hdKxywktkvyBcyTQ== ... 27 more

Best, Blaize Berry, Staff Software Engineer (Machine Learning)


pchinta commented 4 years ago

Hi Blaize,

As we understood from the call, the Content-MD5 is being computed manually by your application, which is why we were seeing a different MD5 value.

As agreed we will close this thread. Thank you for your help.

Best Regards, Purna Chandra Rao Chinta
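For what it is worth, that explanation is consistent with the two values in the stack traces above: the expected hex string and the retrieved base64 string appear to encode the same 16-byte digest and differ only in representation, which would point at the format of the manually computed Content-MD5 rather than at corrupted data. A quick Python check (a sketch using the values quoted in the error):

import base64

expected_hex = "c1b80a33c85d2b1cb092d92fc817324d"  # "Expected value" from the stack trace (hex string)
retrieved_b64 = "wbgKM8hdKxywktkvyBcyTQ=="         # "retrieved" value from the stack trace (base64)

# Both strings encode the same MD5 digest; only the encoding differs.
assert base64.b64encode(bytes.fromhex(expected_hex)).decode() == retrieved_b64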


zezha-msft commented 4 years ago

@pchinta I see the issue was resolved.

To clarify, validate_content verifies the individual chunks when downloading a large blob.
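A small sketch of what that chunked validation looks like in the 2.x Python SDK (assumptions: the MAX_SINGLE_GET_SIZE and MAX_CHUNK_GET_SIZE class attributes and their defaults, plus placeholder account, container, and blob names):

import os

from azure.storage.blob import BlockBlobService

service = BlockBlobService(account_name="mystorageaccount",
                           account_key=os.environ["ACCOUNT_KEY"])  # placeholders

print(service.MAX_SINGLE_GET_SIZE)  # size of the initial GET when not validating (32 MB by default)
print(service.MAX_CHUNK_GET_SIZE)   # per-chunk size used when validate_content=True (4 MB by default)

# Each ranged GET below carries a transactional Content-MD5 header that the SDK
# verifies; the blob-level Content-MD5 property is not consulted.
blob = service.get_blob_to_bytes("mycontainer", "myblob", validate_content=True)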

On a side note, please avoid exposing the customer's contact information on GitHub.