delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
Apache License 2.0

CaUsedAsEndEntity error in Microsoft Fabric #2449

Open martroben opened 2 months ago

martroben commented 2 months ago


Delta-rs version: Python deltalake-0.17.1
Cloud provider: Microsoft (North Europe)
Environment: Microsoft Fabric Notebook
OS: Fabric VM (CBL-Mariner Linux)


What happened: While trying to create a Delta Table from a path in a Fabric Notebook, I'm getting the following error:

OSError: Generic MicrosoftAzure error: Error after 10 retries in 1.992440924s, max_retries:10, retry_timeout:180s, source:error sending request for url (<workspace id>/<lakehouse id>/Tables/some_table/_delta_log/_last_checkpoint): error trying to connect: invalid peer certificate: Other(CaUsedAsEndEntity)

What you expected to happen: To get a deltalake.DeltaTable instance without any errors.

How to reproduce it: Run the following code in a Fabric Notebook:

import deltalake
import trident_token_library_wrapper

workspace_id = "xxxxxxxx-xxxxx-xxxxx-xxxxx-xxxxxxxxxxxxxx"
lakehouse_id = "yyyyyyyy-yyyyy-yyyyy-yyyyy-yyyyyyyyyyyyy"
path = f"abfss://{workspace_id}{lakehouse_id}/Tables/some_table"

storage_options = {}
storage_options["bearer_token"] = trident_token_library_wrapper.PyTridentTokenLibrary.get_access_token("storage")
storage_options["use_fabric_endpoint"] = "true"
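# storage_options["allow_invalid_certificates"] = "true"

# Assumed final step (not shown in the report): opening the table is what raises the error
table = deltalake.DeltaTable(path, storage_options=storage_options)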


More details: storage_options["allow_invalid_certificates"] = "true" can be used as a quickfix.
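A minimal sketch of the quick fix described above, using only the option names that appear in this report. The helper name `build_storage_options`, its `skip_cert_check` parameter, and the placeholder token are hypothetical; in a Fabric Notebook the token would come from `trident_token_library_wrapper` as in the reproduction code. Note that `allow_invalid_certificates` disables certificate validation for the connection, so it is a stopgap rather than a real fix.

```python
def build_storage_options(bearer_token: str, skip_cert_check: bool = False) -> dict:
    """Build the storage_options dict used in the reproduction code.

    Hypothetical helper: only the three option keys below come from this report.
    """
    options = {
        "bearer_token": bearer_token,
        "use_fabric_endpoint": "true",
    }
    if skip_cert_check:
        # Quick fix from the report: turns off TLS certificate validation.
        options["allow_invalid_certificates"] = "true"
    return options


# Placeholder token for illustration only.
options = build_storage_options("<token>", skip_cert_check=True)
print(options)
```

The resulting dict would then be passed as `storage_options` when opening the table, for example `deltalake.DeltaTable(path, storage_options=options)`.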

Here are the certificate details fetched by openssl s_client -showcerts -connect in the Fabric Notebook:

depth=0 C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress =, CN =
verify return:1
Certificate chain
 0 s:C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress =, CN =
   i:C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress =, CN =
Server certificate
subject=C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress =, CN =

issuer=C = US, ST = Washington, L = Redmond, O = MicrosoftData, OU = SparkDepartment, emailAddress =, CN =

No client certificate CA names sent
Peer signing digest: SHA256
Peer signature type: RSA-PSS
Server Temp Key: X25519, 253 bits
SSL handshake has read 1881 bytes and written 407 bytes
Verification: OK
New, TLSv1.3, Cipher is TLS_AES_256_GCM_SHA384
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
No ALPN negotiated
Early data was not sent
Verify return code: 0 (ok)

It doesn't seem to be a Certificate Authority certificate. More like a self-signed certificate, so I don't know why the error is CaUsedAsEndEntity.

Interestingly, the same openssl operation used to give a self signed certificate error (see this deltalake issue for details), but it seems that something has changed in the openssl setup of the underlying Fabric VMs.

If anyone has any ideas for how to start solving this new issue (other than using the "allow_invalid_certificates"-hammer in perpetuity), I would be most thankful.

martroben commented 2 months ago

Upon further investigation, the correct command to check whether the OneLake certificate is a Certificate Authority or an End Entity certificate is this:

openssl s_client -connect -showcerts | openssl x509 -text | grep "Basic Constraints" -A 1

This returns CA:TRUE, signifying that the certificate used is in fact a Certificate Authority cert, as the error suggests.
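The same check can be sketched in Python by scanning the text that `openssl x509 -text` prints. This is only an illustration: the function name `basic_constraints_ca` is hypothetical, and the sample text below is a hand-written excerpt in the usual openssl layout, not output captured from OneLake.

```python
import re
from typing import Optional


def basic_constraints_ca(x509_text: str) -> Optional[bool]:
    """Return True for CA:TRUE, False for CA:FALSE, None if the extension is absent."""
    match = re.search(r"Basic Constraints:[^\n]*\n\s*CA:(TRUE|FALSE)", x509_text)
    if match is None:
        return None
    return match.group(1) == "TRUE"


# Illustrative excerpt in the shape `openssl x509 -text` prints.
sample = """
        X509v3 extensions:
            X509v3 Basic Constraints: critical
                CA:TRUE
"""
print(basic_constraints_ca(sample))
```

A result of `True` for the server's leaf certificate corresponds to the situation the `CaUsedAsEndEntity` error describes: a certificate marked as a CA being presented as the end-entity certificate.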

I'm not expecting Microsoft to alter their certificates to make it easier for people to use Polars in Fabric, so some workaround would still be appreciated.

Can anyone tell which underlying module or crate is raising the CaUsedAsEndEntity error? Maybe some setting can be passed to skip the CA vs EE check. (Somehow the Spark-based delta.tables module doesn't seem to be bothered by a CA cert being used as an EE cert.)

ion-elgreco commented 2 months ago

@martroben all storage options are passed to the "object store" crate

martroben commented 2 months ago

Posted an issue/question to object_store repo:

hnasrullakhan commented 2 months ago

Looks like the object store version bump caused this. With older versions of the deltalake library


this works fine

ion-elgreco commented 2 months ago

@hnasrullakhan please make an issue in arrow-rs repo then, there were zero code changes on our side.

hnasrullakhan commented 2 months ago

there was a bump up in object-store version to 0.9.1

ion-elgreco commented 2 months ago

@hnasrullakhan that's correct, but what I am saying is that this didn't require any changes on our side other than bumping the version. So you should open an issue upstream.

martroben commented 1 month ago

Upon further testing, deltalake==0.16.1 works fine, but starting from 0.16.2, I'm getting the error (also tested the latest: 0.17.3).

I think @hnasrullakhan also confirms that - their earlier claim about 0.16.2 working fine was a typo.

I don't see any changes in object_store version between 0.16.1 and 0.16.2 (granted, I don't speak fluent Rust).

Does anyone have any ideas, what else could have introduced this error between these two versions?

ion-elgreco commented 3 weeks ago

@martroben can you check against v0.18?

martroben commented 3 weeks ago

@ion-elgreco, I sure can, but on Monday, when I'm back in the office.

hnasrullakhan commented 3 weeks ago

What changed in v0.18, @ion-elgreco?

hnasrullakhan commented 3 weeks ago

Could still repro with v0.18

martroben commented 3 weeks ago

@ion-elgreco, I confirm @hnasrullakhan's position: the same issue still occurs, even with deltalake==0.18.0: error trying to connect: invalid peer certificate: Other(CaUsedAsEndEntity)

As an aside - I'm trying to push a case with MS support in parallel. Their initial position was that since there is no problem with the older versions of deltalake, it's a 3rd party problem.

I suggested that the root cause is still their improper use of certificates - the 3rd parties might have just tightened the rules about what they find acceptable to work with. Not sure if I'll win this argument though, being a mere mortal.

What can men do against such reckless hate? - Théoden, son of Thengel

martroben commented 2 weeks ago

Apparently Polars is not the only downstream library where delta lake interactions broke around deltalake v0.17. The linked issue does not seem to be related to Fabric certificates however.

Nevertheless, for anyone looking, Daft might be a viable alternative to Polars soon - especially if they implement deletes and merges for delta lake.

ion-elgreco commented 2 weeks ago

> Apparently Polars is not the only downstream library where delta lake interactions broke around deltalake v0.17. The linked issue does not seem to be related to Fabric certificates however.
>
> Nevertheless, for anyone looking, Daft might be a viable alternative to Polars soon - especially if they implement deletes and merges for delta lake.

That issue is not really related, Daft is using our internal methods for their writer. When we make changes in our internal methods this is not marked as a breaking change :)

martroben commented 2 weeks ago

Thank you for the context @ion-elgreco. In that case it is indeed somewhat unfair for them to cite breaking changes in deltalake, when the issue is at least partly caused by their own misjudgment of what is exposed and what is not.

I'm still trying to understand, though, what the exact change between v0.16.1 and v0.16.2 was that altered the behaviour of SSL connections.

martroben commented 5 days ago

Apparently the problem is no longer present in v0.18.1.

Not sure what caused the fix between 0.18.0 and 0.18.1. If I had to guess, it might be bumping object store from 0.9 to 0.10 where object store updated their reqwest dependency. I guess we'll never know, but I'm nonetheless happy.

Microsoft is still using a self-signed CA certificate as the EE certificate in OneLake connections from Fabric. However, I had a call with their support, and the product team has apparently promised to do something about the certificate. Not sure what, though. Hopefully it will not break whatever caused the fix.
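The version reports collected in this thread (0.16.1 works; 0.16.2, 0.17.1, 0.17.3 and 0.18.0 fail; 0.18.1 works) can be summarized in a small check. This is a sketch under the assumption that every release between the reported boundaries behaves the same way, which nobody in the thread tested; the names `parse_version` and `reported_affected` are hypothetical.

```python
from typing import Tuple

# Boundaries reported in this thread; versions in between are assumed, not tested.
FIRST_FAILING = (0, 16, 2)
FIRST_FIXED = (0, 18, 1)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain 'X.Y.Z' release string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def reported_affected(version: str) -> bool:
    """True if the version falls in the range reported as failing in Fabric."""
    return FIRST_FAILING <= parse_version(version) < FIRST_FIXED


for v in ["0.16.1", "0.16.2", "0.17.3", "0.18.0", "0.18.1"]:
    print(v, reported_affected(v))
```

In a notebook, the installed version could be read with `importlib.metadata.version("deltalake")` and passed to `reported_affected` to decide whether the `allow_invalid_certificates` workaround is still needed.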

Josh-Hiz commented 4 days ago

Additionally, I get the following similar error on 0.18.1: OSError: Generic MicrosoftAzure error: Error after 1 retries in 7824.991305s, max_retries:10, retry_timeout:180s, source:error sending request for url ...

This happens when performing a write on a very large table. Should I open a new issue for this? @ion-elgreco