apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.
https://beam.apache.org/
Apache License 2.0
7.88k stars 4.27k forks source link

[Bug]: python flink runner is not compatible with Azure blob file system in Java #24537

Open max-lepikhin opened 1 year ago

max-lepikhin commented 1 year ago

What happened?

  1. Python flink_runner starts beam flink job-server with these parameters:
    def java_arguments(
      self, job_port, artifact_port, expansion_port, artifacts_dir):
    return [
        '--flink-master',
        self._master_url,
        '--artifacts-dir',
        artifacts_dir,
        '--job-port',
        job_port,
        '--artifact-port',
        artifact_port,
        '--expansion-port',
        expansion_port
    ]
  2. When artifacts_dir above is path to a container in Azure blob storage (e.g. "azfs://storage-account/container), the AzureBlobStoreFileSystem attempts to create BlobServiceClient and fails with:
    
    I1206 02:34:29.021346 139656344459008 subprocess_server.py:126] Caused by: java.lang.IllegalArgumentException: Invalid URL format. URL: null
    I1206 02:34:29.021379 139656344459008 subprocess_server.py:126]     at com.azure.storage.blob.BlobUrlParts.parse(BlobUrlParts.java:349)
    I1206 02:34:29.021412 139656344459008 subprocess_server.py:126]     at com.azure.storage.blob.implementation.util.BuilderHelper.httpsValidation(BuilderHelper.java:170)
    I1206 02:34:29.021589 139656344459008 subprocess_server.py:126]     at com.azure.storage.blob.implementation.util.BuilderHelper.buildPipeline(BuilderHelper.java:106)
    I1206 02:34:29.024176 139656344459008 subprocess_server.py:126]     at com.azure.storage.blob.BlobServiceClientBuilder.buildAsyncClient(BlobServiceClientBuilder.java:113)
    I1206 02:34:29.024224 139656344459008 subprocess_server.py:126]     at com.azure.storage.blob.BlobServiceClientBuilder.buildClient(BlobServiceClientBuilder.java:89)
    I1206 02:34:29.024258 139656344459008 subprocess_server.py:126]     at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.base.Suppliers$NonSerializableMemoizingSupplier.get(Suppliers.java:167)
    I1206 02:34:29.024288 139656344459008 subprocess_server.py:126]     at org.apache.beam.sdk.io.azure.blobstore.AzureBlobStoreFileSystem.create(AzureBlobStoreFileSystem.java:271)
    I1206 02:34:29.024316 139656344459008 subprocess_server.py:126]     at org.apache.beam.sdk.io.azure.blobstore.AzureBlobStoreFileSystem.create(AzureBlobStoreFileSystem.java:68)
    I1206 02:34:29.024344 139656344459008 subprocess_server.py:126]     at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:243)
    I1206 02:34:29.024372 139656344459008 subprocess_server.py:126]     at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:230)
    I1206 02:34:29.024399 139656344459008 subprocess_server.py:126]     at org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService$ArtifactDestination.fromFile(ArtifactStagingService.java:140)


The null reference exception occurs because the endpoint to blob service is not configured. It is expected to be passed via "blob_service_endpoint" [option](https://github.com/apache/beam/blob/243128a8fc52798e1b58b0cf1a271d95ee7aa241/sdks/java/io/azure/src/main/java/org/apache/beam/sdk/io/azure/options/BlobstoreOptions.java#L85). Neither the option is passed from flink_runner.py nor it is inferred from the "azfs://storage-account/container-name" file path in AzureBlobStoreFileSystem.java.

### Issue Priority

Priority: 1

### Issue Component

Component: beam-community
kennknowles commented 1 year ago

@chamikaramj didn't we just see another report of this? We can dupe this and give the best advice for users to have access to Azure blob store in the artifact staging for portable runners.

chamikaramj commented 1 year ago

Is this a bug or a feature request ? I'm not sure if we ever supported using arbitrary file-systems as the "artifacts-dir".

max-lepikhin commented 1 year ago

Please treat it either way. We didn't find a way to use python beam with flink on azure and other clouds is not an option for other reasons

On Tue, Dec 13, 2022, 5:57 PM Chamikara Jayalath @.***> wrote:

Is this a bug or a feature request ? I'm not sure if we ever supported using arbitrary file-systems as the "artifacts-dir".

— Reply to this email directly, view it on GitHub https://github.com/apache/beam/issues/24537#issuecomment-1350193934, or unsubscribe https://github.com/notifications/unsubscribe-auth/A2QA5NNLDLWHVWST5MSVU4DWNELRDANCNFSM6AAAAAASU7Y6C4 . You are receiving this because you authored the thread.Message ID: @.***>