gergely-g opened 6 months ago
The workaround looks good for the Python Direct Runner. @tvalentyn
cc: @chamikaramj
The ResolveArtifacts call will unconditionally download the 300MB beam-sdks-java-extensions-sql-expansion-service-2.55.0.jar to a temporary directory for each step.
The downloaded jars should be cached. Perhaps this caching doesn't work in your environment?
You also have the option of manually specifying the jar [1] or manually starting up an expansion service [2].
[1] --beam_services="{\":sdks:java:extensions:sql:expansion-service:shadowJar\": \"$EXPANSION_SERVICE_JAR\"}"
[2] https://beam.apache.org/documentation/sdks/python-multi-language-pipelines/#choose-an-expansion-service
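A sketch of what option [2] can look like (the port is an arbitrary placeholder; `$EXPANSION_SERVICE_JAR` is the jar from option [1]): start the expansion service once, then point each `SqlTransform` at it via the `expansion_service` argument, so the jar is not re-fetched per transform.

```python
# Start the expansion service once in a shell (port 8097 is arbitrary):
#   java -jar $EXPANSION_SERVICE_JAR 8097
from apache_beam.transforms.sql import SqlTransform

# Point the transform at the already-running service instead of letting
# it download and start a fresh one.
tx = SqlTransform(
    'SELECT id FROM PCOLLECTION',
    expansion_service='localhost:8097')
```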
@chamikaramj The cache hit will never be detected for the downloaded JARs because of this line: https://github.com/apache/beam/blob/bb51380f1b29a2b69ab82ef795a8895ebd89f87e/sdks/python/apache_beam/runners/portability/artifact_service.py#L294
It always evaluates to False.
A worse problem, though, is that, as mentioned above, the `ArtifactRetrievalService.ResolveArtifacts()` call takes 1.5s per SQL query even without downloading the actual files.
@robertwb can you check this?
Hi, any news? I have also encountered this exact same issue.
The `and False` part does seem like a bug, but I don't think it actually gets hit, since the Java expansion response serves Beam artifacts as DEFERRED artifacts that are retrieved from the locally available expansion service (so the URN is DEFERRED, not FILE).
The expansion service jar is cached elsewhere when the expansion service is started up, and is served to the Python side using the `ArtifactRetrievalService.ResolveArtifacts()` API. This might be what is adding the O(seconds) per-query delay you are observing, unfortunately.
What happened?
When building a Pipeline with multiple SqlTransforms from Beam Python, the expansion that happens in SqlTransforms is currently (Beam 2.55.0) extremely inefficient.
This inefficiency has multiple sources, detailed below: an artifact cache check that never hits, and the latency of the per-query `ResolveArtifacts()` call. The latter dominates execution time. For example, running Beam on a 4 vCPU, 2 core, 16 GB memory machine (a standard Dataflow workbench setup), a pipeline with 31 trivial SQL transforms takes 200 seconds to execute. (See the example below.)
We found a somewhat dirty workaround to speed things up: skipping the `SqlTransform._resolve_artifacts()` call altogether when working from inside Jupyter. This brings the execution time down from 200s to 22s.
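A minimal sketch of such a monkey-patch (use at your own risk; it assumes the method receives the expanded components as its first positional argument and returns the possibly rewritten components, as in Beam 2.55.0):

```python
from apache_beam.transforms.sql import SqlTransform

def _skip_resolve_artifacts(self, components, *args, **kwargs):
    # Return the expanded components untouched instead of resolving and
    # staging the referenced artifacts, skipping the per-query
    # ResolveArtifacts() round trip entirely.
    return components

SqlTransform._resolve_artifacts = _skip_resolve_artifacts
```

Note this only works because the artifacts served by the expansion service are already available to it locally; skipping resolution on a remote runner setup may break artifact staging.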
I suspect these inefficiencies also contribute to beam_sql being extremely slow even for trivial queries.
apache_beam/runners/portability/artifact_service.py contains a check that might be one of the culprits for this inefficiency: its condition ends in `and False`, so it can never evaluate to True and a cached artifact is never reused.
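A self-contained paraphrase of the pattern (not a verbatim quote; see the line linked above for the exact code):

```python
import os

def maybe_reuse_cached(path):
    # Paraphrase of the defeated cache check: the trailing `and False`
    # short-circuits the whole condition, so the cached-file branch is
    # unreachable and the artifact is always re-fetched.
    return os.path.exists(path) and False  # always False

print(maybe_reuse_cached(__file__))  # False, even though the file exists
```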
In addition, once the ExpansionService is cached, the actual SQL expansion takes only 100-200ms, but the `ArtifactRetrievalService.ResolveArtifacts()` call takes 1.5s per SQL query even without downloading the actual files. This dominates the expansion time, which in turn dominates the overall time of launching and running a pipeline. So the hotspot call sequence is roughly:
`SqlTransform.expand()` -> `ExternalTransform.expand()` -> `ArtifactRetrievalService.ResolveArtifacts()`
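One way to confirm where construction time goes is to profile the expansion of a single transform and filter for artifact-related frames (a sketch, assuming a local Java runtime so the expansion service can start):

```python
import cProfile
import pstats

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

def build():
    # Pipeline construction alone triggers expansion and artifact
    # resolution for the SqlTransform; no run() needed.
    p = beam.Pipeline()
    rows = p | beam.Create([beam.Row(id=1)])
    _ = rows | SqlTransform('SELECT id FROM PCOLLECTION')

cProfile.run('build()', 'expand.prof')
pstats.Stats('expand.prof').sort_stats('cumulative').print_stats('artifact')
```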
The times may not sound like much, but the latency is bad enough to ruin the Jupyter REPL experience when combining Python + SQL.
Code to reproduce the issue and demonstrate the workaround:
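A minimal sketch along the lines described above (not the original attachment; assumes a local Java runtime for the expansion service):

```python
import time

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

start = time.time()
with beam.Pipeline() as p:
    rows = p | beam.Create([beam.Row(id=1, name='a')])
    # Chain 31 trivial SQL transforms; each triggers its own expansion
    # and ResolveArtifacts() round trip during construction.
    for i in range(31):
        rows = rows | f'sql_{i}' >> SqlTransform(
            'SELECT id, name FROM PCOLLECTION')
print(f'Built and ran in {time.time() - start:.1f}s')
```

Applying the `_resolve_artifacts` monkey-patch shown earlier before building the pipeline demonstrates the speedup.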
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components