Open ballooncross opened 1 year ago
Label examples-java cannot be managed because it does not exist in the repo. Please check your spelling.
The issue here is that Latest.globally() is a separately triggered global Combine, distinct from the global Combine contained in View.asSingleton. Thus, if Latest triggers multiple times, the subsequent singleton combine may observe more than one element when firing and fail.
See https://www.mail-archive.com/user@beam.apache.org/msg02129.html
A workaround is to use View.asIterable and take the last element of the iterable when consuming the side input. There will be more than 1 element only if Latest triggers multiple times before the side input combine processes the output.
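The consuming side of this workaround can be sketched in plain Java. This is a minimal illustration, not Beam code: the class and method names are hypothetical, and it assumes the iteration order of the side input matches what "last" is meant to be, which the thread does not confirm.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Optional;

// Hypothetical helper: picks the last element of an Iterable, as the
// workaround suggests doing with a View.asIterable() side input.
class LastOfIterable {
    /** Returns the last element seen while iterating, or empty if there is none. */
    static <T> Optional<T> last(Iterable<T> values) {
        T last = null;
        for (T value : values) {
            last = value;
        }
        return Optional.ofNullable(last);
    }

    public static void main(String[] args) {
        // Two firings arrived before the side input was read: keep the last one.
        System.out.println(last(Arrays.asList("config-v1", "config-v2")).orElse("none"));
        // No firing yet: the caller decides on a fallback.
        System.out.println(last(Collections.<String>emptyList()).orElse("none"));
    }
}
```

Inside a DoFn, the helper would be applied to the value returned by c.sideInput(view), where view was built with View.asIterable() instead of View.asSingleton().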
@kennknowles What do you think about adding a new transform View.asLatest() that is logically the same as Latest.globally() + View.asSingleton() but is implemented with a single combine and thus the side input view will always be a single latest value each time it is calculated?
Just skimming this last comment, it sounds like this is related to the impetus for https://s.apache.org/beam-triggered-side-inputs
Right now we have a combination of
I am totally happy with a View.latest(). In my doc I propose that as the semantics for View.asSingleton :-)
If you can implement it without runner changes, or have the bandwidth to make the necessary runner changes, I would just replace View.asSingleton with that. Otherwise, having it as a stopgap until View.asSingleton can be adjusted is great.
@ballooncross Looking at your code, I can suggest an alternative approach: use a custom global combiner that eliminates the subsequent Latest.globally() and View.asSingleton() and does everything in one step.
PCollectionView<Map<String, String>> map = pipeline
    .apply("Impulse", GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1)))
    // GenerateSequence emits Long values, so the window is typed on Long.
    .apply(Window.<Long>into(new GlobalWindows())
        .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
        .discardingFiredPanes())
    .apply(Combine.globally(new MaxImpulseFn<>(functionToRead, coderOfFunctionOutput))
        .withoutDefaults()
        .asSingletonView());
MaxImpulseFn is a custom combiner (even though quite simple) that finds the max impulse and reads the external data in extractOutput:
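private static class MaxImpulseFn<T> extends CombineWithContext.CombineFnWithContext<Long, Long, T> {
    // Fields, constructor, createAccumulator, addInput and mergeAccumulators
    // (which keep the maximum impulse) are omitted from this excerpt.

    @Override
    public T extractOutput(Long accumulator, Context context) {
        // Read the external data once per firing, keyed by the latest impulse.
        return functionToRead.apply(accumulator, context.getPipelineOptions());
    }

    @Override
    public Coder<T> getDefaultOutputCoder(CoderRegistry registry, Coder<Long> inputCoder) {
        return coderOfFunctionOutput;
    }
}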
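The comment above shows only extractOutput. The remaining steps of a "max impulse" combiner can be sketched in plain Java as follows. This is an assumption-laden illustration rather than the author's code: MaxImpulseSketch is a hypothetical class that does not extend any Beam type, and its one-argument reader stands in for functionToRead, which in the original also receives the pipeline options.

```java
import java.util.Arrays;
import java.util.function.Function;

// Plain-Java stand-in for the four combiner steps; not a Beam class.
class MaxImpulseSketch<T> {
    // Hypothetical stand-in for functionToRead (the external lookup).
    private final Function<Long, T> reader;

    MaxImpulseSketch(Function<Long, T> reader) {
        this.reader = reader;
    }

    // Start below every possible impulse value.
    Long createAccumulator() {
        return Long.MIN_VALUE;
    }

    // Keep the larger of the running maximum and the new impulse.
    Long addInput(Long accumulator, Long input) {
        return Math.max(accumulator, input);
    }

    // Partial maxima from different workers collapse into a single maximum.
    Long mergeAccumulators(Iterable<Long> accumulators) {
        Long merged = createAccumulator();
        for (Long accumulator : accumulators) {
            merged = Math.max(merged, accumulator);
        }
        return merged;
    }

    // One lookup per firing, so each firing yields exactly one output value.
    T extractOutput(Long accumulator) {
        return reader.apply(accumulator);
    }

    public static void main(String[] args) {
        MaxImpulseSketch<String> fn = new MaxImpulseSketch<>(i -> "config@" + i);
        Long a = fn.addInput(fn.addInput(fn.createAccumulator(), 3L), 7L);
        Long b = fn.addInput(fn.createAccumulator(), 5L);
        System.out.println(fn.extractOutput(fn.mergeAccumulators(Arrays.asList(a, b))));
    }
}
```

Because the maximum and the lookup happen inside one combine, there is no second, separately triggered combine that could see more than one element.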
Hi, is there a plan to continue working on this? Like the reporter of this issue, we are trying to use the side inputs described in the documentation at https://beam.apache.org/documentation/patterns/side-inputs/ on Google Dataflow, and we get the same "java.lang.IllegalArgumentException: ...." after the job has been running for some time, in the transform that uses the side input. Confusingly, it occurs only in our Dataflow jobs that use several workers in parallel.
Can someone clarify what the plans to make this usable are?
Kind regards, Ivan Fröhlich SAP SE
@IvanFroehlich are the workarounds working for you?
@liferoad No, that's the issue. In the meantime we have tried the following:
v2: replace Latest.globally() with a custom combine function
...
pipeline
    .apply(GenerateSequence.from(0)
        .withRate(1, Duration.standardSeconds(envConfig.getConfigTtlInSeconds())))
    .apply(ParDo.of(doFunction)) // here we read SideInputDestinations (external data)
    .apply(
        Window.<SideInputDestinations>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
    .apply(Combine.globally(new CombineSerializableFunction()))
    .apply(View.asSingleton());
...
static class CombineSerializableFunction implements
    SerializableFunction<Iterable<SideInputDestinations>, SideInputDestinations> {
    @Override
    public @UnknownKeyFor @Nullable @Initialized SideInputDestinations apply(
        Iterable<SideInputDestinations> input) {
        SideInputDestinations last = null;
        Iterator<SideInputDestinations> iterator = input.iterator();
        while (iterator.hasNext()) {
            last = iterator.next();
            log.info("SideInput: processing iteration for object {}", last.hashCode());
        }
        return last;
    }
}
=> same Exception
v3: replace the View.asSingleton() with asSingletonView():
...
pipeline
    .apply(GenerateSequence.from(0)
        .withRate(1, Duration.standardSeconds(envConfig.getConfigTtlInSeconds())))
    .apply(ParDo.of(doFunction)) // here we read SideInputDestinations (external data)
    .apply(
        Window.<SideInputDestinations>into(new GlobalWindows())
            .triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
            .discardingFiredPanes())
    .apply(Combine.globally(new CombineSerializableFunction()).asSingletonView());
...
static class CombineSerializableFunction implements
    SerializableFunction<Iterable<SideInputDestinations>, SideInputDestinations> {
    @Override
    public @UnknownKeyFor @Nullable @Initialized SideInputDestinations apply(
        Iterable<SideInputDestinations> input) {
        SideInputDestinations last = null;
        Iterator<SideInputDestinations> iterator = input.iterator();
        while (iterator.hasNext()) {
            last = iterator.next();
            log.info("SideInput: processing iteration for object {}", last.hashCode());
        }
        return last;
    }
}
=> same exception in the ParDo that uses the side input:
Caused by: java.lang.IllegalArgumentException: PCollection with more than one element accessed as a singleton view.
Should we try another version?
Kind Regards, Ivan
I believe the workaround is "use View.asIterable and take the last element of the iterable when consuming the side input."
Hi @IvanFroehlich. Have you tried the solution I posted above?
What happened?
I am writing a pipeline to consume messages from Pub/Sub, do some validation, transform them, and sink them to BigQuery.
I need to load some data from an external API call for use in the pipeline's validation stage, so I followed "Slowly updating global window side inputs" to load the config into a side input. However, I keep getting an error, even though I already applied Latest.globally():
Going through online resources doesn't really help. However, I found that this happens only when the impulse duration is short, e.g. < 5s: GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L))
Is this expected?
This bothers me because I am not sure whether the short period is the root cause, or whether the error will show up again when pipeline traffic becomes large, even with a one-minute impulse period in the production environment.
Does anyone have the same issue, or a suggestion?
Beam version: 2.46.0
Here is a simplified version of my code that can reproduce the error:
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components