GoogleCloudPlatform / dataproc-pubsub-spark-streaming

Apache License 2.0

no supported spark driver for datastore #3

Open reactivedev opened 6 years ago

reactivedev commented 6 years ago

Dear Google,

I have been a GCP user for the past 6 months and I would like to take this opportunity to report my agony. PLEASE DO NOT FOOL DEVELOPERS WITH FALSE EXAMPLES! Google doesn't provide a supported Spark driver for either Pub/Sub or Datastore. It's a shame. Even worse are the following lines of code:

def saveRDDtoDataStore(tags: Array[Popularity], windowLength: Int): Unit

Please read the function name: it says "saveRDD", yet it accepts an array. This is called cheating.

Even worse:

sortedHashtags.foreachRDD(rdd => {
    handler(rdd.take(n)) //take top N hashtags and save to external source
})

Do you know the consequences of using take? Are you a Spark developer?
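For context, `rdd.take(n)` returns an `Array` to the driver, so the handler in the snippet above runs on the driver and not on the executors. A minimal sketch of the alternative this complaint seems to be asking for, writing each partition from the executor that holds it, might look like the following. The `Popularity` fields, the `writeBatch` sink, and the `batchSize` value are hypothetical stand-ins and are not taken from the sample repository; note also that this saves the whole RDD, which is a different operation from saving only the top N.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type standing in for the sample's Popularity class.
case class Popularity(tag: String, amount: Int)

object SaveFromExecutors {
  // Hypothetical sink: replace the body with a real Datastore (or other) client call.
  def writeBatch(batch: Seq[Popularity]): Unit =
    batch.foreach(p => println(s"${p.tag}: ${p.amount}"))

  // Writes every partition from the executor that holds it;
  // nothing is collected on the driver.
  def saveRDD(rdd: RDD[Popularity], batchSize: Int = 500): Unit =
    rdd.foreachPartition { records =>
      // Group the partition's iterator into batches (batch size is an assumption).
      records.grouped(batchSize).foreach(batch => writeBatch(batch))
    }

  // Applies saveRDD to every micro-batch of the stream.
  def saveStream(stream: DStream[Popularity]): Unit =
    stream.foreachRDD(rdd => saveRDD(rdd))
}
```

With this shape the function name and the parameter type agree: `saveRDD` really takes an `RDD`.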

I had to go to great lengths to ensure I don't ack (Pub/Sub) messages before I process my records. I had to resort to a sub-optimal plan B (broadcast variables) when the Datastore driver didn't support stream joins.
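The ordering described above (process first, acknowledge afterwards) can be sketched in plain Scala as follows. This is only an illustration: `Message` and its `ack` callback are hypothetical stand-ins and not the Pub/Sub client API.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical message wrapper: `ack` stands in for whatever
// acknowledgement call the Pub/Sub client actually exposes.
final case class Message(id: String, payload: String, ack: () => Unit)

object AckAfterProcessing {
  // Runs `process` first and acknowledges only on success, so a message
  // whose processing failed stays unacknowledged and can be redelivered.
  // Returns true if the message was processed and acknowledged.
  def handle(msg: Message)(process: String => Unit): Boolean =
    Try(process(msg.payload)) match {
      case Success(_) =>
        msg.ack()
        true
      case Failure(_) =>
        false
    }
}
```

The point of the ordering is that a crash between receiving and processing leaves the message unacknowledged, at the cost of possible duplicate processing if the crash happens between processing and the ack.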

It's a fact that you want to capture your big clients by forcing them to use proprietary software like gRPC and Cloud Dataflow by not providing proper drivers for Spark. Why beat around the bush?

What a shame! Remember "DON'T BE EVIL"? This is evil.

theacodes commented 6 years ago

@jphalip, @texasmichelle, @holdenk can you take a look at this bug report?