actionml / template-scala-parallel-universal-recommendation


LEventStore timeout is caught instead of propagated #28

Closed: jimlyndon closed this issue 7 years ago

jimlyndon commented 7 years ago

I've noticed that it's very easy to hit the timeout on HBase/the event store under load. Obviously we can scale the datastore, but the code should really error out at this point. Instead, the system logs the timeout and returns an empty sequence: https://github.com/actionml/template-scala-parallel-universal-recommendation/blob/master/src/main/scala/URAlgorithm.scala#L714
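For reference, the pattern at that line looks roughly like the following. This is a paraphrased sketch, not the exact template code: the method name, the 200 ms timeout, and the slf4j logger are assumptions, and the imports assume the Apache PredictionIO artifacts.

```scala
import org.apache.predictionio.data.storage.Event
import org.apache.predictionio.data.store.LEventStore
import org.slf4j.LoggerFactory
import scala.concurrent.TimeoutException
import scala.concurrent.duration.Duration

object CurrentBehaviorSketch {
  private val logger = LoggerFactory.getLogger(getClass)

  // Sketch of the catch-and-swallow pattern: on timeout the method logs and
  // returns an empty Seq, so callers cannot distinguish "this user has no
  // events" from "the event store did not answer in time".
  def recentUserEvents(appName: String, user: String, eventNames: Seq[String]): Seq[Event] =
    try {
      LEventStore.findByEntity(
        appName = appName,
        entityType = "user",
        entityId = user,
        eventNames = Some(eventNames),
        latest = true,
        timeout = Duration(200, "millis")
      ).toList
    } catch {
      case e: TimeoutException =>
        logger.error(s"Timeout when reading recent events; an empty list is used. $e")
        Seq.empty[Event] // downstream this looks like a user with no history
    }
}
```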

This empty sequence causes the algorithm to return popular recs (if that backfill is turned on) or nothing at all. Is this intentional? I can see how it might seem like a good idea to cover a timeout with popular recs, but that should really be determined by the client, or the client should at least be informed that this has happened (so it can treat those recs with caution, etc.).

Otherwise, this should be fixed so that the system simply fails appropriately, passing the proper HTTP error code to the client. That lets the client avoid using incorrect/incomplete recommendations, fall back to a previously cached result, etc.
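One possible shape of the fix is simply to let the timeout escape (or rethrow it with context) so the query fails instead of silently degrading. A minimal sketch, assuming the serving layer turns an uncaught exception from the query path into an HTTP error; `failFast` is a hypothetical helper, not part of the template:

```scala
import scala.concurrent.TimeoutException

// Hypothetical helper: run an event-store read and let a timeout escape
// with context, instead of substituting an empty result.
def failFast[A](what: String)(read: => A): A =
  try read catch {
    case e: TimeoutException =>
      throw new RuntimeException(s"Event store timed out while $what", e)
  }
```

Wrapped around the `findByEntity` call above, a timeout would then surface to the PredictionServer as a failed query rather than an empty event list.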

Thoughts?

jimlyndon commented 7 years ago

BTW, if you're in agreement, I can put together a PR for it.

pferrel commented 7 years ago

Would love a fix, and I agree it seems like it should be a REST/HTTP error passed back through the PredictionServer.

If you make it against my private repo, I can test it. We also need a way to set the timeout, though maybe that's asking too much. The timeout should be different for training, where it is set high, and for queries, where we give up more quickly; something like the sketch below.
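A hedged sketch of what that could look like; the parameter names (`queryTimeoutMillis`, `trainTimeoutMillis`) and the default values are made up for illustration, not existing engine.json fields:

```scala
import scala.concurrent.duration.{Duration, MILLISECONDS}

// Hypothetical additions to the algorithm params, read from engine.json.
case class EventStoreTimeouts(
  queryTimeoutMillis: Option[Long] = None, // query time: give up quickly
  trainTimeoutMillis: Option[Long] = None  // training: allow a long wait
) {
  // Default values are assumptions: 200 ms for queries, 60 s for training.
  def queryTimeout: Duration = Duration(queryTimeoutMillis.getOrElse(200L), MILLISECONDS)
  def trainTimeout: Duration = Duration(trainTimeoutMillis.getOrElse(60000L), MILLISECONDS)
}
```

The query-side value would go to the `timeout` argument of the LEventStore call shown earlier; where the training-side value plugs in depends on how the training reads are done.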

Anyway my private repo is here: https://github.com/pferrel/template-scala-parallel-universal-recommendation.git