OryxProject / oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
http://oryx.io
Apache License 2.0

Customer ID update #339

Closed by cimox 3 years ago

cimox commented 7 years ago

I would love to have the possibility to update a customer ID via the REST API. What do you think, is it possible to implement? If so, could you point me to where to start?

The reason I want this is that we merge some customers over time. So in the current setup, one customer can be recorded in the model several times. It would be much better for the model to know it's the same customer ID.

srowen commented 7 years ago

Do you mean merge IDs? That's a common need, yeah, though it ends up being hard to implement. The model doesn't know about the individual users' data, so can't really delete the old data and add the new data.

The batch layer would have this data and could be told to merge IDs. It can't rewrite old data, but could retain some rewrite rules.

Then it becomes necessary to define some channel outside the normal data path to supply these mappings and maintain them. And there are questions about how that data gets updated from the API, how it's aged, and how multiple remappings are supported.
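To make the "multiple remappings" question concrete, here is a minimal sketch of a rewrite-rule table that follows chains such as X -> Y -> Z. It is not part of Oryx; the `IdRemapper` class and its method names are hypothetical, and aging of rules is left out.

```python
class IdRemapper:
    """Hypothetical rewrite-rule table: maps merged (old) IDs to surviving IDs."""

    def __init__(self):
        # old ID -> ID it was merged into
        self._rules = {}

    def resolve(self, user_id):
        """Follow the chain of rules until reaching an ID with no further rule."""
        path = []
        while user_id in self._rules:
            path.append(user_id)
            user_id = self._rules[user_id]
        # Point every ID seen on the way directly at the final ID,
        # so later lookups take a single step.
        for seen in path:
            self._rules[seen] = user_id
        return user_id

    def merge(self, old_id, new_id):
        """Record that old_id has been merged into new_id."""
        old_root = self.resolve(old_id)
        new_root = self.resolve(new_id)
        # If both already resolve to the same ID, adding a rule would
        # create a cycle, so this is a no-op.
        if old_root != new_root:
            self._rules[old_root] = new_root


remapper = IdRemapper()
remapper.merge("X", "Y")
remapper.merge("Y", "Z")
```

With these two rules, `remapper.resolve("X")` returns `"Z"`, and an ID that was never merged resolves to itself.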

Because of the moderate complexity, I have historically said it's up to the app layer to manage this. If user X merges with Y, then leave old user X (it will age out eventually) and post all of X's data to Y, or as much as you know.
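A rough sketch of that app-layer approach, under two assumptions: the application keeps its own history of interactions as `(user, item, strength)` tuples, and input is sent as `user,item,strength` CSV lines (check the input format your Oryx app actually expects). The `replay_as` helper is hypothetical, and sending the lines to the serving layer is not shown.

```python
def replay_as(events, old_id, new_id):
    """Yield CSV lines that re-attribute old_id's recorded events to new_id.

    `events` is the application's own history, assumed to be an iterable
    of (user, item, strength) tuples. Events of other users are skipped.
    """
    for user, item, strength in events:
        if user == old_id:
            yield f"{new_id},{item},{strength}"


# Hypothetical history kept by the application.
history = [
    ("X", "i1", 1.0),
    ("Y", "i2", 2.0),
    ("X", "i3", 0.5),
]

# X was merged into Y: build the lines to post on Y's behalf.
lines = list(replay_as(history, "X", "Y"))
```

Here `lines` holds only X's two events, rewritten under Y. Old user X is left untouched and ages out of the model on its own.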

I'm open to better ideas.

cimox commented 7 years ago

Yes. I mean merge IDs.

Everything you wrote there makes perfect sense. I initially thought it would be easy to implement. Now I think it will be easier to keep a lookup table in the application layer and do it the way you described, posting all of X's data to Y.