OryxProject / oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
http://oryx.io
Apache License 2.0
1.79k stars 405 forks source link

Trouble handling huge data #351

Closed stiv-yakovenko closed 6 years ago

stiv-yakovenko commented 6 years ago

After uploading huge data (4G) two days ago, oryx doesn't accept new data (even small data) any more.

image

image

It is not able to find user u0 in the database. I see no crashes in logs, java process is calculating something, this is tail of oryx-batch:

image

ep 11 11:31:51 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:51 INFO  TaskSetManager:54 - Starting task 23.0 in stage 6689.0
Sep 11 11:31:51 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:51 INFO  TaskSetManager:54 - Finished task 22.0 in stage 6689.0
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  TaskSetManager:54 - Finished task 23.0 in stage 6689.0
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  YarnScheduler:54 - Removed TaskSet 6689.0, whose tasks
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - ResultStage 6689 (sortBy at ALSUpdat
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Job 5053 finished: sortBy at ALSUpda
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  SparkContext:54 - Starting job: zipWithIndex at ALSUpd
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Registering RDD 4254 (sortBy at ALSU
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Got job 5054 (zipWithIndex at ALSUpd
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Final stage: ResultStage 6692 (zipWi
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Parents of final stage: List(Shuffle
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Missing parents: List(ShuffleMapStag
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Submitting ShuffleMapStage 6691 (Map
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  MemoryStore:54 - Block broadcast_5382 stored as values
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  MemoryStore:54 - Block broadcast_5382_piece0 stored as
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  BlockManagerInfo:54 - Added broadcast_5382_piece0 in m
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  SparkContext:54 - Created broadcast 5382 from broadcas
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  DAGScheduler:54 - Submitting 24 missing tasks from Shu
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  YarnScheduler:54 - Adding task set 6691.0 with 24 task
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  TaskSetManager:54 - Starting task 0.0 in stage 6691.0
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  BlockManagerInfo:54 - Added broadcast_5382_piece0 in m
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  TaskSetManager:54 - Starting task 1.0 in stage 6691.0
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  TaskSetManager:54 - Finished task 0.0 in stage 6691.0
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  TaskSetManager:54 - Starting task 2.0 in stage 6691.0
Sep 11 11:31:52 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:52 INFO  TaskSetManager:54 - Finished task 1.0 in stage 6691.0
Sep 11 11:31:53 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:53 INFO  TaskSetManager:54 - Starting task 3.0 in stage 6691.0
Sep 11 11:31:53 htm-psycho-401.zxz.su bash[5041]: 2018-09-11 11:31:53 INFO  TaskSetManager:54 - Finished task 2.0 in stage 6691.0

This is tail of oryx-speed:

Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59497
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59498
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59480
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59492
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59479
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59520
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59509
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59522
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59518
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59493
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59503
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  BlockManagerInfo:54 - Removed broadcast_2379_piece0 on
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  BlockManagerInfo:54 - Removed broadcast_2379_piece0 on
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59495
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59502
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59481
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  BlockManagerInfo:54 - Removed broadcast_2380_piece0 on
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  BlockManagerInfo:54 - Removed broadcast_2380_piece0 on
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59511
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59500
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59507
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59517
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59496
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59505
Sep 11 11:30:03 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:03 INFO  ContextCleaner:54 - Cleaned accumulator 59483
Sep 11 11:30:14 htm-psycho-401.zxz.su bash[6157]: 2018-09-11 11:30:14 INFO  ALSSpeedModelManager:97 - ALSSpeedModel[features:10, i

Can oryx work incrementally? I.e. I am /ingesting data block by block and expect recommendations to update. Why is not u0 added immediately (waiting 300seconds doesn't help)?

stiv-yakovenko commented 6 years ago

Clarification: after waiting 10 minutes more I do get responce from oryx, but its irrelevant:

image

From input data we can see that it is supposed to include i1 and i2, but it doesn't. The server is still calculating something. How long should it take for 100K entries in the database?

stiv-yakovenko commented 6 years ago

One more update, finally computation on server seems to have finished (no CPU usage), but recommendations are still irrelevant:

image

srowen commented 6 years ago

I don't see the problem in the ingest. The HTTP status is 204, which is correct.

You wouldn't expect to see a new user in that data return recommendations immediately; it would appear after the speed layer had run and pushed updates to the model. There are also corner cases where no recommendations can be made (i.e. items are unknown too).

It looks like the update jobs are still running or something. I'm not sure how much data you have or what you're running on, but if you're trying to put all of this in one small VM it might take a long time. You need to look at the Spark UI to answer those questions.

If a user has interacted with items, they are excluded from recommendations. Right? that's expected.

I can't say whether the recommendations are relevant or not. They are according to the model. You may need to tune the model. For example if rank is too low it will underfit.