eiffel-community / eiffel-intelligence

Eiffel Intelligence is a real time data aggregation and analysis solution for Eiffel events.
Apache License 2.0

Weaker performance on newer releases #534

Closed: domis322 closed this issue 3 months ago

domis322 commented 2 years ago

Description

Hi, we have been using version 3.0.0 for quite a long time and decided to try upgrading to version 3.2.4. Everything seemed fine, but the performance is much weaker: it seems to be around 4-5 times slower than 3.0.0. I saw that this could be because of the change to use only one thread pool, and the configuration parameters don't seem to help much. It looks like I have tried everything at this point.

I am not saying that it should be brought back to the previous implementation, but the current one isn't good either.

Motivation

The performance is quite bad, and because of that we have to stay on version 3.0.0.

Benefits

Faster processing, and possibly more areas where it could be used.

Possible Drawbacks

Possibly using more threads in total.

m-linner-ericsson commented 2 years ago

Hi @domis322. In #475 the processing was changed due to problems with unlimited thread creation. I hope you are aware that you can change the number of threads EI should use via the application properties (see https://github.com/eiffel-community/eiffel-intelligence/blob/master/src/main/resources/application.properties#L80). You could try increasing these settings to see if the performance improves.

e-backmark-ericsson commented 2 years ago

@domis322, out of curiosity, could you share what organization you work for? It would be interesting to hear about your use case for Eiffel Intelligence.

domis322 commented 2 years ago

@m-linner-ericsson yes, I have experimented with them quite a bit, but they do not seem to affect the performance much. Even when turned up to insane numbers, it still looks like it's limited by something. The number of PIDs increases, but it's not processing any faster.

RajuBeemreddy1 commented 2 years ago

Hi @domis322, could you please provide more details about the previous version that you used? In the description, you said that you observed the single thread pool in the latest version (3.2.4), but the earlier version (3.0.0) also uses a single thread pool.

We have not observed much deviation in performance between the two versions (2.2.4 and 3.2.4). Could you provide the values you used for the following parameters in the application where you saw the performance deviation: threads.core.pool.size, threads.queue.capacity and threads.max.pool.size?

domis322 commented 2 years ago

@RajuBeemreddy1 I thought this was only changed in version 3.1.0, as per the release notes.

[image]

I just noticed that the Docker image we are using (below) seems to be even older than some 2.x.x versions. I guess the images on Docker Hub have been somewhat abandoned, but this is the one we're using. I believe it still has the older threading implementation.

[image]

Tested both versions with these settings at first, since this is what we were using:

threads.core.pool.size=200
threads.queue.capacity=7000
threads.max.pool.size=250
scheduled.threadpool.size=200

After noticing that the newer version was slower, I tried increasing everything proportionally, up to 10 times, but that didn't seem to increase the performance.
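Raising all three pool settings together may change less than expected. The standard java.util.concurrent.ThreadPoolExecutor rules show why. The sketch below assumes, without confirmation from this thread, that EI's threads.* properties feed an executor that follows these rules. Threads are created up to the core size first, then tasks are queued, and threads beyond the core size are added only once the queue is full. With threads.core.pool.size=200, threads.queue.capacity=7000 and threads.max.pool.size=250, the 50 extra threads would only start after 7000 events were already waiting. The class name and the small pool sizes are made up for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolGrowthDemo {
    // Submits `tasks` tasks that block until released, then returns
    // {threads in pool, tasks waiting in queue, tasks rejected}.
    static int[] observe(int core, int max, int queueCapacity, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>(queueCapacity));
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy so it never drains the queue
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // queue full and max pool size reached (default AbortPolicy)
            }
        }
        int[] snapshot = {pool.getPoolSize(), pool.getQueue().size(), rejected};
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return snapshot;
    }

    public static void main(String[] args) {
        // Illustrative sizes: core=2, max=4, queue=3.
        for (int tasks = 1; tasks <= 8; tasks++) {
            int[] s = observe(2, 4, 3, tasks);
            System.out.printf("tasks=%d threads=%d queued=%d rejected=%d%n", tasks, s[0], s[1], s[2]);
        }
    }
}
```

With these sizes, the pool stays at 2 threads while up to 3 tasks queue up. Only the 6th and 7th tasks start extra threads, and the 8th is rejected.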

jainadc9 commented 1 year ago

Hi @domis322 ,

With the configuration below:
threads.core.pool.size: 100
threads.queue.capacity: 5000
threads.max.pool.size: 150
scheduled.threadpool.size: 100

I have checked both versions, 3.0.0 and 3.2.4, using the same subscriptions as in the mail, with the following scenarios, and the performance was the same for both versions:

Regards, Jainad

e-backmark-ericsson commented 1 year ago

@domis322, Jainad has performed the tests as you can see above but has not been able to reproduce the loss of performance. Do you think there could be some other settings you've changed that could cause these issues? Are there other environment settings we should take into consideration? Could there be network issues?

domis322 commented 1 year ago

Hey, thanks for your replies. I would say networking is very unlikely to be the problem here, since everything runs on the same node in our test server. I think we have a mostly default setup for the other environment settings.

Is there a way to verify that the application is picking up an environment file?
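One way to check is to see whether the values could come from environment variables. This sketch assumes, without confirmation from this thread, that EI reads these settings through Spring Boot's externalized configuration, which the linked application.properties suggests. Under Spring Boot's relaxed binding, an environment variable overrides a property when its name is the property name uppercased, with dots replaced by underscores and dashes removed. OS environment variables take precedence over application.properties. The class below is hypothetical. It only inspects the environment of the JVM it runs in, so it must run in the same container as EI, and it does not see -D system properties or command-line arguments.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class EnvOverrideCheck {
    // Spring Boot canonical property name -> environment variable name:
    // replace '.' with '_', drop '-', then uppercase.
    static String toEnvVarName(String property) {
        return property.replace('.', '_').replace("-", "").toUpperCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        List<String> properties = List.of(
                "threads.core.pool.size",
                "threads.queue.capacity",
                "threads.max.pool.size",
                "scheduled.threadpool.size");
        Map<String, String> env = System.getenv();
        for (String property : properties) {
            String name = toEnvVarName(property);
            String value = env.get(name);
            // Prints the environment value if one is set for this process.
            System.out.println(property + " <- " + name + " = " + (value != null ? value : "(not set)"));
        }
    }
}
```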

m-linner-ericsson commented 1 year ago

Maybe a stupid question: Which aggregation rule are we talking about (just making sure that we are talking about the same)?

jainadc9 commented 1 year ago

> Hey, thanks for your replies. I would say networking is very unlikely to be the problem here, since everything runs on the same node in our test server. I think we have a mostly default setup for the other environment settings.
>
> Is there a way to verify that the application is picking up an environment file?

Hi @domis322, we meant whether your environment or setup could be the cause, for example network or firewall issues that could have degraded the performance.

jainadc9 commented 1 year ago

> Maybe a stupid question: Which aggregation rule are we talking about (just making sure that we are talking about the same)?

As shared in the mail, AllEventRules is the ruleset, and the subscription is for the condition meta.type == SourceChangeCreatedEvent.

m-linner-ericsson commented 1 year ago

> > Maybe a stupid question: Which aggregation rule are we talking about (just making sure that we are talking about the same)?
>
> As shared in the mail, AllEventRules is the ruleset, and the subscription is for the condition meta.type == SourceChangeCreatedEvent.

Is it possible to provide more information in this ticket rather than referring to a mail? Traceability would improve with the added information (maybe not for now, but for future reference).

jainadc9 commented 1 year ago

Uploaded the mailed information: Re _Minutes_of_meeting;_Discussion_of_Eiffel_issue_534.zip

m-linner-ericsson commented 1 year ago

@jainadc9 Are you aware that some of the files are empty?

domis322 commented 1 year ago

> > Hey, thanks for your replies. I would say networking is very unlikely to be the problem here, since everything runs on the same node in our test server. I think we have a mostly default setup for the other environment settings. Is there a way to verify that the application is picking up an environment file?
>
> Hi @domis322, we meant whether your environment or setup could be the cause, for example network or firewall issues that could have degraded the performance.

The environment was exactly the same for both versions. In fact, I tested them both on the same server, with RabbitMQ and EI in separate Docker containers on the same machine. I will go back and look into it a bit more. The only reason I can think of for it to behave differently is that it might not be picking up the environment file, or maybe some of the properties have changed?

jainadc9 commented 1 year ago

No, @domis322, nothing has changed. We are following the same rules, subscriptions and events referred to in the wiki: https://github.com/eiffel-community/eiffel-intelligence/blob/master/wiki/templates.md. We used the default application properties, and no particular environment file or setup is being used.

e-backmark-ericsson commented 1 year ago

@jainadc9 / @domis322, are either of you looking into this issue at the moment? What do you expect should happen next to make it progress?

domis322 commented 1 year ago

I reran all of the tests I had done before, plus some more, and it still looks like it has the same performance issue. I don't really see what could be changed at this point. Could you specify how you generated the events and what consumption numbers you are getting?

Version 3.0.0: [image]

Version 3.2.4 (second half of the timeframe): [image]

z-sztrom commented 1 year ago

I tried to reproduce the issue. I compared version 3.0.0 with 3.2.6 and did not observe any performance degradation. I observed the opposite result: version 3.2.6 was about 4% faster than 3.0.0.

e-backmark-ericsson commented 1 year ago

This issue has been around for too long now without reaching any conclusion through its comments. Should we book a meeting to try to sort out any differences in environment setup or event generation or the like, so we could get to the bottom of this? Would that be ok with you @z-sztrom and @domis322?

z-sztrom commented 1 year ago

@e-backmark-ericsson, yes, let's meet and discuss why the results differ so significantly.

domis322 commented 1 year ago

Yeah, we can definitely do that!

e-backmark-ericsson commented 1 year ago

Notes from today's meeting.

The AllEvents ruleset is used. @domis322 uses dedicated Docker images for each version, while @z-sztrom uses the same image but replaces the war file.

Actions:

e-backmark-ericsson commented 1 year ago

This is the Dockerfile that has been used to create the latest EI images that are pushed to DockerHub: https://github.com/eiffel-community/eiffel-intelligence/blob/master/src/main/docker/Dockerfile That image is based on Tomcat. Earlier the Docker image was built with a Java image as base image: https://github.com/eiffel-community/eiffel-intelligence/commit/7954f76fdc2d10b9f9386e262af52a30f03c0b6f

EI Backend image 2.0.1 has Tomcat in it: https://hub.docker.com/layers/eiffelericsson/eiffel-intelligence-backend/2.0.1/images/sha256-cc4121c070776d5901079cda9b2e9fa557430caed33c346050a7196753a12dec?context=explore and up until 0.0.18 it was based on Java: https://hub.docker.com/layers/eiffelericsson/eiffel-intelligence-backend/0.0.18/images/sha256-8af6c773e708f3fd85686cda1e29e9a311827912296dfbc2b96ba2748483810a?context=explore

tobiasake commented 1 year ago

Just one input: if someone wants to test the newer version with the same code as in earlier versions, they could try adding back the unlimited threading in the subscriptionHandler code and run a test to see if that solves the performance issue. The code that needs to be restored can be found in this commit diff: https://github.com/eiffel-community/eiffel-intelligence/commit/0937cc28f728e40e8acd374022e9ccb658c33d8a#diff-67101393a8229c6b8e5741ec64e1254bc2ca652f9262a805c5148d37cbc50c54

Also check the thread pool size property values that were used in previous Eiffel Intelligence versions.

But increasing the thread core pool and queue size properties should result in almost the same behavior, I think. According to the Java docs, setting the thread queue size property to zero will result in an unlimited thread pool queue, so that could be tested as well.
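Whether a zero queue capacity really gives an unbounded queue depends on the executor implementation. Suppose the settings reach Spring's ThreadPoolTaskExecutor; that is an assumption, since this thread does not say which executor EI uses. Its documentation states that a positive queueCapacity creates a LinkedBlockingQueue, and any other value creates a SynchronousQueue. A SynchronousQueue stores no tasks at all. Each task is handed directly to an idle thread, a new thread is started if the maximum pool size allows, and otherwise the task is rejected. The sketch below reproduces this with a plain ThreadPoolExecutor. The class name and sizes are made up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class DirectHandoffDemo {
    // Submits `tasks` blocking tasks to a pool backed by a SynchronousQueue
    // (direct handoff, no buffering) and returns {threads started, tasks rejected}.
    static int[] submit(int core, int max, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS, new SynchronousQueue<>());
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // keep every worker busy
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // no idle thread to hand off to and already at max pool size
            }
        }
        int[] result = {pool.getPoolSize(), rejected};
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result;
    }

    public static void main(String[] args) {
        int[] r = submit(2, 3, 5);
        System.out.println("threads=" + r[0] + " rejected=" + r[1]);
    }
}
```

With core 2 and max 3, five blocked tasks start three threads, and the remaining two are rejected. Under this assumption, a capacity of 0 removes buffering rather than making the queue unlimited.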

These are just some more ideas that could be tested, but maybe what I wrote here has already been tested?

domis322 commented 1 year ago

Hi,

So I tried building the images myself for both versions and running them. I also tried running the existing image, and running an image after injecting a war file from a newer version. In all cases, the newer version seems to have worse performance.

@tobiasake I tried the values from the mentioned commit, both before and after the change, and they seem to have similar performance in this case (so not that good). I tried increasing the values to over 10 000, which does seem to spawn more threads (PIDs in the picture below), but it doesn't help consume the events faster.

[image]

Setting the thread queue size to zero did not help either.

I think there might be a bottleneck outside the multithreaded part, maybe because in the new version getAllDocuments() is called on the main thread?

[image]
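The serial-bottleneck hypothesis above can be reasoned about with Amdahl's law. Suppose a fraction s of the per-event work runs on a single thread, for example a database call on the consuming thread; that is a guess, not a confirmed cause. Then the speedup from n worker threads is at most 1 / (s + (1 - s) / n), and it can never exceed 1 / s however large n gets. That would fit the observation that raising the thread counts spawned more threads without consuming events faster. The class name and serial fractions below are illustrative only.

```java
public class AmdahlBound {
    // Upper bound on speedup with n parallel workers when a fraction
    // `serial` (between 0 and 1) of the work cannot be parallelised.
    static double speedup(double serial, int n) {
        if (serial < 0 || serial > 1 || n < 1) {
            throw new IllegalArgumentException("serial must be in [0,1] and n >= 1");
        }
        return 1.0 / (serial + (1.0 - serial) / n);
    }

    public static void main(String[] args) {
        // Hypothetical serial fractions; thread counts echo the values tried in this thread.
        double[] serialFractions = {0.05, 0.25, 0.5};
        int[] threadCounts = {200, 2000, 10000};
        for (double s : serialFractions) {
            for (int n : threadCounts) {
                System.out.printf("serial=%.2f threads=%5d max speedup=%.2f%n", s, n, speedup(s, n));
            }
        }
    }
}
```

Going from 200 to 10 000 threads barely moves the bound once the serial part dominates. Finding and shortening the serial step would matter more than adding threads.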
z-sztrom commented 3 months ago

@domis322, please provide your test results when using a completely empty MongoDB.

domis322 commented 3 months ago

I recently tried running both versions connected to a completely empty MongoDB. This seems to have fixed the issue completely: both versions now have the same performance. I couldn't figure out exactly what in the database caused the performance issues for the newer version, but it has now been running smoothly for over a month. I think the issue can therefore be closed.

e-backmark-ericsson commented 3 months ago

Thanks! Closing the issue.