cubewise-code / rushti

Smooth parallelization of TI Processes with TM1py
https://code.cubewise.com/tm1py-help-content/run-processes-in-parallel-using-only-connection
MIT License

RushTI Drop off in Performance #76

Open pdvanderberg opened 11 months ago

pdvanderberg commented 11 months ago

We are using RushTI for a multithreaded run of a process across 12,500 elements; this one process kicks off 7 sub-processes for each element.

We have noticed that after the first few hundred elements the processes start taking longer, dropping from 25 seconds at the start to 2-3 minutes by the time we reach articles around the 6000 mark, with further drop-offs from there. We have now added code to the process we are calling to track each of the 7 sub-processes and their run times, and there is no difference in the run times of the sub-processes that matches this. Comparing the time when the epilog finishes with the time completion is logged in the RushTI log, we found the time difference we are seeing.
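For context, a run like this is typically driven by a RushTI tasks file with one line per execution. A minimal sketch of generating such a file, where the instance name ("tm1srv01"), process name ("article.load") and parameter name ("pArticle") are hypothetical placeholders, not the actual model's names:

```python
# Sketch: build a RushTI tasks file with one task line per element.
# The instance, process and parameter names below are assumptions.

def build_task_line(instance, process, **params):
    """Build one RushTI task line: instance, process, then named parameters."""
    parts = [f'instance="{instance}"', f'process="{process}"']
    parts += [f'{name}="{value}"' for name, value in params.items()]
    return " ".join(parts)

def write_tasks_file(path, elements):
    """Write one task per element, e.g. for ~12,500 article elements."""
    with open(path, "w") as f:
        for element in elements:
            f.write(build_task_line("tm1srv01", "article.load", pArticle=element) + "\n")
```

RushTI would then be started against the generated file with the desired number of worker threads (e.g. 35, 40 or 70 as described above).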

This behavior has been observed on 2 different servers: the current prod server with 48 cores, running at 35 and 40 threads, and our new prod server being brought online with 188 cores, running at 70 threads. We initially believed the issue was related to the age and capacity of the current prod server.

I am trying to figure out where the issue originates and whether I should create a ticket with IBM, in case it is a PA issue where the server is unable to manage the process load.

Version

Article 80384236 process log; there is a 112-second difference between the last entry and the corresponding entry in the RushTI log below.

Element 80384236 process log

Article 80414291 process log; there is a 5-second difference between the last entry and the corresponding entry in the RushTI log below.

Element 80414291

Process log with 2 articles and their end times highlighted. The difference between the 2 articles' run times is 2 minutes, despite them running at exactly the same time.

RushTI Log

MariusWirtz commented 10 months ago

Issue moved to RushTI

I want to make sure I understand this correctly. The same TI process is executed ~12'000 times with different parameters. Each process takes roughly 25 seconds in the beginning, and towards the end each process takes up to 2-3 minutes. Correct?

The "elapsed time" per process that you see in the RushTI logs is in line with the elapsed time you find in the TM1 server log, right?

What if you reverse the order of the 12'000 elements? Do you see the same behavior?
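One quick way to test this, assuming a one-task-per-line tasks file, is to regenerate the file with the task lines reversed so the slow-running elements (around the 6000 mark and beyond) run first:

```python
# Sketch: reverse the task order in a RushTI tasks file to check whether
# the slowdown follows specific elements or the position in the run.

def reverse_tasks(lines):
    """Return the non-empty task lines in reverse order."""
    tasks = [line for line in lines if line.strip()]
    return list(reversed(tasks))

def reverse_tasks_file(src_path, dst_path):
    with open(src_path) as src:
        reversed_lines = reverse_tasks(src.readlines())
    with open(dst_path, "w") as dst:
        dst.writelines(reversed_lines)
```

If the drop-off still appears after the first few hundred tasks regardless of order, it points at cumulative server-side load rather than at the elements themselves.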

Do you observe anything "unnatural" in the TM1 Server or in the TM1 logs while the executions are happening (e.g. unexpected logins or user sessions, locking or blocking)?

I don't think it's a TM1py or RushTI issue TBH. I couldn't think of anything that RushTI/TM1py could do that would impact TM1's processing performance so drastically.

pdvanderberg commented 10 months ago

Hi Marius.

On your first question: correct, it drops from 25 seconds to 2-3 minutes, and I have noticed times where it takes 10 minutes. Based on timestamps I have added to the process, the delay occurs between the epilog finishing and the process closing on the TM1 server.

I have tested this with several of the processes we are using RushTI with, and as soon as you increase the number of repetitions, it happens. From a log perspective, the times match up exactly between RushTI and the TM1 server logs. Nothing unusual was happening on the server; only 2 of us had access at that stage during the testing.

Based on my discussions with IBM, this is the expected result; we went through a 2-month process trying to get a better answer than that from them. Even if you run it with RunProcess you get the exact same drop-off; it took time to build them a process and data set on their demo data to replicate it. IBM's solution is to run it in bigger chunks of data. We completed testing based on their advice, and even with the drop-off experienced, it is still quicker to run it with RushTI at an article level.
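IBM's "bigger chunks" suggestion amounts to passing a batch of elements per process run instead of one task per element. A minimal sketch, where the chunk size, instance/process names and the comma-delimited `pArticles` parameter are all assumptions (the consuming TI process would split the delimited string in its prolog):

```python
# Sketch: batch elements into larger chunks so each RushTI task covers
# many articles. Names and parameter format below are assumptions.

def chunk(elements, size):
    """Split elements into consecutive batches of at most `size`."""
    return [elements[i:i + size] for i in range(0, len(elements), size)]

def batch_tasks(elements, size, instance="tm1srv01", process="article.load.batch"):
    # Each task line carries a comma-separated batch of elements.
    return [
        f'instance="{instance}" process="{process}" pArticles="{",".join(batch)}"'
        for batch in chunk(elements, size)
    ]
```

With, say, 12,500 elements and a chunk size of 100, this yields 125 tasks instead of 12,500, trading per-task overhead against coarser parallelism.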

It doesn't impact business users, as they normally run the process for no more than 100 articles, but it does impact our big system runs, and it is making it difficult to move from ad hoc runs outside of business hours (to ensure completion) to running it more dynamically.