Open lgharib opened 2 years ago
Since the MongoDB process's CPU consumption is significant during the pipeline analysis, another approach would be to host it on a dedicated service that can handle the load and scale depending on the demand. In the screenshot of htop below, we can see that MongoDB uses 66.7% of the CPU, whereas the e-mission-server API uses 19.8% and the pipeline script 3.2%.
First, some clarifications:
Not sure about that one: analyse just the new data recorded for a user. Example: I'm a user and my data are analysed each hour; if I don't travel during the next hour, there will be no analysis for me. The analysis will run only for the users who have travelled.
This is what currently happens. If you think this is not happening, please look at the logs or at the edb.get_pipeline_state_db() collection. For each stage of the pipeline, we track how far we have processed, and we only process new data.
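To make the incremental behaviour concrete, here is a minimal sketch of the idea: each pipeline stage remembers a "last processed" timestamp, and a rerun only touches entries newer than that. The function name and the entry shape below are purely illustrative, not the real e-mission API.

```python
# Hypothetical sketch of incremental processing: a stage only sees entries
# recorded after its last processed timestamp.
def get_new_entries(entries, last_processed_ts):
    """Return only the entries recorded after the stage's last processed timestamp."""
    return [e for e in entries if e["ts"] > last_processed_ts]

entries = [{"ts": 50}, {"ts": 100}, {"ts": 150}]
# Pipeline state says we already processed up to ts=100, so only ts=150 remains.
new_entries = get_new_entries(entries, last_processed_ts=100)
print(new_entries)  # [{'ts': 150}]
```

This is why a user who did not travel in the last hour costs essentially nothing on the next run: the filtered list is empty.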
Increase the capacity of the server we're using: what are the characteristics of the server you're using for OpenPATH? Does it work without any issues?
I'm currently using a t3.2xlarge instance for the CanBikeCo data collection: ~150 users total, ~100 active users, 1.5 years of data total. Note that the users are split into multiple separate mongo containers, but all the containers run on the same instance, and the webapp and intake pipeline also run on that instance.
On the other hand, I have tried to execute the pipeline on my local machine, which has an AMD® Ryzen 9 3900X 12-core processor (24 threads) with 32.0 GiB of memory. The machine started to be very laggy, but after a few hours of pipeline execution the process finished SUCCESSFULLY. I uploaded the dump result back to the server, but the pipeline still needs more CPU power (more than 2 CPUs) to execute, leading to the previously described situation.
How did you load back the dump? What do the pipeline logs show on this run?
When we try to run the pipeline for all the users using ./e-mission-py.bash bin/intake_multiprocess.py 3 …
How are you running it for the 9 users? How many users in parallel? ./e-mission-py.bash bin/intake_multiprocess.py 3 runs three users in parallel at a time. Are they the same 9 users every time? Note that, because the pipeline is incremental, running it periodically will result in much lower CPU/memory consumption than running it for a user for whom you have never run the pipeline.
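For reference, the "3 users in parallel" behaviour is essentially a worker pool over the user list. The sketch below illustrates that shape; `run_intake_for_user` is a placeholder, not the real function in bin/intake_multiprocess.py.

```python
from multiprocessing import Pool

# Illustrative stand-in for the real per-user intake pipeline; the body is a
# placeholder so the parallelism pattern is visible on its own.
def run_intake_for_user(user_id):
    return f"processed {user_id}"

if __name__ == "__main__":
    users = [f"user_{i}" for i in range(9)]
    # A pool of 3 mirrors "intake_multiprocess.py 3": at most 3 users at a time.
    with Pool(processes=3) as pool:
        results = pool.map(run_intake_for_user, users)
    print(results)
```

With 9 users and a pool of 3, you get three waves of three users each, which bounds peak CPU/memory at roughly three concurrent pipeline runs.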
Again, the clues for trying to debug the pipeline are in the detailed pipeline logs (stored by default at /var/tmp/intake_*).
I also want to highlight two other aspects:
- The number of entries we read from the database at a time is configurable (in conf/storage/db.conf). You could try dialing that down, but note that if you make it too low and you have a long trip on iOS, you will not be able to read all the entries for the trip at a time, so the pipeline will get stuck.
- I am thinking of bumping up the distance filter on iOS to 5 meters to reduce resource usage on the server; you could try that on your app as well.
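For reference, if I remember correctly the sample config shipped with e-mission-server (conf/storage/db.conf.sample) looks roughly like the fragment below, with `result_limit` controlling how many entries are read at a time. Treat the key names as an assumption and double-check them against the sample file in your checkout before editing.

```json
{
    "timeseries": {
        "url": "localhost",
        "result_limit": 250000
    }
}
```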
We are trying to solve the problem with the server and the analysis pipeline, which breaks every time. What is your server's configuration, and how many users are analysed?
We defined some strategies to test but before jumping into them, we wanted to exchange with you on that subject:
For the moment we're using a server with 4 GB of memory / 2 vCPUs and 60 GB of disk, and we analyse the data of 9 users out of 278. This is what it looks like:
If you look at the memory, MongoDB uses around 60% of the 4 GB of memory. The database size for the total of 278 user profiles (db.Stage_Profiles.find({}).count()) is:
DataSize: 10.117961496 GB
StorageSize: 2.687246336 GB
The scale defaults to 1 to return size data in bytes
Ref: https://www.mongodb.com/docs/manual/reference/command/dbStats/#command-fields
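Since dbStats reports raw bytes at the default scale of 1, the conversion to the GB figures above is just a division by 1e9:

```python
# dbStats sizes are in bytes when scale is left at its default of 1,
# so dividing by 1e9 gives the GB values quoted above.
dataSize_bytes = 10_117_961_496
storageSize_bytes = 2_687_246_336

print(round(dataSize_bytes / 1e9, 2))     # 10.12 (GB)
print(round(storageSize_bytes / 1e9, 2))  # 2.69 (GB)
```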
When the pipeline starts for the 9 users, 28.04% of the CPU is used.
When we try to run the pipeline for all the users using ./e-mission-py.bash bin/intake_multiprocess.py 3, the memory won't exceed 90%; however, the CPUs reach their limit at 100% and MongoDB crashes. The pipeline can no longer talk to the database and it breaks.
From the e-mission logs:
Testing on my local machine:
On the other hand, I have tried to execute the pipeline on my local machine, which has an AMD® Ryzen 9 3900X 12-core processor (24 threads) with 32.0 GiB of memory. The machine started to be very laggy, but after a few hours of pipeline execution the process finished SUCCESSFULLY. I uploaded the dump result back to the server, but the pipeline still needs more CPU power (more than 2 CPUs) to execute, leading to the previously described situation.
The strategies we defined to fix this issue:
- Delete the old data for the users that are no longer using the app. Before that, we'll convert it and put it on a dashboard. We suppose that will free some space and increase the number of users whose data we can analyse at the same time, including the new users who will subscribe in the future.
- Increase the capacity of the server we're using: what are the characteristics of the server you're using for OpenPATH? Does it work without any issues?
- Not sure about that one: analyse just the new data recorded for a user. Example: I'm a user and my data are analysed each hour; if I don't travel during the next hour, there will be no analysis for me. The analysis will run only for the users who have travelled.
What are your recommendations?