e-mission / e-mission-docs

Repository for docs and issues. If you need help, please file an issue here. Public conversations are better for open source projects than private email.
https://e-mission.readthedocs.io/en/latest
BSD 3-Clause "New" or "Revised" License

The right Server configuration CPU and Memory #807

Open lgharib opened 2 years ago

lgharib commented 2 years ago

We are trying to solve a problem with our server: the analysis pipeline breaks every time it runs.

What is your server's configuration, and how many users are analysed?

We defined some strategies to test, but before jumping into them, we wanted to discuss them with you:

For the moment we're using a server with 4 GB of RAM, 2 vCPUs and 60 GB of disk. We analyse the data of 9 users out of 278. This is what it looks like:

If you look at the memory, MongoDB uses around 60% of the 4 GB. For the 278 user profiles in total (db.Stage_Profiles.find({}).count()), the database size is:

dataSize: 10.117961496 GB

storageSize: 2.687246336 GB

> db.stats()
{
        "db" : "Stage_database",
        "collections" : 8,
        "views" : 0,
        "objects" : 18794927,
        "avgObjSize" : 538.334705742672,
        "dataSize" : 10117961496,
        "storageSize" : 2687246336,
        "numExtents" : 0,
        "indexes" : 69,
        "indexSize" : 1172217856,
        "fsUsedSize" : 52889665536,
        "fsTotalSize" : 62241562624,
        "ok" : 1
}

The scale defaults to 1, so db.stats() returns sizes in bytes.

Ref: https://www.mongodb.com/docs/manual/reference/command/dbStats/#command-fields
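Since dbStats reports bytes at the default scale, the GB figures above are just the byte counts divided by 10^9 (the comma in "10,117961496" is a decimal separator). In the mongo shell, db.stats(1024*1024*1024) would report GiB directly; a quick sanity check of the numbers in Python:

```python
# dbStats values are in bytes when scale=1 (the default).
data_size_bytes = 10117961496
storage_size_bytes = 2687246336

# Decimal gigabytes, matching the "10,117961496 GB" figure above
print(data_size_bytes / 10**9)            # 10.117961496
# Binary gibibytes, which is what tools like htop usually display
print(round(data_size_bytes / 2**30, 2))  # 9.42
```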

When the pipeline starts for the 9 users, 28.04% of the CPU is used. [screenshot]

When we try to run the pipeline for all the users with ./e-mission-py.bash bin/intake_multiprocess.py 3, memory usage stays below 90%, but the CPUs hit their limit at 100% and MongoDB crashes.

The pipeline can no longer talk to the database and it breaks.

(emission) root@e-mission-server-fabmob-qc:~/e-mission-server# journalctl -f -u mongodb.service
Oct 06 09:47:52 e-mission-server-fabmob-qc systemd[1]: mongodb.service: Main process exited, code=killed, status=9/KILL
Oct 06 09:47:52 e-mission-server-fabmob-qc systemd[1]: mongodb.service: Failed with result 'signal'.
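The status=9/KILL above means mongod received SIGKILL. Under memory pressure that is usually the kernel OOM killer rather than a MongoDB bug; checking journalctl -k or dmesg for an "Out of memory: Killed process ... (mongod)" record would confirm it. A tiny, hypothetical helper for flagging such journal lines:

```python
import re

def killed_by_signal_9(journal_line: str) -> bool:
    """Return True if a systemd journal line reports the main process was
    killed with signal 9 (under memory pressure, typically the kernel OOM
    killer, which matches mongod dying during a heavy pipeline run)."""
    return re.search(r"code=killed, status=9/KILL", journal_line) is not None

line = ("Oct 06 09:47:52 e-mission-server-fabmob-qc systemd[1]: "
        "mongodb.service: Main process exited, code=killed, status=9/KILL")
print(killed_by_signal_9(line))  # True
```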

From the e-mission logs:

(emission) root@e-mission-server-fabmob-qc:~/e-mission-server# tail -f /var/log/intake.stdinout

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/root/miniconda-4.8.3/envs/emission/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap

    self.run()

...

pymongo.errors.ServerSelectionTimeoutError: 127.0.0.1:27017: [Errno 111] Connection refused, Timeout: 30s, Topology Description: <TopologyDescription id: 633ea2c11db8245d5b388304, topology_type: Single, servers: [<ServerDescription ('127.0.0.1', 27017) server_type: Unknown, rtt: None, error=AutoReconnect('127.0.0.1:27017: [Errno 111] Connection refused')>]>

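The ServerSelectionTimeoutError just means pymongo could not reach mongod after it was killed; once mongod restarts, a retry succeeds. A minimal retry-with-backoff sketch (the exception class here is a stand-in so the snippet runs without pymongo; against a real server you would catch pymongo.errors.ServerSelectionTimeoutError):

```python
import time

class ServerSelectionTimeoutError(Exception):
    """Stand-in for pymongo.errors.ServerSelectionTimeoutError."""

def with_retries(op, attempts=3, base_delay=0.01):
    """Call op(); on a server-selection timeout, back off exponentially
    and retry, re-raising after the final attempt."""
    for i in range(attempts):
        try:
            return op()
        except ServerSelectionTimeoutError:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** i)

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:  # fail twice, then succeed
        raise ServerSelectionTimeoutError("127.0.0.1:27017 refused")
    return "ok"

result = with_retries(flaky)
print(result)  # ok
```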

Testing on my local machine:

Testing on my local machine: I tried to execute the pipeline on my local machine, which has an AMD® Ryzen 9 3900X 12-core processor (24 threads) and 32.0 GiB of memory. The machine became very laggy, but after a few hours the pipeline finished SUCCESSFULLY. I uploaded the resulting dump back to the server, but the pipeline still needs more CPU power (more than 2 CPUs) to execute, which leads to the situation described above.

The strategies we defined to fix this issue:

1. Delete the old data of users who no longer use the app. Before deleting, we'll convert that data and put it on a dashboard. We expect this to free some space and let us analyse more users at the same time, including new users who subscribe in the future.
2. Increase the capacity of the server we're using: what are the characteristics of the server you're using for OpenPATH? Does it work without any issues?
3. Not sure about this one: analyse only the new data recorded for a user. Example: if my data is analysed each hour and I don't travel during the next hour, no analysis runs for me. Analysis would run only for users who have travelled.
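Strategy 1 (pruning inactive users) could start from a query along these lines. The field names (user_id, last_ts) are assumptions for illustration, shown against plain dicts so the sketch runs without a live MongoDB; with pymongo you would build the equivalent filter against the profile/timeseries collections.

```python
import time

def inactive_user_ids(profiles, now, cutoff_days=180):
    """Return the ids of users whose most recent activity timestamp is
    older than the cutoff. `profiles` stands in for documents from a
    profile collection; last_ts is a hypothetical last-activity field."""
    cutoff = now - cutoff_days * 86400
    return [p["user_id"] for p in profiles if p.get("last_ts", 0) < cutoff]

now = time.time()
profiles = [
    {"user_id": "u1", "last_ts": now - 10 * 86400},   # active recently
    {"user_id": "u2", "last_ts": now - 400 * 86400},  # stale for over a year
]
print(inactive_user_ids(profiles, now))  # ['u2']
```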

What are your recommendations?

lgharib commented 2 years ago

Since the MongoDB process's CPU consumption is significant during the pipeline analysis, another approach would be to host it on a dedicated service that can handle the load and scale with demand. In the htop screenshot below, MongoDB uses 66.7% of the CPU, whereas the e-mission server API uses 19.8% and the pipeline script 3.2%. [htop screenshot]
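Wherever MongoDB ends up running, capping its WiredTiger cache leaves headroom for the pipeline on a small host. By default WiredTiger claims roughly 50% of (RAM − 1 GB), i.e. ~1.5 GB on a 4 GB box, and the OOM killer can still fire under combined load. A mongod.conf fragment (the 1 GB value is an assumption to tune for your workload, not a recommendation):

```yaml
# /etc/mongod.conf (fragment)
# Cap the WiredTiger cache so mongod leaves memory for the intake pipeline.
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 1
```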

shankari commented 2 years ago

First, some clarifications:

> Not sure about this one: analyse only the new data recorded for a user. Example: if my data is analysed each hour and I don't travel during the next hour, no analysis runs for me. Analysis would run only for users who have travelled.

This is what currently happens. If you think it is not happening, please look at the logs or at the edb.get_pipeline_state_db() collection. For each stage of the pipeline, we track how far we have processed, and we only process new data.

> Increase the capacity of the server we're using: what are the characteristics of the server you're using for OpenPATH? Does it work without any issues?

I'm currently using a t3.2xlarge instance for the CanBikeCo data collection; ~ 150 users total, ~ 100 active users, 1.5 years of data total. Note that the users are split into multiple separate mongo containers, but all the containers run in the same instance. And the webapp and intake pipeline are also running on the same instance.

> I tried to execute the pipeline on my local machine, which has an AMD® Ryzen 9 3900X 12-core processor (24 threads) and 32.0 GiB of memory. The machine became very laggy, but after a few hours the pipeline finished SUCCESSFULLY. I uploaded the resulting dump back to the server, but the pipeline still needs more CPU power (more than 2 CPUs) to execute, which leads to the situation described above.

How did you load back the dump? What do the pipeline logs show on this run?

> When we try to run the pipeline for all the users using

How are you running the pipeline for the 9 users? How many users in parallel? ./e-mission-py.bash bin/intake_multiprocess.py 3 runs three users in parallel at a time. Are they the same 9 users every time? Note that, because the pipeline is incremental, running it periodically results in much lower CPU/memory consumption than running it for a user for whom it has never run.
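For reference, the "3" argument behaves like a worker-pool size. A rough sketch of the pattern (using a thread-backed pool so it runs anywhere; the real intake_multiprocess.py uses separate processes, and run_intake here is a hypothetical stand-in for the per-user pipeline):

```python
from multiprocessing.dummy import Pool  # thread-backed, same API as multiprocessing.Pool

def run_intake(user_id):
    # Stand-in for running the full intake pipeline for one user
    return "processed %s" % user_id

users = ["user%d" % i for i in range(9)]
with Pool(processes=3) as pool:  # at most 3 users in flight at once
    results = pool.map(run_intake, users)
print(len(results))  # 9
```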

Again, the clues for trying to debug the pipeline are in the detailed pipeline logs (stored by default at /var/tmp/intake_*)

shankari commented 2 years ago

I also want to highlight two other aspects: