dragonflydb / dragonfly

A modern replacement for Redis and Memcached
https://www.dragonflydb.io/
Other
25.79k stars 950 forks source link

Duplicate fields in hashmap #2066

Closed fernandomacho closed 1 year ago

fernandomacho commented 1 year ago

I have a 3 nodes running dragonfly 1 master and 2 replicas. The access to dragonfly instances is over haproxy with 2 backend: 1 for write (master) and 1 for read (master/replica)

The kernel version on servers are 5.15.0-73-generic on all servers.

The hardware is: Master: 96 cores and 376 Gb ram Replica 1: 48 cores and 256 Gb ram Replica 1: 48 cores and 256 Gb ram

The load of clients is:

On Master:

Clients

connected_clients:402 client_read_buffer_bytes:14292480 blocked_clients:0 dispatch_queue_entries:0

On replica 1:

Clients

connected_clients:188 client_read_buffer_bytes:52480 blocked_clients:0 dispatch_queue_entries:108365

On replica 2: connected_clients:202 client_read_buffer_bytes:54272 blocked_clients:0 dispatch_queue_entries:61381

The replication status is: On Master: role:master connected_slaves:2 slave0:ip=172.16.0.23,port=6392,state=stable_sync,lag=74 slave1:ip=172.16.0.25,port=6392,state=stable_sync,lag=61 master_replid:2c604ff0c72153701e8be30d707aa72d8f9d4ca9

On replica 1: role:replica master_host:172.16.0.24 master_port:6392 master_link_status:up master_last_io_seconds_ago:0 master_sync_in_progress:0

On replica 2: role:replica master_host:172.16.0.24 master_port:6392 master_link_status:up master_last_io_seconds_ago:0 master_sync_in_progress:0

The CPU on master consum is over 20% The CPU on replica 1 consum is over 370% The CPU on replica 1 consum is over 300%

Any log on dragonfly log files.

The nodes run sentinel and show multiple: 1303:X 24 Oct 2023 19:02:26.831 # +sdown slave 172.16.0.23:6392 172.16.0.23 6392 @ ovh-innova 172.16.0.24 6392

I think these errors are due to the high CPU load of the replica nodes

How is it possible that for many more clients the CPU load of the master is much lower than that of the replicas??

I attach info command on all nodes.

Thanks dfly_info.log

romange commented 1 year ago

Thanks for reporting this. Can you please go to replica url: http://replica:6379/profilez?profile=on this will start profiling the server, you should see in the INFO log the path to .prof file. Then, after 30s change the url to http://replica:6379/profilez?profile=off to flush the profiler file. We will need this file to see what the bottlenecks are.

romange commented 1 year ago

if you install pprof on your replica machine before starting the profiling, the browser will render the profiler graph that is also useful.

fernandomacho commented 1 year ago

Hi I try to access this URL, on my case 172.16.0.25:6392/profilez?profile=on over browser and get beautiful image ;)

image

But when go to 172.16.0.25:6392/profilez?profile=off redirect to http://172.16.0.25:6392/filez?file=/tmp/profile/dragonfly_24102023_183917.prof.svg and get this error

image

On /tmp directory not exist the file...

Do you kown on that ubuntu package is pprof ?

romange commented 1 year ago

where dragonfly logs are located? did you install dragonfly from a debian package?

fernandomacho commented 1 year ago

The logs is on /var/log/dragonfly this directory has nothing but dragonfly's log files. If you need me to install pprof, I can't find the ubuntu package where that "utility" is located

romange commented 1 year ago

can you check if /tmp/ has profile directory and whether it has anything?

fernandomacho commented 1 year ago

Yes I check /tmp directory and not contain profile directory I test profile on other replica and the result is the same no profile directory on /tmp

fernandomacho commented 1 year ago

I show on logs:

`I20231024 20:58:14.724368 2454603 profilez_handler.cc:53] Starting profiling into /tmp/profile/dragonfly_24102023_185814.prof 1

I20231024 20:58:48.925976 2456502 profilez_handler.cc:70] Symbols /usr/bin/dragonfly.debug, suffix: _24102023_185814.prof

I20231024 20:58:48.926085 2456502 profilez_handler.cc:84] Running command: nice -n 15 pprof -noinlines -lines -unit ms --svg /usr/bin/dragonfly /tmp/profile/dragonfly_24102023_185814.prof > /tmp/profile/dragonfly_24102023_185814.prof.svg 2> /tmp/profile/dragonfly_24102023_185814.prof.err

I20231024 20:58:48.929224 2456502 profilez_handler.cc:95] Redirecting to filez?file=/tmp/profile/dragonfly_24102023_185814.prof.svg

I20231024 21:00:18.047973 4010978 save_stages_controller.cc:245] Saving "/var/lib/dragonfly/dump-2023-10-24T21:00:10-summary.dfs" finished after 7.71 s`

But /tmp/profile not exits

I try to create manually /tmp/profile and change permissions to 777 an re run the profile. The result is the same nothing write on /tmp/profile

fernandomacho commented 1 year ago

I install pprof go command ;)

When run profile show on log

I20231024 21:19:56.859346 2576396 profilez_handler.cc:53] Starting profiling into /tmp/profile/dragonfly_24102023_191956.prof 1

but the file is not created

UkuLoskit commented 1 year ago

I think the directory that you are checking as /tmp is not the directory that is /tmp for Dragonfly. In your example, you navigated to the svg file which gave an error that the file was empty (and thus an invalid SVG). This happens when pprof is not installed, because the command pipeline will create an empty file on error. Therefore /tmp/profile... exists. You still did not mention how you are running Dragonfly. Is it perhaps within a container?

fernandomacho commented 1 year ago

nop, I run dragonfly on bare metal

romange commented 1 year ago

our systemd has private tmp folder. please do the following:

dfpid=$(pidof dragonfly)
sudo nsenter -t $dfpid -m ls /tmp
fernandomacho commented 1 year ago

ok, I show the files, how copy this files?

image
UkuLoskit commented 1 year ago
ls /proc/$(pidof dragonfly)/root/tmp/profile

and then you can copy from the private tmp namespace to the regular /tmp dir

cp /proc/$(pidof dragonfly)/root/tmp/profile/something.prof /tmp/realtmp.prof
fernandomacho commented 1 year ago

Archive.zip

romange commented 1 year ago

Thank you @UkuLoskit - I did not know this.

romange commented 1 year ago

Unfortunately the profiler data is not very useful, or at least looks very, very weird as if it shows wrong function names. what's your ubuntu version and the architecture uname -a and cat /etc/os-release ?

romange commented 1 year ago

Also, can you please provide "info all" output on the master and on the replica?

fernandomacho commented 1 year ago

master Linux ovh-innova 5.15.0-73-generic #80~20.04.1-Ubuntu SMP Wed May 17 14:58:14 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

NAME="Ubuntu" VERSION="20.04.6 LTS (Focal Fossa)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 20.04.6 LTS" VERSION_ID="20.04" HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" VERSION_CODENAME=focal UBUNTU_CODENAME=focal


replica 1 Linux innova2 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

PRETTY_NAME="Ubuntu 22.04.3 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.3 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" UBUNTU_CODENAME=jammy


replica 2 Linux innova2 5.15.0-73-generic #80-Ubuntu SMP Mon May 15 15:18:26 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

PRETTY_NAME="Ubuntu 22.04.3 LTS" NAME="Ubuntu" VERSION_ID="22.04" VERSION="22.04.3 LTS (Jammy Jellyfish)" VERSION_CODENAME=jammy ID=ubuntu ID_LIKE=debian HOME_URL="https://www.ubuntu.com/" SUPPORT_URL="https://help.ubuntu.com/" BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" UBUNTU_CODENAME=jammy dfly_info.log

fernandomacho commented 1 year ago

I try 1.- Stop dragonfly on replica 1 2.- rm /var/lib/dragonfly/* 3.- start dragonfly on replica 1

Get this error: E20231025 07:30:31.125586 866245 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.135639 866244 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.135838 866246 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.142112 866248 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.147995 866241 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.149091 866247 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.161890 866243 rdb_load.cc:619] Duplicate hash fields detected E20231025 07:30:31.177135 866242 rdb_load.cc:619] Duplicate hash fields detected W20231025 07:30:31.178701 866243 replica.cc:214] Error syncing with 172.16.0.24:6392 dragonfly.rdbload:5 Internal error when loading RDB file 5

romange commented 1 year ago

can you please DM me on discord? will be easier to talk

romange commented 1 year ago

I would like to give you a dev version of dragonfly that will help with understanding both issues you are experiencing. Please tell me if you are interested.

fernandomacho commented 1 year ago

Hello, I think the problem comes from "duplicity of sets". I have stopped the three nodes and tried to start in order. When I tried to start the master I got the duplicate sets error and the only way to start was to delete the saved files. Are you interested in me sending you the database files?

romange commented 1 year ago

yes, very much so

fernandomacho commented 1 year ago

Can I send you the files privately?

romange commented 1 year ago

sure. Can you DM me on discord @fernandomacho ?

fernandomacho commented 1 year ago

I just created an account on discord, I've never used this, from what I see I have to ask you for contact right? Your username on discord is romange?