CPChain / chain

Mirror of https://bitbucket.org/cpchain/chain
GNU General Public License v3.0

Can't sync chain anymore on my rnode and I can't see the error message #108

Closed: nuht closed this issue 4 years ago

nuht commented 4 years ago

Hi,

My disk space was full, which caused a lot of errors on my rnode. I couldn't stop it properly, so I just created a new server to sync the chain again and make it work; at that point I was still using 0.4.7.

I tried to run the latest version, and after around 400k blocks synced I got a big error. I couldn't see why it stopped because the message is too long; all I can see is this: https://imgur.com/XLNJS1g If I launch the run command again, it produces the same error over and over, roughly every 50k blocks...

I used the latest cpchain release. I'm on Ubuntu.

What can I do? I haven't been able to get my node running again in almost three days, since I noticed the disk space was full.

Where does the bug occur? Every time I try to sync the chain.

What is your wallet address and your balance? 0x8C02681204080daf5B50A1A567A52d9562b99Ab6

Hope someone can help me, thanks!

cpchainbot commented 4 years ago

The latest cpchain program can free up your disk; you can try that first.

e.g.

./cpchain chain delete dpor- --datadir ~/.cpchain

You can restart the rnode after cleaning up the disk.
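
For reference, a minimal cleanup-and-restart sequence might look like the sketch below (the plain ./cpchain run start command is the one used elsewhere in this thread; an rnode setup may need extra flags):

# stop the running cpchain process first, if it is still alive
pkill cpchain

# delete the dpor chain data to free disk space
./cpchain chain delete dpor- --datadir ~/.cpchain

# restart the node (add your usual rnode options here)
./cpchain run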

The error you got looks like a goroutine waiting too long. We'll test it.

cpchainbot commented 4 years ago

Sorry, we can't reproduce this bug when we sync the chain. If this error still occurs, try running the command without extra parameters; just run ./cpchain run to test.

nuht commented 4 years ago

Hi, thanks for answering me. As you can see in my screenshot, I already tried the plain ./cpchain run command yesterday before going to sleep. I woke up today to the same goroutine error, I think; I started it again around block 600k... I also tried another OS yesterday: I did a fresh install on CentOS and it ended up the same as Ubuntu 18.04.

Now I'm trying to do a fresh install; I already deleted the old rnode server, so can I still try ./cpchain chain delete dpor- --datadir ~/.cpchain or not?

Edit: I started the run command again and it crashed after 160k blocks https://imgur.com/a/4FaawYX

Edit: I did a lot of tries today, and I tried another server. I tried to sync with cpchain 0.5.1. It always syncs the first 500/600k blocks or so, then it ends the same way again...

cpchainbot commented 4 years ago

If you still have old cpchain data, you can try ./cpchain chain delete dpor- --datadir ~/.cpchain.

If you deleted the old server, you can try to recover the instance in your cloud service console; maybe the cloud disk will be kept for a period of time. If the cloud disk was also deleted, syncing the blocks is the only way...

We are adding more unit tests for the sync module. This is likely a bug that occurs when the network is poor and latency causes timeouts. We are trying to test with a simulated bad network, but sorry, we are not sure how long it will take to solve this bug...

Can you tell me where you live now? We will deploy a civilian node to offer a highly available connection; this may help...

nuht commented 4 years ago

Unfortunately I also deleted the cloud disk attached to the server, so my only option is a fresh install.

Today I watched a lot of blocks syncing. I'm not sure, but I think I saw a line saying 'sync timeout' just before the big error message; maybe that could help.

I'm living in France, but the server is located in Helsinki, I think.

So should I keep trying, or just wait until you do something specific for me? And should I use the latest cpchain file?

Thanks for trying!

Edit: https://imgur.com/a/ppXxdwD (screenshot taken just before the sync stopped). As I just said, I tried another time, and while watching blocks syncing there was another sync timeout error; then it froze for one or two minutes, and then the big error message with all the goroutines popped up.

cpchainbot commented 4 years ago

Take the latest cpchain file. We will deploy a civilian node in Stockholm (AWS) today.

You can run the cpchain process in a docker container with --restart=always, like this:

sudo docker run --name cpchain --restart=always -d -v ${HOME}/.cpchain:/root/.cpchain liaojl/cpchain:0.5.2 cpchain run

Check the logs:

sudo docker logs -f --since 1m cpchain

Stop and remove the container:

sudo docker stop cpchain
sudo docker rm cpchain

If the panic occurs, the cpchain process will restart automatically.
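
To confirm the restart policy is active, or to see how many times the container has already restarted, standard docker inspect queries like the ones below should work (nothing cpchain-specific here):

# show the configured restart policy for the container
sudo docker inspect -f '{{.HostConfig.RestartPolicy.Name}}' cpchain

# show how many times the container has been restarted so far
sudo docker inspect -f '{{.RestartCount}}' cpchain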

nuht commented 4 years ago

I had never used Docker before today, so I just installed it. I couldn't run the docker run command until I put a space between the -d and -v parameters; I don't know if that's important or not.

It seems to be running, but there is a new error when I look into the logs:

"system clock need to be synchronized.there is more than 10 seconds gap between ntp and this server". But when i use date or 'hwclock' commands i got the got time. Ntp server is not accessible inside docker container ?

cpchainbot commented 4 years ago

make a space between -d and -v

"system clock need to be synchronized.there is more than 10 seconds gap between ntp and this server" is a fatal error reported by the cpchain process. So if you can run ./cpchain run, you can run it with Docker.

cpchainbot commented 4 years ago

How much RAM does your machine have (the cpchain process uses a lot of RAM when syncing)? The cpchain process gets a fatal error when it hits OOM. You can also check whether the logs contain fatal error: runtime: out of memory.
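
A quick way to look for that message in the container logs (the grep pattern is just an example):

# search the cpchain container logs for the Go runtime OOM fatal error
sudo docker logs cpchain 2>&1 | grep -i "out of memory"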

nuht commented 4 years ago

> make a space between -d and -v
>
> "system clock need to be synchronized.there is more than 10 seconds gap between ntp and this server" is a fatal error reported by the cpchain process. So if you can run ./cpchain run, you can run it with Docker.

So I just retried everything again on a new server using Docker, and now the NTP error has disappeared; I can see the chain syncing in the log.

Assuming I can sync the whole chain using Docker, what is the next step to keep syncing blocks and run my node outside Docker without problems?

> How much RAM does your machine have (the cpchain process uses a lot of RAM when syncing)? The cpchain process gets a fatal error when it hits OOM. You can also check whether the logs contain fatal error: runtime: out of memory.

I also checked the RAM usage of the cpchain process; it uses around 60%, which seems OK to me.
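
For reference, a one-shot snapshot of the container's live memory usage can be taken with a standard docker command (nothing cpchain-specific):

# print current CPU and memory usage of the cpchain container once, then exit
sudo docker stats --no-stream cpchain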

Today I tested chain syncing again; the cpchain process got killed for no reason https://imgur.com/a/QJpOYOJ and then I got all the goroutine errors.

Thanks!

cpchainbot commented 4 years ago

No problem. -v $HOME/.cpchain:/root/.cpchain makes a directory mapping between the machine and the container. You can check this directory with cd $HOME/.cpchain.

When you run cpchain run, you can specify --datadir; the default value of this parameter is $HOME/.cpchain. So the data-dir of a cpchain process running outside Docker is the same as the one used by the process in the container.

If it just got killed for no reason, it may have been OOM and killed by the OS. You can use dmesg -T to check.
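
Both checks above, as a minimal sketch (standard Linux and docker usage; nothing cpchain-specific):

# confirm the data directory is shared between the host and the container
ls -lh $HOME/.cpchain

# look for kernel OOM-killer entries with human-readable timestamps
sudo dmesg -T | grep -iE "out of memory|killed process"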

nuht commented 4 years ago

Perfect, I understand everything now. I hope it will be okay; I will give you an update tomorrow about the syncing.

I hope the node will be stable after syncing the chain like this...

Thanks for all the details, it's very helpful!

nuht commented 4 years ago

> No problem. -v $HOME/.cpchain:/root/.cpchain makes a directory mapping between the machine and the container. You can check this directory with cd $HOME/.cpchain.
>
> When you run cpchain run, you can specify --datadir; the default value of this parameter is $HOME/.cpchain. So the data-dir of a cpchain process running outside Docker is the same as the one used by the process in the container.
>
> If it just got killed for no reason, it may have been OOM and killed by the OS. You can use dmesg -T to check.

Hi, so today I have an answer about my problem: it was an OOM error. With the dmesg -T command I can see it runs out of memory every hour or every 30 minutes... How is this possible? Lots of people are running a node with the same specs. Will I be able to run my node after syncing the chain without any more OOM errors?

cpchainbot commented 4 years ago

No, after syncing all blocks it won't get OOM anymore. We are trying to refine the syncing module with a pipeline method...

nuht commented 4 years ago

Hello, some news: I could restart my rnode today. I hope it will be stable.

Out of curiosity I checked the size of my cpchain folder; it says 31G, which already seems like a lot. What do you think?

https://imgur.com/a/aZ9mQwC

Do you think I could copy these 2.7 million blocks somewhere on my personal computer in case I have trouble later? It would be quicker than syncing everything again, I think.

cpchainbot commented 4 years ago

It's normal. After syncing all blocks, the data-dir will probably be around 45G.

I think the better idea is to make snapshots of the cloud disk periodically in your cloud server console. Copying and transferring the data would take a lot of time.
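
A simple way to keep an eye on the data directory size between snapshots (plain coreutils, nothing cpchain-specific):

# total size of the cpchain data directory
du -sh $HOME/.cpchain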

nuht commented 4 years ago

> It's normal. After syncing all blocks, the data-dir will probably be around 45G.
>
> I think the better idea is to make snapshots of the cloud disk periodically in your cloud server console. Copying and transferring the data would take a lot of time.

Thank you so much for your help. I will take periodic snapshots of my disk in case I need to set up another VPS from scratch! I learned a lot; we should consider my issue solved, I think.

nuht commented 4 years ago

Hi, today I got another crash of the cpchain process. How can I tell whether this is still an OOM error? It was running fine for almost 24 hours.

Maybe I should consider upgrading my VPS to 4GB of RAM. https://imgur.com/a/amwJW2r

cpchainbot commented 4 years ago

If this is still an OOM, you will see it with dmesg -T. If there are no messages in dmesg, it isn't OOM.

We suggest using a VPS with 4GB of RAM and a 100GB SSD. This VPS configuration avoids many problems and can run for a long time.
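
To check a machine against that suggestion, the usual commands are (assuming the default data-dir location):

# total and available RAM
free -h

# size and free space of the filesystem holding the data-dir
df -h $HOME/.cpchain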

nuht commented 4 years ago

[Mon May 11 00:21:26 2020] Out of memory: Killed process 10862 (cpchain) total-vm:1767340kB, anon-rss:1655156kB, file-rss:0kB, shmem-rss:0kB

I was tired. I have no logs from the dmesg -T command since [Mon May 11 04:38:13 2020]; that doesn't make sense, because it stopped about 4-5 hours ago, just before midnight. So how can I be sure about the origin of the problem?
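
If the kernel ring buffer has already rotated past the crash window, the persistent systemd journal (when it is enabled) keeps kernel messages longer; a sketch, using the time window from the log line above:

# search kernel messages around the crash for OOM-killer activity
sudo journalctl -k --since "2020-05-10 23:00" --until "2020-05-11 01:00" | grep -iE "out of memory|killed process"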

cpchainbot commented 4 years ago

Sorry, it's hard to debug... maybe you need to check the VPS's logs in the cloud server console...

nuht commented 4 years ago

Don't worry about me, I think I'm good for now; I just upgraded the VPS. I think it was still an OOM, because whenever I checked, the RAM was always running short. Thanks for your help!