etsy / 411

An Alert Management Web Application
https://demo.fouroneone.io
MIT License

Server keeps crashing #93

Closed ogkpmg closed 7 years ago

ogkpmg commented 7 years ago

Hello everyone,

Is anyone else seeing the server become unresponsive over time when it is left running?

I restarted the server and now I get this warning: search_phase_execution_exception: all shards failed

and now I can't log in.

How often does this happen?

ogkpmg commented 7 years ago

Found something interesting: the database keeps getting locked because too many processes are trying to insert into it at the same time. This may be an important note for other users.

ogkpmg commented 7 years ago

Solution to unlock the db:

cmd: fuser data.db

output: /var/www/411/data.db: 3602 3625 3943 3944

cmds:
    kill 3625
    kill 3943
    sqlite3 data.db
    .databases
    .backup main backup.sqlite
    .exit
    mv data.db old.data.db
    mv backup.sqlite data.db
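For anyone who would rather script the backup step than type it into the sqlite3 shell, here is a minimal Python sketch of the same idea. It uses the standard library's `sqlite3.Connection.backup`, which is built on the same SQLite online backup API as the shell's `.backup` command. The function name `backup_database` and the paths are placeholders for illustration, and it does not cover killing the stuck processes or swapping the files afterwards.

```python
import sqlite3


def backup_database(src_path: str, dst_path: str) -> None:
    """Copy the SQLite database at src_path into dst_path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Page-by-page copy through SQLite's online backup API.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
```

Usage would look like `backup_database("data.db", "backup.sqlite")`, run from the directory that holds the database.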

I'm now looking at ways to handle concurrent inserts (since multiple servers may be sending logs to this server), or to configure things so that each insert waits until the previous one has completed before proceeding.

ogkpmg commented 7 years ago

It looks like worker.php accounts for most of these processes.

Still figuring out how to configure the wait-time for the processes to insert records to the database table.
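On the wait-time question, SQLite itself has two settings that are relevant: `PRAGMA busy_timeout`, which makes a connection wait for a lock instead of failing right away, and `PRAGMA journal_mode = WAL`, which lets readers keep going while a write is in progress. The sketch below shows them in Python purely as an illustration. It is not 411's code (411 is PHP), whether 411 exposes either setting is not something this thread establishes, and the helper name `open_with_wait` and the 5000 ms default are assumptions.

```python
import sqlite3


def open_with_wait(path: str, wait_ms: int = 5000) -> sqlite3.Connection:
    """Open a SQLite database that waits on locks instead of erroring out."""
    conn = sqlite3.connect(path)
    # Wait up to wait_ms milliseconds for a lock before raising an error.
    conn.execute(f"PRAGMA busy_timeout = {int(wait_ms)}")
    # Write-ahead logging: readers are not blocked by the single active writer.
    conn.execute("PRAGMA journal_mode = WAL")
    return conn
```

Even with both settings there is still only one writer at a time, so this smooths over bursts rather than raising the sustained write rate. WAL mode also needs all processes to be on the same host as the database file.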

If anyone has any input and/or recommendations, please feel free to chime in.

kiwiz commented 7 years ago

You can disable worker execution via the admin interface at /admin while you debug this.

kiwiz commented 7 years ago

Do you know how many alerts you're getting per minute? If it's high, it's possible that sqlite can't handle the volume.

ogkpmg commented 7 years ago

Hello Kai,

It looks like there is a log file that Microsoft puts on all their Linux VMs called waagent.log,

which they use for their own alerting tools within the Azure dashboard, but it is very limited.

Unfortunately, it is generating tons of log entries every second.

Do you know of any way to configure SQLite to queue, store-then-forward logs, etc.?
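As far as I know SQLite has no built-in queue, so a queue would have to live in the application. One generic pattern is a single writer that drains an in-memory queue and inserts in batches, so only one connection ever takes the write lock. The Python sketch below illustrates that pattern only. It is not a 411 configuration option, and the table `log_lines`, the function `writer_loop`, and the batch size are all made up for the example.

```python
import queue
import sqlite3
import threading  # used by the caller to run writer_loop in a thread

_STOP = object()  # sentinel that tells the writer to finish


def writer_loop(db_path, q, batch_size=100):
    """Drain q and insert its items in batches from a single connection."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS log_lines (line TEXT)")
    running = True
    while running:
        item = q.get()  # block until at least one item arrives
        batch = []
        while True:
            if item is _STOP:
                running = False
                break
            batch.append((item,))
            if len(batch) >= batch_size:
                break
            try:
                item = q.get_nowait()  # take whatever else is already queued
            except queue.Empty:
                break
        if batch:
            with conn:  # one transaction (one lock acquisition) per batch
                conn.executemany(
                    "INSERT INTO log_lines (line) VALUES (?)", batch
                )
    conn.close()
```

Producers call `q.put(line)`, the writer runs in `threading.Thread(target=writer_loop, args=(path, q))`, and `q.put(_STOP)` shuts it down. Anything still in the in-memory queue is lost if the process dies, so this is store-then-forward only in a loose sense.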

ogkpmg commented 7 years ago

Also,

I did as you recommended and made my way to http://411instanceurlname/admin

and disabled worker.php. That has improved the website's response time, and it is no longer using the data.db file.

thanks for the tip.

kiwiz commented 7 years ago

SQLite is just used for 411's database, not for any log management. If you're turning all of those logs into 411 alerts, that's probably the problem.

ogkpmg commented 7 years ago

Understood that SQLite is used only for 411's own database, but those logs may or may not contain items that are deemed alert-worthy to report on... Is there a way to configure it to keep receiving this information without locking the db file?

kiwiz commented 7 years ago

I don't think so. SQLite has to lock the db to write the job information 411 uses. To resolve the backlog, you could try running a single instance of worker.php manually. I'd recommend moving over to MySQL if that's a possibility.
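One way to make sure only a single worker runs at a time is to wrap the invocation in an exclusive file lock. The Python sketch below shows the idea on Linux using `fcntl.flock`. The wrapper `run_exclusively`, the lock path, and the command passed to it are illustrative assumptions, not part of 411, and how worker.php should actually be invoked is not covered here.

```python
import fcntl
import subprocess


def run_exclusively(lock_path, command):
    """Run command unless another holder has the lock.

    Returns the command's exit code, or None if the lock was taken.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fail at once if someone holds it.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        # The lock is released when lock_file is closed at the end of the block.
        return subprocess.run(command).returncode
```

A call might look like `run_exclusively("/tmp/411-worker.lock", ["php", "worker.php"])`, where both the lock path and the command are hypothetical.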

kiwiz commented 7 years ago

Is there anything else I can do to help? Will close this ticket if it's no longer an issue.

ogkpmg commented 7 years ago

Hello Kai,

I tried exporting the DB to a MySQL server and ran into errors during both the export and the import.

I was thinking of cutting my losses and creating a new MySQL DB from scratch.

I have not had time to create the new DB yet.