Particular / ServiceControl

Backend for ServiceInsight and ServicePulse
https://docs.particular.net/servicecontrol/

Exhausts the disk space and then fails to start #701

Closed dsaf closed 3 years ago

dsaf commented 8 years ago

Scenario:

1) Constrained environment with little disk space available
2) Sudden, very big spike in error messages (thousands)
3) RavenDB database explodes to gigabytes, using up all available disk space
4) ServiceControl fails to start due to lack of available disk space
5) I cannot use ServicePulse to free up disk space by getting rid of errors because ServiceControl won't start
6) Dead end.

Could it instead process messages gradually? For example, take only the first 1000 and say: "more messages available in error queue, please archive/replay current ones to load the next batch".

Thank you.
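
For illustration only, the gradual loading suggested above could look roughly like the sketch below. This is not how ServiceControl works; `BatchedErrorIngestor`, its methods, and the in-memory deque standing in for the error queue are all hypothetical names invented for this example.

```python
from collections import deque


class BatchedErrorIngestor:
    """Hypothetical sketch: load at most `batch_size` failed messages at a time."""

    def __init__(self, error_queue, batch_size=1000):
        # The deque stands in for the transport's error queue (assumption).
        self.error_queue = deque(error_queue)
        self.batch_size = batch_size
        # The list stands in for messages already stored in the database.
        self.loaded = []

    def load_next_batch(self):
        """Load another batch only once the current one has been handled."""
        if self.loaded:
            return f"{len(self.loaded)} messages still loaded; archive or replay them first"
        while self.error_queue and len(self.loaded) < self.batch_size:
            self.loaded.append(self.error_queue.popleft())
        if self.error_queue:
            return "more messages available in error queue"
        return "all messages loaded"

    def resolve_all(self):
        """Simulate the operator archiving or replaying the loaded batch."""
        self.loaded.clear()
```

With 2500 queued messages and a batch size of 1000, the database would never hold more than 1000 failed messages at once; the rest stay in the queue until the operator clears the current batch.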

dsaf commented 8 years ago

Can I clean the RavenDB manually?

indualagarsamy commented 8 years ago

@dsaf - what version of ServiceControl are you currently on?

dsaf commented 8 years ago

@indualagarsamy I am using a stable build, ServiceControl v1.8.3.

Are there release notes on subsequent builds highlighting a change that fixed this? Thank you.

indualagarsamy commented 8 years ago

@dsaf - Yes. There have been several fixes, especially in the 1.10.0 and 1.11.0 releases. I highly encourage you to upgrade your current version of SC. The release notes for all our releases, with details on the bug fixes, are here: https://github.com/Particular/ServiceControl/releases/

Also, FYI, we are currently working on https://github.com/Particular/ServiceControl/pull/693.

dsaf commented 8 years ago

@indualagarsamy I understand, thank you very much. I will try upgrading and update the issue accordingly.

indualagarsamy commented 8 years ago

@dsaf - Thank you! Looking forward to hearing back from you after your upgrade. Thanks for letting us know and keeping us in the loop. Much appreciated.

johnsimons commented 8 years ago

@dsaf Just for your info, currently ServiceControl does not delete error messages, so when error messages are archived they will remain in its internal database and still occupy space. We are currently working on fixing this issue and the next release will address it.

Unfortunately, ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion of messages before it is too late, @Particular/servicecontrol-maintainers thoughts ?

gbiellem commented 8 years ago

ServiceControl does not monitor low disk usage, maybe this is something we need to monitor

@johnsimons :+1: - stopping ingestion on low disk space is a perfect example of where the circuit breaker/Hystrix change you spiked would come in

mauroservienti commented 8 years ago

ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion

Would be awesome. The thing I'd love (and I'm pretty sure @ramonsmits would fall in love with it as well) is the ability to stop ingesting in a selective manner without stopping the bus in its entirety. For example, being able to pause the error satellite, or the audit one, while leaving the "main" bus up and running so that it can still process heartbeats, send out messages, and issue retries.
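
A minimal sketch of what selective pausing might look like, assuming each ingestion pipeline can be switched off independently. `Satellite` and `Host` are hypothetical stand-ins written for this example; they are not ServiceControl or NServiceBus types.

```python
class Satellite:
    """Hypothetical stand-in for one ingestion pipeline (error or audit)."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.ingested = []

    def handle(self, message):
        """Ingest a message, or refuse it (leaving it queued) while paused."""
        if not self.running:
            return False
        self.ingested.append(message)
        return True


class Host:
    """Hypothetical host: satellites can be paused without stopping the main bus."""

    def __init__(self):
        self.satellites = {"error": Satellite("error"), "audit": Satellite("audit")}
        self.heartbeats = []

    def pause(self, name):
        self.satellites[name].running = False

    def resume(self, name):
        self.satellites[name].running = True

    def deliver(self, name, message):
        """Route a message to the named satellite; returns whether it was ingested."""
        return self.satellites[name].handle(message)

    def heartbeat(self, endpoint):
        """The main bus keeps handling heartbeats regardless of satellite state."""
        self.heartbeats.append(endpoint)
```

Pausing `"audit"` here leaves error ingestion and heartbeat handling untouched, which is the separation described above.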

mikeminutillo commented 8 years ago

Unfortunately, ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion of messages before it is too late,

I don't know if that's really our concern, is it? If we run out of HDD space, won't message processing fail when we try to persist the message, and eventually the circuit breaker will blow and bring down SC?

The thing I'd love (and I'm pretty sure @ramonsmits would fall in love with it as well) is the ability to stop ingesting in a selective manner without stopping the bus in its entirety

We could probably do that pretty simply. When we implemented the updated retries we added a start/stop satellite.

gbiellem commented 8 years ago

I don't know if that's really our concern, is it?

@mikeminutillo I think it is when the source of the out of disk space is ServiceControl and the result is the monitoring dies without warning. I believe a more graceful approach than letting the circuitbreaker fire and kill off SC is needed.

ramonsmits commented 8 years ago

@gbiellem @mikeminutillo This is a slippery slope. In my opinion disk storage should be monitored by monitoring tools like SCOM, SolarWinds, NewRelic, Firescope, etc.

I do think it's weird that SC does not start at ALL due to the circuit breaker. The API should not be down because the DB is full. That is the issue that needs to be addressed.

Pause/resume of ingestion when there are errors makes perfect sense here to keep the API available.

A second issue is that, from an operational management perspective, you would preallocate disk storage when running, for example, a SQL Server. The whole issue here is that RavenDB is embedded and all such configuration is hidden.

pablocastilla commented 8 years ago

Hi!

We run a very large environment with NSB. Instead of the current option to clean up audit data by age in days, I would like one based on disk space. I have a 100 GB SSD for SC, and sometimes it fills up and everything gets stuck. I would like an option to tell NSB that it has 99.9 GB and must manage that space: always keep error messages, and delete audit data so that the disk does not fill up and audit messages keep being processed.

Just my two cents.

johnsimons commented 8 years ago

@pablocastilla interesting request, so we would calculate the disk space occupied by audited messages (I actually think Raven has a way of calculating disk space based on a document collection). But what if the bottleneck is the error messages, so that the thing filling the disk is errors?

pablocastilla commented 8 years ago

Well, in that case it can stop collecting and give an option for deleting error messages.

The problem is that if it stops collecting audit messages from the machines, the business MSMQ queues fill up because the endpoints can't send audit messages to SC.

mauroservienti commented 8 years ago

When using MSMQ it gets tricky pretty quickly, IMO, even though I agree with, and like, @pablocastilla's suggestion.

When SC reaches the space limit it is configured for, we could stop collecting, start deleting old messages, or both. The tricky part of stopping collection is that, regardless of the transport used, the audit queues start to fill up until the transport stops working. On MSMQ this can cause quota errors on the receiver side, quota errors on the sender side, and dead letter queues filling up. On broker-based transports this can cause the entire broker to stop working, which is a much worse scenario that can lead to message loss.

johnsimons commented 8 years ago

@pablocastilla regarding

in that case it can stop collecting and give an option for deleting error messages

How do we convey this to the user ? Remember ServiceControl has no UI and we don't know if users have ServicePulse installed.

johnsimons commented 8 years ago

There is another option: we could keep collecting audits but not save them. In essence we would just dequeue each message (forwarding it to the forwarding queue if one is configured) and then ignore it. This is not as bad as it sounds; remember that audits are only used for stats, so missing them is not a big issue. We actually recommend turning audits off if a user is not taking advantage of them.
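
The dequeue, forward, then ignore flow can be sketched as below. This is only an illustration under simple assumptions: `ingest_audit` is a hypothetical function, and plain lists stand in for the database and the forwarding queue.

```python
def ingest_audit(message, store_has_room, stored, forwarding_queue=None):
    """Hypothetical sketch: always dequeue and forward, store only if there is room.

    `stored` stands in for the database and `forwarding_queue` for the
    configured forwarding queue (None means forwarding is not configured).
    """
    # Forward first, so downstream consumers still see every audit message.
    if forwarding_queue is not None:
        forwarding_queue.append(message)
    if store_has_room:
        stored.append(message)
        return "stored"
    # The message has left the audit queue but is deliberately not persisted.
    return "discarded"
```

Because the message is always removed from the audit queue, the queue does not back up on the transport even while nothing is being saved.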

pablocastilla commented 8 years ago

Why not delete the old audit messages? I would prefer to lose the older ones and keep the largest possible window of recent audits :). Anyway, that would be better than it is now.

johnsimons commented 8 years ago

Why not delete the old audit messages?

@pablocastilla yes, we would do that, but eventually the errors alone could push usage over the maximum disk space, and in that situation we would stop the insertion of audits.
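
To make the policy discussed here concrete, the following is a rough sketch assuming a fixed byte budget: the oldest audits are evicted first, errors are never evicted, and audits are rejected once errors alone leave no room. `SpaceBudgetStore` and its methods are hypothetical, and sizes are tracked as plain integers rather than measured from a real database.

```python
from collections import deque


class SpaceBudgetStore:
    """Hypothetical sketch of space-based retention for audits and errors."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.errors = []       # sizes of error messages; never evicted here
        self.audits = deque()  # sizes of audit messages, oldest first
        self.used_bytes = 0

    def add_error(self, size):
        """Always store the error, then evict old audits to make room."""
        self.errors.append(size)
        self.used_bytes += size
        self._evict_audits()

    def add_audit(self, size):
        """Store an audit if errors leave room for it; return whether it was stored."""
        if sum(self.errors) + size > self.budget_bytes:
            return False  # errors have used up the budget: stop inserting audits
        self.audits.append(size)
        self.used_bytes += size
        self._evict_audits()
        return True

    def _evict_audits(self):
        """Drop the oldest audits until usage is back within the budget."""
        while self.used_bytes > self.budget_bytes and self.audits:
            self.used_bytes -= self.audits.popleft()
```

Note that in this sketch errors can still exceed the budget on their own, since nothing deletes them; that case would need an alert or operator action.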

pablocastilla commented 8 years ago

For me that would be OK; we monitor that folder with Nagios, so we would notice. I just want to minimize the damage before fixing.

Fixing could be deleting or replaying the errors.

joel100 commented 5 years ago

@dsaf Just for your info, currently ServiceControl does not delete error messages, so when error messages are archived they will remain in its internal database and still occupy space. We are currently working on fixing this issue and the next release will address it.

Unfortunately, ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion of messages before it is too late, @Particular/servicecontrol-maintainers thoughts ?

Can you confirm whether the error message deletion issue was fixed at some point between this comment and the latest release?

WilliamBZA commented 5 years ago

@joel100 I'm not sure what you're asking? ServiceControl never automatically deletes error messages unless they are archived - these messages could contain important business data and should never be automatically deleted.

kbaley commented 3 years ago

Relabeled as an improvement rather than a bug based on the discussion. There are options presented to make the experience nicer once disk space starts getting low.

SzymonPobiega commented 3 years ago

There is now a custom check that verifies there is enough space for SC to run and alerts if free space is shrinking.

ramonsmits commented 3 years ago

@SzymonPobiega I'm not sure this should be closed. You state that a notification is sufficient. I would think that SC should stop ingestion once it reaches a certain limit to prevent corruption.
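
As a sketch of the "stop ingestion once it reaches a certain limit" idea, the gate below pauses ingestion when free space drops below one threshold and resumes only after it recovers above a higher one, so it does not flap around a single limit. `DiskSpaceGate` and its parameters are hypothetical and do not describe ServiceControl's custom check; only `shutil.disk_usage` is a real standard-library call.

```python
import shutil


class DiskSpaceGate:
    """Hypothetical sketch: pause ingestion on low disk space, with hysteresis."""

    def __init__(self, path, stop_below_bytes, resume_above_bytes, free_bytes=None):
        if resume_above_bytes < stop_below_bytes:
            raise ValueError("resume threshold must not be below the stop threshold")
        self.stop_below_bytes = stop_below_bytes
        self.resume_above_bytes = resume_above_bytes
        # By default, read real free space; a callable can be injected for testing.
        self._free_bytes = free_bytes or (lambda: shutil.disk_usage(path).free)
        self.ingesting = True

    def check(self):
        """Re-read free space and return whether ingestion should be running."""
        free = self._free_bytes()
        if self.ingesting and free < self.stop_below_bytes:
            self.ingesting = False
        elif not self.ingesting and free > self.resume_above_bytes:
            self.ingesting = True
        return self.ingesting
```

A caller would run `check()` periodically and pause or resume the ingestion pipelines based on the result, while keeping the API available.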