Closed (dsaf closed this issue 3 years ago)
Can I clean the RavenDB manually?
@dsaf - what version of ServiceControl are you currently on?
@indualagarsamy I am using a stable build ServiceControl V 1.8.3.
Are there release notes on subsequent builds highlighting a change that fixed this? Thank you.
@dsaf - Yes. There have been several fixes, especially in the 1.10.0 and 1.11.0 releases. I highly encourage you to upgrade your current version of SC. The release notes for all our releases, with details on the bug fixes, are here: https://github.com/Particular/ServiceControl/releases/
Also, FYI, we are currently working on https://github.com/Particular/ServiceControl/pull/693.
@indualagarsamy I understand, thank you very much. I will try upgrading and update the issues accordingly.
@dsaf - Thank you! Looking forward to hearing back from you after your upgrade. Thanks for letting us know and keeping us in the loop. Much appreciated.
@dsaf Just for your info, currently ServiceControl does not delete error messages, so when error messages are archived they will remain in its internal database and still occupy space. We are currently working on fixing this issue and the next release will address it.
Unfortunately, ServiceControl does not monitor low disk usage. Maybe this is something we need to monitor and then stop the ingestion of messages before it is too late. @Particular/servicecontrol-maintainers, thoughts?
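To make the "monitor free space, then stop ingestion" idea concrete, here is a minimal sketch in Python. It is not ServiceControl code: `IngestionGate`, both thresholds, and `free_bytes_on` are hypothetical names chosen for illustration. It uses two thresholds so ingestion does not flap on and off when free space hovers around a single limit.

```python
import shutil
from dataclasses import dataclass


@dataclass
class IngestionGate:
    """Pauses ingestion when free space is low, resumes once it recovers."""
    pause_below_bytes: int    # hypothetical low-water mark
    resume_above_bytes: int   # hypothetical high-water mark (hysteresis)
    paused: bool = False

    def update(self, free_bytes: int) -> bool:
        """Record a free-space reading; return True if ingestion may run."""
        if free_bytes < self.pause_below_bytes:
            self.paused = True
        elif free_bytes > self.resume_above_bytes:
            self.paused = False
        # Between the two thresholds the previous state is kept.
        return not self.paused


def free_bytes_on(path: str) -> int:
    """Free space on the volume holding `path` (e.g. the database folder)."""
    return shutil.disk_usage(path).free


GIB = 1024 ** 3
gate = IngestionGate(pause_below_bytes=1 * GIB, resume_above_bytes=2 * GIB)

# Simulated readings: plenty, too low, partly recovered, fully recovered.
readings = [5 * GIB, GIB // 2, GIB + GIB // 2, 3 * GIB]
states = [gate.update(r) for r in readings]
```

In a real service the readings would come from something like `free_bytes_on(database_path)` on a timer, and only the ingestion part would be paused so the API stays reachable.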
ServiceControl does not monitor low disk usage, maybe this is something we need to monitor
@johnsimons :+1: - stopping ingestion on low disk space is a perfect example of where the circuit breaker/Hystrix change you spiked would come in
ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion
would be awesome. The thing I'd love (and I'm pretty sure @ramonsmits would fall in love with as well) is the ability to stop ingesting selectively without stopping the bus in its entirety: for example, being able to pause the error satellite, or the audit one, while leaving the "main" bus up and running so it can still process heartbeats, send out messages, or issue retries.
Unfortunately, ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion of messages before it is too late,
I don't know if that's really our concern, is it? If we run out of HDD, won't the message processing fail when we try to persist it, and eventually the circuit breaker will blow and bring down SC?
The thing I'd love (and I'm pretty sure @ramonsmits would fall in love as well with) is the ability to stop ingesting in a selective manner without stopping the bus in it's entirety
We could probably do that pretty simply. When we implemented the updated retries we added a start/stop satellite.
I don't know if that's really our concern is it?
@mikeminutillo I think it is when the source of the out-of-disk-space condition is ServiceControl and the result is that the monitoring dies without warning. I believe a more graceful approach than letting the circuit breaker fire and kill off SC is needed.
@gbiellem @mikeminutillo This is a slippery slope. In my opinion disk storage should be monitored by monitoring tools like SCOM, SolarWinds, NewRelic, Firescope, etc.
I do think it's weird that SC does not start at ALL due to the circuit breaker. The API should not be down because the DB is full. That is the issue that needs to be addressed.
Pause/resume of ingestion when there are errors makes perfect sense here to keep the API available.
A second issue is that, from an operational management perspective, you would preallocate disk storage when running, for example, a SQL Server. The whole issue here is that RavenDB is embedded and all such configuration is hidden.
Hi!
We run a very large NSB environment. Instead of the current option to clean up audits after a number of days, I would like one based on disk space. I have a 100 GB SSD for SC, and sometimes it fills up and everything gets stuck. I would like to be able to tell NSB that it has 99.9 GB and must manage it: always keep error messages, and delete audit messages so the disk does not fill and audit processing can continue.
Just my two cents.
@pablocastilla interesting request, so we would calculate the disk space occupied by audited messages (I actually think Raven has a way of calculating disk space based on a document collection). What if the bottleneck is the error messages, so that the thing filling the disk is errors?
Well, in that case it can stop collecting and give an option for deleting error messages.
The problem is that if it stops collecting audits from the machines, the business MSMQ queues fill up because the endpoints can't send audit messages to SC.
When using MSMQ it gets tricky pretty quickly, IMO, even though I agree with, and like, @pablocastilla's suggestion.
When SC reaches the space limit it is configured for, we could stop collecting, start deleting old messages, or both. The tricky part of stopping collection is that, regardless of the transport used, the audit Qs start to fill up until the transport stops working. On MSMQ this can cause quota errors on the receiver side, quota errors on the sender side, and dead letter Qs filling up. On broker-based transports this can cause the entire broker to stop working, which is a much worse scenario that can lead to message loss.
@pablocastilla regarding
in that case it can stop collecting and give an option for deleting error messages
How do we convey this to the user? Remember, ServiceControl has no UI and we don't know if users have ServicePulse installed.
There is another option: we could keep collecting audits but not save them, so in essence we would just dequeue each message (forwarding it to the forwarding queue if one is configured) and then ignore it. This is not as bad as it sounds. Remember, audits are only used for stats, so missing these is not a big issue. We actually recommend turning audits off if a user is not taking advantage of them.
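A rough sketch of the dequeue, forward, and ignore option described above, assuming an in-memory queue for illustration. `drain_audit_queue`, `store`, `forward`, and `storage_available` are hypothetical names, not ServiceControl APIs.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple


def drain_audit_queue(
    audit_queue: Deque[str],
    store: Callable[[str], None],
    forward: Optional[Callable[[str], None]],
    storage_available: Callable[[], bool],
) -> Tuple[int, int]:
    """Always dequeue (and forward, if configured); only store when there is room.

    Returns (stored_count, discarded_count).
    """
    stored = discarded = 0
    while audit_queue:
        message = audit_queue.popleft()   # the queue never backs up
        if forward is not None:
            forward(message)              # forwarding still happens
        if storage_available():
            store(message)
            stored += 1
        else:
            discarded += 1                # dropped: audit stats will miss it
    return stored, discarded


saved: List[str] = []
forwarded: List[str] = []
queue: Deque[str] = deque(["a1", "a2", "a3"])

# Pretend the database only has room for two more audit messages.
result = drain_audit_queue(
    queue,
    store=saved.append,
    forward=forwarded.append,
    storage_available=lambda: len(saved) < 2,
)
```

The point of the design is that the audit queue keeps draining, so senders and the transport are not affected even while nothing is being persisted.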
Why not delete the old audit messages? I would prefer to lose the older ones and keep the largest possible window of recent audits :). Anyway, that would be better than it is now.
why not deleting the old audit messages?
@pablocastilla yes, we would do that, but eventually the errors alone could push usage over the max disk space, and in that situation we would stop the insertion of audits
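The policy discussed here (evict the oldest audits first, never touch errors, and stop inserting audits if errors alone exceed the budget) could look roughly like the sketch below. `StoredMessage` and `enforce_budget` are hypothetical, and sizes are plain integers rather than anything RavenDB reports.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StoredMessage:
    id: str
    kind: str          # "audit" or "error"
    size_bytes: int
    received_at: int   # any increasing timestamp


def enforce_budget(
    messages: List[StoredMessage], budget_bytes: int
) -> Tuple[List[StoredMessage], bool]:
    """Return (messages to keep, whether new audits may still be inserted)."""
    total = sum(m.size_bytes for m in messages)
    audits_oldest_first = sorted(
        (m for m in messages if m.kind == "audit"),
        key=lambda m: m.received_at,
    )
    evicted = set()
    for audit in audits_oldest_first:
        if total <= budget_bytes:
            break
        evicted.add(audit.id)      # errors are never evicted
        total -= audit.size_bytes
    kept = [m for m in messages if m.id not in evicted]
    # Still over budget means errors alone exceed it: stop inserting audits.
    return kept, total <= budget_bytes


mixed = [
    StoredMessage("a1", "audit", 40, received_at=1),
    StoredMessage("e1", "error", 50, received_at=2),
    StoredMessage("a2", "audit", 30, received_at=3),
]
kept, accept_audits = enforce_budget(mixed, budget_bytes=100)

errors_only = [
    StoredMessage("e1", "error", 70, received_at=1),
    StoredMessage("e2", "error", 50, received_at=2),
]
kept_errors, accept_after_errors = enforce_budget(errors_only, budget_bytes=100)
```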
For me that would be OK. We monitor that folder with Nagios, so we would notice. I just want to minimize the damage before fixing.
Fixing could be deleting or replaying the errors.
@dsaf Just for your info, currently ServiceControl does not delete error messages, so when error messages are archived they will remain in its internal database and still occupy space. We are currently working on fixing this issue and the next release will address it.
Unfortunately, ServiceControl does not monitor low disk usage, maybe this is something we need to monitor and then stop the ingestion of messages before it is too late, @Particular/servicecontrol-maintainers thoughts ?
Can you confirm whether the error message deletion issue was fixed sometime between this comment and the latest release?
@joel100 I'm not sure what you're asking? ServiceControl never automatically deletes error messages unless they are archived - these messages could contain important business data and should never be automatically deleted.
Relabeled as an improvement rather than a bug based on the discussion. There are options presented to make the experience nicer once disk space starts getting low.
There is now a custom check that verifies there is enough space for SC to run and alerts if free space is shrinking.
@SzymonPobiega I'm not sure this should be closed. You state that a notification is sufficient. I would think that SC should stop ingestion once it reaches a certain limit to prevent corruption.
Scenario:
1) Constrained environment with little disk space available
2) Sudden, very big spike in error messages (thousands)
3) RavenDB database explodes to gigabytes, using up all available disk space
4) ServiceControl fails to start due to lack of available disk space
5) I cannot use ServicePulse to free up disk space by getting rid of errors because ServiceControl won't start
6) Dead end.
Could it instead process messages gradually? For example, take only the first 1000 and say: "more messages available in error queue, please archive/replay current ones to load the next batch".
Thank you.
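For illustration only, the gradual processing suggested above might be sketched like this. `take_batch` is a hypothetical helper, the queue is an in-memory stand-in for the error queue, and 1000 is just the batch size from the comment.

```python
from collections import deque
from typing import Deque, List, Tuple


def take_batch(error_queue: Deque[str], batch_size: int = 1000) -> Tuple[List[str], int]:
    """Ingest at most `batch_size` messages; report how many are still waiting."""
    count = min(batch_size, len(error_queue))
    batch = [error_queue.popleft() for _ in range(count)]
    return batch, len(error_queue)


error_queue: Deque[str] = deque(f"msg-{i}" for i in range(2500))
batch, remaining = take_batch(error_queue)

if remaining:
    notice = (
        f"{remaining} more messages available in error queue, "
        "please archive/replay current ones to load the next batch"
    )
else:
    notice = ""
```

The rest of the messages stay in the queue rather than in the database, so a spike would grow the queue instead of the embedded store.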