cortexproject / cortex

A horizontally scalable, highly available, multi-tenant, long term Prometheus.
https://cortexmetrics.io/
Apache License 2.0

Blocks storage unable to ingest samples older than 1h after an outage #2366

Open pracucci opened 4 years ago

pracucci commented 4 years ago

TSDB doesn't allow appending samples whose timestamp is older than the last block cut from the head. Given that a block is cut from the head up until 50% of the block range before the max timestamp within the head, and given the default block range period is 2h, the blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.
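A concrete timeline may help (assuming the default 2h block range, with illustrative clock times):

    head covers [10:00, 13:10]  -> spans more than 1.5x the block range, so a block is cut
    block covers [10:00, 12:00) -> aligned to the 2h block range boundary
    head now covers [12:00, 13:10]
    min accepted timestamp = max(12:00, 13:10 - 1h) = 12:10
    -> samples more than ~1h older than the newest sample in the head are rejected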

Let's consider this scenario:

We recently had an outage in our staging environment which triggered this condition and we should find a way to solve it.

@bwplotka You may be interested, given I think this issue affects Thanos receive too.

pracucci commented 4 years ago

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?

pstibrany commented 4 years ago

Noting other possible options (not necessarily good options):

bwplotka commented 4 years ago

blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does this?

bwplotka commented 4 years ago

Also, clock skew can cause it.

pstibrany commented 4 years ago

blocks storage doesn't allow appending a sample whose timestamp is more than 1h older than the most recent timestamp in the head.

Really? Can we find the code path in Prometheus which does this?

A new block is cut from the head when the head (in-memory data) covers more than 1.5x the block range. For a 2-hour block range, this means the head needs to have 3h of data before a block is cut. The block "start time" is always the minimum sample time in the head, while the block "end time" is aligned on a block range boundary. A new block always covers a single "block range" period. Data stored into a block is then removed from the head. That means that after cutting the block, the head will still have at least 1h of data, but possibly even more.

When writing new data via the appender, the minimum time limit is computed as max(minT, maxT - 0.5 * block range), where minT and maxT are the minimum/maximum sample times in the head. The limit is then enforced in the Add and AddFast methods.

When the head covers less than 0.5x the block range (<1h for a 2h block range), samples cannot be older than the min time in the head. When the head covers more than 0.5x the block range (>1h for a 2h block range), samples cannot be older than half the block range before the max time in the head.
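A minimal Go sketch of the two rules described above (illustrative only — the names do not match the actual Prometheus TSDB identifiers; timestamps are in milliseconds as in TSDB):

    package main

    import "fmt"

    const blockRangeMs = int64(2 * 60 * 60 * 1000) // default 2h block range

    // compactable mirrors the "head covers more than 1.5x the block range" rule
    // that triggers cutting a new block from the head.
    func compactable(headMinT, headMaxT int64) bool {
        return headMaxT-headMinT > blockRangeMs*3/2
    }

    // minValidTime mirrors max(minT, maxT - 0.5*blockRange): the oldest
    // timestamp the appender still accepts.
    func minValidTime(headMinT, headMaxT int64) int64 {
        if t := headMaxT - blockRangeMs/2; t > headMinT {
            return t
        }
        return headMinT
    }

    func main() {
        hour := int64(60 * 60 * 1000)
        minT, maxT := int64(0), 3*hour+10*60*1000 // head spanning 3h10m
        fmt.Println(compactable(minT, maxT))  // true: a block will be cut
        fmt.Println(minValidTime(minT, maxT)) // maxT - 1h: older samples are "out of bounds"
    }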

bwplotka commented 4 years ago

Thanks for the explanation. I think we are getting into the backfilling area.

I think opening a side block and using vertical compaction would solve it.

bwplotka commented 4 years ago

We could start a Prometheus discussion for it as well, to enable that behavior when vertical compaction is enabled for TSDB itself.

pracucci commented 4 years ago

I think we are getting into the backfilling area.

It depends. How far back is considered backfilling? From the Cortex (or Thanos receive) perspective, I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

brancz commented 4 years ago

Maybe it means that remote write is not good enough. In this scenario you only have up to 2 hours before that WAL data is cut into a block and is no longer around. Maybe we need to discuss back-filling with blocks as a remote write extension. Just thinking out loud. (I think I briefly discussed this with @csmarchbanks once before)

bwplotka commented 4 years ago

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

pracucci commented 4 years ago

I think the issue we're describing here is not considered backfilling. At the current stage, we can't tolerate an outage longer than about 1h or we'll lose data.

Where is the boundary?

From the pure UX side, if it's about a Prometheus server catching up after an outage then I wouldn't consider it backfilling.

bwplotka commented 4 years ago

I guess the boundary is then 1.5x block size (once we exceed WAL (3h))?

bwplotka commented 4 years ago

Actually, the user can change that, so the WAL could even cover 2 years. If someone doesn't upload anything for 2 years and suddenly wants to push a 2-year-old sample, would that still not be backfilling :thinking:?

brancz commented 4 years ago

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

bwplotka commented 4 years ago

Sounds like concretely for this, we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic.

What do you mean? When would you cut then? You don't know upfront what writes you expect to see but don't see, no?

brancz commented 4 years ago

The heuristic of allowing inserts up to 0.5x the timespan of the head block is based on the assumption that we can safely and correctly cut blocks at that time, and I'm wondering what other strategies there might be. Clearly other databases do different things, and time-based constraints are actually kind of weird in the first place. What I'm trying to say is: if we remove that requirement, then we might be able to think of ways to improve this situation (potentially combined with vertical compaction?).

csmarchbanks commented 4 years ago

Maybe we need to discuss back-filling with blocks as a remote write extension

I think I have some code sitting around somewhere that does this (I was using it to populate datasets from Prometheus into various backends that supported remote write). If there is interest, I'd be happy to dig it up again.

we need a way to cut a head block safely that doesn't have the 1.5x time requirement/heuristic

Yes, that would be great. There were some ideas around this when we were discussing how to limit Prometheus memory usage, weren't there? I remember at least something around a space-based head block.

codesome commented 4 years ago

Catching up with emails now :) looks like I missed some discussions

@codesome Given we have vertical compaction in TSDB and we can have overlapping blocks, what would be the implications of allowing samples to be written "out of bounds" in the TSDB head?

While "out of bound" in TSDB would work fine, it needs some more discussion if it has to be upstream. Also, talking w.r.t. cortex, you will have an unexpected rise in memory consumption because Head block gets bigger than expected. (Additionally, vertical queries and compactions are a tad bit more expensive in terms of CPU and Memory)

I think opening a side block and using vertical compaction would solve it.

Is this idea for upstream Prometheus or Thanos/Cortex? But anyway, do we have any requirement that data is available for querying soon after ingesting?

extend block range (would lead to higher ingester memory usage, and longer WAL replays)

With the m-mapping work that is going on, the memory usage can be taken care of. And if the partial chunks work (a follow-up of the m-map work) looks good to maintainers, that would also take care of WAL replays :). This would mean Cortex can increase its block range, but the default in upstream Prometheus would need to be changed too so that the WAL is kept around longer.

codesome commented 4 years ago

I would try as much as possible to avoid adding samples older than the head mint and bringing vertical compaction into play in upstream Prometheus, because (1) the code is already complex enough (is that a valid argument? :P), (2) if not used correctly, users will silently lose/corrupt data, and (3) there would be unexpected spikes in CPU/memory (maybe these should be expected?).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

pracucci commented 4 years ago

Also, with respect to Cortex, you will have an unexpected rise in memory consumption because the head block gets bigger than expected.

Is this true even after the m-mapping work is complete?

I think opening a side block and using vertical compaction would solve it. But anyway, do we have any requirement that data is available for querying soon after ingesting?

Yes, we should be able to immediately query back samples pushed to the ingesters, like it works for the chunks storage (this issue affects only the blocks storage).

If this could be an optional flag (just like for overlapping data), we can forget about points 2 and 3.

I was thinking about an optional flag (like overlapping data).

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pracucci commented 4 years ago

This is still valid. @codesome has some ideas he wants to experiment with.

bwplotka commented 4 years ago

Can you share those ideas Ganesh?

codesome commented 4 years ago

Can you share those ideas Ganesh?

It was one of my TODOs for today :) I will share it with you all once I have written it down.

codesome commented 4 years ago

This is an idea off the top of my head; it needs more thought.

The root problem

These 2 checks, this and this

Solution

Enable vertical compaction and the vertical querier, and make that check optional: upstream will keep the check enabled, whereas Cortex and Thanos can turn it off.

Why does this work?

Because we are not discarding the out-of-bound sample; it would just get added to its series. Out-of-order samples within a series are still discarded.

If a series lagging behind in time causes overlap with data on disk, the vertical querier will take care of deduping.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Any gotchas? Yes

pracucci commented 4 years ago

Thanks @codesome for sharing your idea!

As you stated, a downside of this approach is that you may end up with a single large block, which will impact compaction. This is a solvable problem: we could replace the TSDB compactor planner with a custom one (i.e. exclude those non-aligned blocks from compaction, or compact them together without impacting correctly-aligned blocks).
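A rough sketch of the kind of planner filtering meant here (hypothetical types and names, not the actual Cortex/Thanos planner interface):

    package main

    import "fmt"

    // blockMeta is a simplified stand-in for a block's metadata.
    type blockMeta struct {
        name       string
        minT, maxT int64 // milliseconds
    }

    // maxCompactionRangeMs stands for the largest configured compaction range.
    const maxCompactionRangeMs = int64(2 * 60 * 60 * 1000)

    // splitPlan separates correctly-aligned blocks (planned as usual) from
    // oversized, non-aligned blocks (to be compacted separately, or skipped).
    func splitPlan(metas []blockMeta) (regular, oversized []blockMeta) {
        for _, m := range metas {
            if m.maxT-m.minT > maxCompactionRangeMs {
                oversized = append(oversized, m)
                continue
            }
            regular = append(regular, m)
        }
        return regular, oversized
    }

    func main() {
        hour := int64(60 * 60 * 1000)
        metas := []blockMeta{
            {"01A", 0, 2 * hour},         // normal 2h block
            {"01B", -6 * hour, 2 * hour}, // block produced after ingesting old samples
        }
        regular, oversized := splitPlan(metas)
        fmt.Println(len(regular), len(oversized)) // 1 1
    }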

The longer the time range of such blocks, the more problematic they are at query time, so, as you stated, we may add a limit on the oldest timestamp we allow to ingest (now - threshold). This would practically make the system unsuitable for backfilling purposes (because the threshold would be in terms of hours, not days/months) but may solve the problem described in this issue.

After compaction of head, the vertical blocks are given top priority and they will get merged.

Ideally we don't want any vertical compaction to occur in the ingesters; vertical compaction will be done by the Cortex compactor later on. What do you think?

bwplotka commented 4 years ago

Sorry, I am missing what the solution is here...

Enable vertical compaction and the vertical querier, and make that check optional: upstream will keep the check enabled, whereas Cortex and Thanos can turn it off.

Can we elaborate?

Is this essentially what we proposed for backfilling? Start another TSDB for each out-of-band request (and keep it for some time until it is cut)? That would be quite neat. I don't get how it can produce large blocks - it's exactly the same data as you would have in total with all out-of-band data included :thinking:

codesome commented 4 years ago

Can we elaborate?

Here the "check" is checking of min valid time for the samples - where we discard samples before 1h. When I say make that check optional, we remove that particular check in Cortex/Thanos and allow any timestamp as long as it is not out-of-order within the same series. Later in the comment, I suggested having configurable min valid time instead of removing the check completely.

So it is not the same as what was proposed for backfilling. In my solution, there is no new TSDB running on the side; it's the main TSDB which will allow samples back in time. Now, with this in mind, you can read my above comment https://github.com/cortexproject/cortex/issues/2366#issuecomment-639505032 again and hopefully it will be clear this time :)

This would practically make the system unsuitable for backfilling purposes

We could have another route to do the backfilling, rather than going via the main ingest path. That would be the thing Bartek is mentioning above - start a TSDB on the side specifically to take care of backfilling. My solution only addresses ingesting old samples which might be lagging because of an outage at Cortex or Prometheus remote-write falling behind.

I don't get how it can produce large blocks

So in the backfilling work happening in Prometheus, we are making sure that the new blocks don't cause an overlap between 2 blocks. An example reference is here for the new blocks created. But in the case of the proposed solution, if we accept any timestamp, the head block could have a mint which is less than the mint of the oldest block. Hence, after head compaction, the block will overlap all the blocks on disk and cause a vertical compaction with all of them, ending up with a single big block.

codesome commented 4 years ago

Starting another TSDB on the side for non-backfilling purposes adds a lot of complexity to the normal ingestion path. So I suggest we have a separate API for backfilling which will explicitly start another TSDB and merge it with the main TSDB later. And for lagging samples on the usual ingestion path, I suggest the above solution. WDYT?

pracucci commented 4 years ago

I agree with doing an experiment on @codesome's proposal. I also agree that opening parallel TSDBs adds complexity which I would like to avoid if possible. I think we're fine being able to ingest samples only up to X hours old for this use case; if the solution doesn't come with excessive complexity, then it would significantly relax the issue with low effort.

As I mentioned previously, we'll probably have to make some changes to the compactor planner to get these larger blocks compacted together (i.e. if a block time range is over the max compaction time range), but it's something we've already experimented on and we know it's doable.

bwplotka commented 4 years ago

Thanks for this explanation, now it makes total sense. Actually, this proposal has the same effect as running a TSDB in parallel (minus the ability to backfill series already existing in the head), just without that complexity, so amazing!

I guess with some work we could make sure the same TSDB handles out-of-order samples as well (with some limits), so this sounds quite nice!

In both cases we have problems with compactions (again), yes, so we need either a custom planner or some compaction max size boundaries to be fully safe. A not-too-large min time (which you can configure) makes sense for a start.

Nice :+1: Happy with this and we will use this in Receiver as well (cc @squat, @brancz) :muscle: Thanks!

pracucci commented 4 years ago

In both cases we have problems with compactions (again), yes, so we need either a custom planner or some compaction max size boundaries to be fully safe. A not-too-large min time (which you can configure) makes sense for a start.

Agree, but we've already experimented with a custom compactor planner and it's doable. I think the logic to handle this won't be too complicated.

squat commented 4 years ago

That's a nice workaround. If we go with @codesome's modified proposal of having a configurable min valid time, then one UX thing that we'll need to document well in order not to confuse users is that you may be able to ingest things older than the threshold, e.g. when setting the threshold to X hours, you may be able to ingest data up to X+2 hours old.

codesome commented 4 years ago

Opened an issue in Prometheus with this proposal https://github.com/prometheus/prometheus/issues/7396

PS: I haven't tested it yet but it will work as far as I understand.

codesome commented 4 years ago

The above idea was discarded for now because of 1 annoying issue: https://github.com/prometheus/prometheus/issues/7396#issuecomment-648670168

codesome commented 4 years ago

(Update: this idea has been discarded. See https://github.com/cortexproject/cortex/issues/2366#issuecomment-667025973)

Here is a design doc to solve this using multiple TSDBs: https://docs.google.com/document/d/1Cw0P-GQpzY70wMkp-S5xTdCArJQv88qKMzci_rv_ZTs/edit?usp=sharing cc @bwplotka

codesome commented 4 years ago

The above proposal of having multiple TSDBs open for backfilling was discarded because of the unnecessary complexity it brings into the code and the additional memory required in the worst case (for example, if you want to backfill up to 6h-old samples, you might need up to 6-7x the memory).

With the scope of this being only to recover from a long outage on either the Prometheus side or the Cortex side, and not actually backfilling old data, here is a simplified design:

Taking an example of allowing samples up to 7h old (this will be configurable), 1h is taken care of by the main TSDB, and for the remaining 6h we maintain only 2 additional TSDBs.

Both new TSDBs will be able to ingest 6h of data (this is done by keeping their block range at 12h). The time ranges allowed in those blocks will be aligned to 6h (or the configured time).

Here is some visualization (M is the current TSDB. X and Y are the newly introduced TSDBs):

1. Initially, X alone covers the 6h window behind the main TSDB M.
                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
                   |------------|
                   |      X     |
                   |------------|

2. Time is moving; X and Y together cover that 6h window, each partially.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
            |------------|------------|
            |      X     |      Y     |
            |------------|------------|

3. The first one (X) goes beyond the configured max age, so its block is compacted and shipped.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
      |------------|------------|
      |      X     |      Y     |
      |------------|------------|

4. The TSDBs are rotated at this point, and the same thing continues.

                   <     6h     >
                   |            |---|
                   |            | M |
                   |            |---|
            |------------|------------|
            |      Y     |      X     |
            |------------|------------|

We have two of them because, while one is compacting, the other should be up. And as time moves, both together cover that 6h gap.
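A sketch of how an incoming sample could be routed between the main TSDB and the two side TSDBs under this design (all names, boundaries, and the alternation rule are illustrative, not the actual implementation):

    package main

    import (
        "fmt"
        "time"
    )

    const (
        sideWindowMs = int64(6 * 60 * 60 * 1000) // 6h window covered by each side TSDB
        maxAgeMs     = int64(7 * 60 * 60 * 1000) // 7h max age, measured against the wall clock
    )

    // route picks a destination for a sample: the main TSDB if it is recent
    // enough to be accepted there, otherwise one of the two side TSDBs, chosen
    // by which 6h-aligned window the timestamp falls into (the windows alternate
    // between X and Y as the TSDBs are rotated).
    func route(sampleT, mainMinValidT, nowMs int64) string {
        switch {
        case sampleT < nowMs-maxAgeMs:
            return "reject: older than the configured max age"
        case sampleT >= mainMinValidT:
            return "main TSDB (M)"
        case (sampleT/sideWindowMs)%2 == 0:
            return "side TSDB X"
        default:
            return "side TSDB Y"
        }
    }

    func main() {
        hour := int64(60 * 60 * 1000)
        now := time.Now().UnixMilli()
        mainMinValid := now - hour // the main TSDB accepts roughly the last 1h
        fmt.Println(route(now-30*60*1000, mainMinValid, now)) // main TSDB (M)
        fmt.Println(route(now-3*hour, mainMinValid, now))     // side TSDB X or Y, by 6h window
        fmt.Println(route(now-8*hour, mainMinValid, now))     // reject: older than the configured max age
    }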

Small implementation details:

  1. The new TSDBs are created only if a sample is received for those time ranges and it was not deemed out-of-order in the main TSDB.
  2. The 7h max age is calculated w.r.t. the wall clock and not the max time of the main TSDB; this is to avoid a sudden advancement of that maxt which could make both old TSDBs unavailable, as they need to be compacted before they can ingest new data.
  3. Shutdown, the flush-and-shutdown handler, etc., on these new TSDBs will behave similarly to the main TSDB.
  4. Transfers during handover won't be done for these new TSDBs (moreover, handover might be entirely removed for blocks, see https://github.com/cortexproject/cortex/issues/2966).

Resources:

  1. This can take up to 3x the memory if all tenants send old data up to the max age (with memory-mapping of head chunks, it's mostly the number of series that matters for memory).
  2. Needs more disk space to support the new TSDBs. Exact/rough numbers are yet to be calculated.

brancz commented 4 years ago

This can take up to 3x the memory if all tenants send old data up to the max age (with memory-mapping of head chunks, it's mostly the number of series that matters for memory).

I don't understand what this sentence means. Does it take 3x memory, or is that overhead optimized away by mmapping of head chunks?

codesome commented 4 years ago

Does it take 3x memory, or is that overhead optimized away by mmapping of head chunks?

It can take up to 3x more memory if all 3 TSDBs are in action. With m-mapping of the head, the main factor deciding the memory taken by a TSDB is the number of series in the head; the time range does not matter, because only 1 chunk per series plus the labels/index is kept in memory at any point in time, with the remaining chunks on disk.

So in normal usage, when you don't need to ingest old data, there is 0 overhead because no additional TSDB is open. If one more TSDB comes into action to backfill old data and it has a similar set of series to the main TSDB, then it's roughly 2x the memory consumption as before. If the third TSDB also comes into action because the samples span a wider range, then 3x in the worst case. And this is all a per-tenant memory increase, so to hit 3x memory usage for the ingesters as a whole, all tenants would have to be sending old data at the same time (if it's just 1 tenant, then the probability increases).

brancz commented 4 years ago

Understood, thank you for explaining. I expect this happens primarily during extended downtime of the backend, so it's rather likely that it happens for many tenants at once, no? In such a scenario, do you think the resulting thundering herd would be manageable?

itzg commented 4 years ago

In our case the thundering herd is expected since our tenants will demand that the 6 hours of backfill be eventually available to visualize at some point. After an outage, spinning up larger memory pods seems like a reasonable operation to accommodate this design.

Thanks, @codesome , this is looking very promising.

codesome commented 4 years ago

In such a scenario, do you think the resulting thundering herd would be manageable?

If Cortex is down, ingesting >1h-old samples only comes into the picture when tenants have multiple Prometheus instances and some instances catch up faster than others, pushing the maxt of the main TSDB ahead. If not, it will be a single TSDB as before.

Considering the worst case of everyone requiring ingestion of >1h-old samples: if the ingesters are already provisioned to use at least 3x the normal memory, I think things should be fine in terms of memory (with m-mapping, the memory used will be capped in the absence of queries). But CPU throttling can be expected, driving latency higher and maybe leading to distributors OOMing.

This is all considering the ideal scenario where the user does not have a lot of churn. But from what we have learnt from past outages in the non-blocks version of Cortex, it is advisable to scale up the ingesters (and also the distributors) when the cluster is recovering from an outage, to manage the thundering herd.

bwplotka commented 4 years ago

Hey, thanks for this. This can work, but in fact what we propose here is literally multi-TSDB to me (: So: how is this different from multiple TSDBs capped at T-7h? Is it the block range being 6h?

pracucci commented 4 years ago

We're still working on it

MedoDome commented 3 years ago

Any information about this? I would like to add some historical data into Cortex

pracucci commented 3 years ago

Any information about this? I would like to add some historical data into Cortex

The approach we took in #3025 should work but has some downsides:

  1. The backfilling code path is exercised only during backfilling, so bugs may slip in and we may not notice them until a backfilling event happens
  2. The ingester memory utilisation could potentially grow 3x, risking hitting the ingester memory limit and causing an outage. If the backfilling covers the case of recovering after an outage, we may end up with one outage right after another

We're still evaluating pros/cons of #3025, while discussing alternative ideas.

MedoDome commented 3 years ago

Thank you for the fast reply, I appreciate it.

bboreham commented 3 years ago

Any update after a few months?

rajsameer commented 3 years ago

Hey guys, we had a similar issue: when writing metrics to Cortex for a new tenant, after a few minutes the ingester would start throwing out-of-bound errors. The reason we found is that we mistakenly did not disable alerts and rules in our Prometheus and added the same rules and alerts for the tenant in Cortex. Now, if a rule gets evaluated by Cortex first and Cortex then receives the sample from Prometheus, which is also evaluating the same rules, the ingester starts throwing out-of-bound errors, and this happens forever because Prometheus keeps trying to send the same data and Cortex keeps rejecting it. Just to test this hypothesis, we disabled the rules and alerts on Cortex and remote write is working fine. Should we add a note in the documentation stating not to have the same rules and alerts in both Prometheus and Cortex?

damnever commented 3 years ago

We ran into this most likely because the distributor got overloaded... which means the middleware (gateway/proxy) and an unstable network may also cause this issue.