getsentry / sentry

Developer-first error tracking and performance monitoring
https://sentry.io

Replay: bulk delete replays from the Replay List page #42551

Open · bruno-garcia opened this issue 1 year ago

bruno-garcia commented 1 year ago

Problem Statement

Users want to be able to keep replays for only 48 hours.

48h is a compliance requirement for us due to GDPR (RGPD) policy. We need all text, media, and user input, but we only need them for 48h.

Solution Brainstorm

Allow bulk delete of selected replays (where filter would be the replay time).

GuillaumeCisco commented 1 year ago

Also adding the possibility to automatically delete replays older than a configurable time from the settings :)

bruno-garcia commented 6 months ago

> Also adding the possibility to automatically delete replays older than a configurable time from the settings :)

That's something we offer to enterprise orgs because it requires operations to change configuration, so there is some work involved. We don't have plans to offer a setting in the product for this at the moment.

GuillaumeCisco commented 6 months ago

> > Also adding the possibility to automatically delete replays older than a configurable time from the settings :)
>
> That's something we offer to enterprise orgs because it requires operations to change configuration, so there is some work involved. We don't have plans to offer a setting in the product for this at the moment.

I have an enterprise org account and I have never been contacted for that despite all my emails...

bruno-garcia commented 6 months ago

> > > Also adding the possibility to automatically delete replays older than a configurable time from the settings :)
> >
> > That's something we offer to enterprise orgs because it requires operations to change configuration, so there is some work involved. We don't have plans to offer a setting in the product for this at the moment.
>
> I have an enterprise org account and I have never been contacted for that despite all my emails...

Could you please send me an email with your Sentry org slug, to my first name @ sentry .io? I'll make sure folks reach out to you.

bruno-garcia commented 6 months ago

After a conversation internally, changing the retention isn't something we can offer anytime soon.

Keeping this ticket for bulk delete specifically.

I found an internal document that goes into detail; I'll pull some of the technical details here:

Clickhouse stores data in a columnar format: each column gets written to a separate file. The unit in which data gets written is called a part. Parts are directories on the filesystem. Individual parts belong to a higher abstraction called partitions. Conceptually, it looks like this:

```mermaid
graph TD
  Part_1_Column_A --> Part_1
  Part_1_Column_B --> Part_1
  Part_2_Column_A --> Part_2
  Part_2_Column_B --> Part_2
  Part_3_Column_A --> Part_3
  Part_3_Column_B --> Part_3
  Part_4_Column_A --> Part_4
  Part_4_Column_B --> Part_4
  Part_1 --> Partition_1
  Part_2 --> Partition_1
  Part_3 --> Partition_2
  Part_4 --> Partition_2
```

Each table in Clickhouse defines a partition_key which determines which partition data gets written into. More information about the partition_key can be found [here](https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key/).

Let's talk about the partition_key most commonly used in Sentry. Our tables have a partition_key which looks like this:

```sql
PARTITION BY (retention_days, toMonday(timestamp))
```
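
For illustration, here is a minimal sketch of what a table using such a partition key could look like (the table name, columns, and ORDER BY are hypothetical, not the actual Snuba schema):

```sql
-- Hypothetical, simplified table: real Sentry/Snuba tables have many more columns.
CREATE TABLE replays_local
(
    replay_id      UUID,
    project_id     UInt64,
    retention_days UInt16,
    timestamp      DateTime
)
ENGINE = MergeTree
-- every (retention_days, week) combination becomes its own partition
PARTITION BY (retention_days, toMonday(timestamp))
ORDER BY (project_id, timestamp, replay_id);
```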

retention_days refers to how long the data needs to be stored in Clickhouse. timestamp represents the time the event was received, so toMonday(timestamp) means the Monday preceding the timestamp. An example of all partitions on the errors cluster, as of a query run on December 5, 2022, looks something like this:

[image: partition listing for the errors cluster as of December 5, 2022]
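
A listing like this can be produced from Clickhouse's own bookkeeping; a hedged sketch against the standard system.parts table (the table name is hypothetical):

```sql
-- One row per partition: how many active parts it has, how many rows, and its size on disk.
SELECT
    partition,
    count()                                 AS active_parts,
    sum(rows)                               AS total_rows,
    formatReadableSize(sum(bytes_on_disk))  AS size_on_disk
FROM system.parts
WHERE table = 'replays_local' AND active
GROUP BY partition
ORDER BY partition;
```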

The difference between active and inactive parts boils down to how fast Clickhouse can merge old parts into new parts. Merging data from old parts into new parts happens in the background; the operation is similar to the merge phase of merge sort. The reason Clickhouse performs the merge is to reduce the number of files that need to be searched while serving a query: the fewer files that need to be read, the faster queries run.
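
To make the active/inactive distinction concrete, a hedged sketch: system.parts exposes an active flag, and a merge can also be triggered by hand, though normally the background scheduler takes care of it (table name hypothetical):

```sql
-- Inactive parts have already been merged into a bigger part but not yet removed from disk.
SELECT partition, active, count() AS parts
FROM system.parts
WHERE table = 'replays_local'
GROUP BY partition, active
ORDER BY partition, active;

-- Force a merge of all parts (normally left to the background merge scheduler).
OPTIMIZE TABLE replays_local FINAL;
```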

From the data above, we can see that more parts are being written to the most recent partitions than to the older ones. That is expected, since new data usually arrives with newer timestamps and the number of writes landing on older weeks keeps decreasing over time.

Possible impacts of adding more retention days

Let's assume we add one more retention_days bucket, 60. Let's also assume that each table has 30 columns, which would mean 30 files for each part. We will look at how the total number of parts and files changes with the new retention bucket.

[Untitled Database](https://www.notion.so/07fa5ec4bed242eabbb8e06b40ffd5c2?pvs=21)
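
The same kind of count can also be pulled live from the system tables; a hedged sketch (table name hypothetical, ignoring the extra index/checksum files each part carries):

```sql
-- Approximate number of column data files: active parts x number of columns.
SELECT p.active_parts * c.column_count AS approx_column_files
FROM
    (SELECT count() AS active_parts FROM system.parts   WHERE table = 'replays_local' AND active) AS p
CROSS JOIN
    (SELECT count() AS column_count FROM system.columns WHERE table = 'replays_local') AS c;
```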

That means adding a 60-day retention bucket would cause the number of files used by Clickhouse to almost double. Let's look at the impacts of this change.

Impact on writes/INSERTs

Since we use retention days as part of the partitioning key, there is more work to perform during INSERTs. When data is INSERTed into Clickhouse, the database needs to look at the timestamp field and create new parts/files for the additional retention period, which adds work to each individual INSERT operation.
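
As a hedged illustration (hypothetical table, values chosen only to show the effect): a single INSERT whose rows span several (retention_days, week) buckets is split into one new part per partition it touches.

```sql
-- Three rows, three different partitions, so this one INSERT creates three new parts.
INSERT INTO replays_local (replay_id, project_id, retention_days, timestamp) VALUES
    (generateUUIDv4(), 1, 30, now()),                   -- partition (30, this week's Monday)
    (generateUUIDv4(), 1, 90, now()),                   -- partition (90, this week's Monday)
    (generateUUIDv4(), 1, 90, now() - INTERVAL 7 DAY);  -- partition (90, last week's Monday)
```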

Impact on Zookeeper

Metadata for each part written to Clickhouse gets stored in Zookeeper. This metadata is used for replicating parts from one node of the cluster to another. Writing more parts implies writing more entries to the Zookeeper logs and heavier usage of the replication queue.
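
A hedged sketch of how that replication pressure can be observed, using the standard system.replication_queue table (table name hypothetical):

```sql
-- Pending replication work (part fetches, merges, etc.) coordinated through Zookeeper.
SELECT type, count() AS pending_entries
FROM system.replication_queue
WHERE table = 'replays_local'
GROUP BY type
ORDER BY pending_entries DESC;
```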

Impact on background merges

Since INSERTs create more parts, the amount of background-merge work increases: the merge operation now has to run over a larger number of parts. Background merges increase both the I/O load and the CPU load on the system.
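
Merge load is visible directly in Clickhouse; a hedged sketch against the standard system.merges table (table name hypothetical):

```sql
-- Currently running background merges: how far along they are and how much data they read.
SELECT
    table,
    round(elapsed, 1)                                AS seconds_elapsed,
    round(progress * 100, 1)                         AS percent_done,
    num_parts                                        AS parts_being_merged,
    formatReadableSize(total_size_bytes_compressed)  AS input_size
FROM system.merges
WHERE table = 'replays_local';
```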

Impact on reads/SELECTs

When querying for data, one of the phases Clickhouse goes through is figuring out which parts to read data from. The query itself does not know which parts hold the relevant data; the index files help answer that question. Having more parts means having more index files, since each part has its own index file, and those index files need to fit into memory.
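
A hedged sketch of how that pruning can be inspected on recent Clickhouse versions, using the EXPLAIN indexes = 1 form (table name hypothetical): only the parts whose partition key and primary index match the filter get read.

```sql
-- Shows how many partitions/parts the partition key and primary key prune for this query.
EXPLAIN indexes = 1
SELECT count()
FROM replays_local
WHERE retention_days = 90
  AND timestamp >= now() - INTERVAL 2 DAY;
```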
