alcionai / corso

Free, Secure, and Open-Source Backup for Microsoft 365
https://corsobackup.io
Apache License 2.0

Question: One repository or multiple per tenant? #4832

Closed: Integratinator closed this issue 10 months ago

Integratinator commented 10 months ago

I am wondering about the best practice concerning repositories. I have one S3 bucket containing multiple repositories. At the moment I am using a separate repository (via the --prefix option) for each app in a tenant, for example: /tenant1-onedrive, /tenant1-sharepoint, /tenant2-exchange, /tenant2-onedrive, ...
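Concretely, my init commands look roughly like this (the bucket name is just an example, and I'm leaving out the endpoint/credential flags; see `corso repo init s3 --help` for the full set):

```bash
# One repository (prefix) per tenant/app pair, all within the same S3 bucket.
corso repo init s3 --bucket m365-backups --prefix tenant1-onedrive
corso repo init s3 --bucket m365-backups --prefix tenant1-sharepoint
corso repo init s3 --bucket m365-backups --prefix tenant2-exchange
corso repo init s3 --bucket m365-backups --prefix tenant2-onedrive
```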

Is it possible, and would it be better for performance or deduplication, to use the same repository for multiple apps in the same tenant? My reasoning is that SharePoint and OneDrive share a lot of the same data, so this could reduce the amount of stored data. But I am also guessing that it could hurt performance because the index grows much larger, especially in a tenant with thousands of users and several TB of data.

ryanfkeepers commented 10 months ago

Hi @Integratinator! This is a great question, thanks for asking it.

When the number of resources (users, sites, etc.) is only a couple hundred or fewer, Corso will perform optimally with a single repository per Microsoft 365 app per tenant. A separate repository for each app (Exchange, OneDrive, SharePoint, etc.) gives a good balance between maximizing content deduplication and other storage performance gains.

Since you're working with thousands of resources, you'll probably see better performance if you further partition each repository into smaller sets, e.g. /tenant-exchange-<partition>. A hundred or so resources per partition is probably best here.
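In practice that layout could look something like the sketch below (the bucket and partition names are just illustrative; use whatever grouping stays stable for you):

```bash
# One repository (prefix) per ~100-resource partition of an app within a tenant.
corso repo init s3 --bucket m365-backups --prefix tenant1-exchange-part01
corso repo init s3 --bucket m365-backups --prefix tenant1-exchange-part02
corso repo init s3 --bucket m365-backups --prefix tenant1-exchange-part03
# ...and so on, with each partition always backing up the same fixed set of mailboxes.
```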

Extra partitioning can provide two benefits: 1/ smaller storage indexes and less metadata to manage, and 2/ (if you're running backups concurrently) less temporary state during backup processing. For smaller tenants, especially when backups run sequentially, we believe the ease of use of a single repo outweighs the performance gains. But performance control is the right consideration for a large tenant like yours.
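To make the concurrency point concrete, here's a rough sketch of one Corso process per partition, each pointed at its own repository. The --config-file and --mailbox wiring is only my assumption of how you'd script it; double-check the flags against `corso backup create exchange --help` on your version.

```bash
# Sketch: back up each partition in its own process, against its own repository.
# Assumes each partition repo was initialized with its own config file, and that
# mailboxes-<part>.txt holds the comma-separated mailbox list for that partition.
for part in part01 part02 part03; do
  corso backup create exchange \
    --config-file "corso-tenant1-exchange-${part}.toml" \
    --mailbox "$(cat "mailboxes-${part}.txt")" &
done
wait  # let all partition backups finish
```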

If you have further questions, I'd encourage you to join our Discord server. We've got a very helpful crowd there, plus other users who may have already built scripts around the kind of partition handling you're doing.

Integratinator commented 10 months ago

Thank you for the extensive answer. I am a little bit surprised by it. I understand that a smaller partition would be beneficial from a performance standpoint. However, from a storage standpoint, I would think that smaller partitions would reduce the effectiveness of deduplication.

For example: I have a tenant with 3000 users (and mailboxes). Let's say we partition them into groups of 100 users, giving 30 partitions/repositories. Now consider the CEO sending an email with a Christmas card as an attachment (size 120 MB) to all employees. Would the same email and attachment then be backed up 30 times, once for each partition/repository? We would then have used 3.6 GB of storage instead of 120 MB with a single repository. Is my thinking correct, or am I missing something?

ryanfkeepers commented 10 months ago

> Is my thinking correct, or am I missing something?

You are spot on. Our partitioning suggestion is based on internal analysis of the tradeoff between storage management and content deduplication.

Through our own usage of Corso we've found that the growth of storage indexes (i.e. the number of backup artifacts, which scales with the number of resources times the backup history kept for each resource) within a large repository causes the greater performance impact overall. These impacts primarily manifest as higher memory and CPU consumption due to metadata and index retrieval at runtime. Since storage capacity tends to be cheaper than CPU or memory, the gains from content deduplication do not, at scale, offset the runtime costs.

On the bright side, the storage layer does include compression, so you wouldn't store the full 3.6 GB. It's difficult to know how much each blob will compress, but the benefit is there.