Layla-P / HacktoberfestProject

Log your Hacktoberfest .NET PRs at prtracker.net

Current structure might hit limits #16

Closed rickvdbosch closed 4 years ago

rickvdbosch commented 4 years ago

In the current entity setup, the repos a user contributed to, together with the PRs for each repo, are serialized into a JSON string and stored in a single Table Storage column. The maximum size of a string column in Table Storage is 64 KiB:

String values may be up to 64 KiB in size. Note that the maximum number of characters supported is about 32 K or less.
Source: Understanding the Table service data model - Property types

Because of this limit, the current structure might be insufficient for (very) active users.
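To make the limit concrete, here is a rough sketch of how one could check whether a serialized string still fits in one column. It assumes the 64 KiB limit applies to the UTF-16 encoded value, which is where the "about 32 K characters" in the quote comes from. The payload shape and the names `serialize_contributions` and `sample_repos` are made up for illustration and are not taken from the project.

```python
from __future__ import annotations

import json

# Assumption: the 64 KiB limit applies to the UTF-16 encoded string value.
MAX_STRING_PROPERTY_BYTES = 64 * 1024


def utf16_size(value: str) -> int:
    """Return the size of `value` in bytes when encoded as UTF-16 (no BOM)."""
    return len(value.encode("utf-16-le"))


def fits_in_string_property(value: str) -> bool:
    """True if `value` would fit in a single string column under the assumed limit."""
    return utf16_size(value) <= MAX_STRING_PROPERTY_BYTES


def serialize_contributions(repos: list[dict]) -> str:
    """Serialize repos and their PRs into one compact JSON string (hypothetical shape)."""
    return json.dumps(repos, separators=(",", ":"))


def sample_repos(pr_count: int) -> list[dict]:
    """Build a made-up payload: one repo with `pr_count` pull requests."""
    prs = [
        {"id": i, "url": f"https://github.com/some-owner/some-repo/pull/{i}"}
        for i in range(1, pr_count + 1)
    ]
    return [{"owner": "some-owner", "name": "some-repo", "prs": prs}]


if __name__ == "__main__":
    for count in (10, 1000):
        payload = serialize_contributions(sample_repos(count))
        print(count, utf16_size(payload), fits_in_string_property(payload))
```

With this made-up shape, a handful of PRs fits easily, while around a thousand PRs in one string no longer does.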

I propose to implement an alternative structure to make sure we can accommodate even the most active GitHub users. Is that OK?

Layla-P commented 4 years ago

@CrypticEngima Are you able to look into this?

CrypticEnigma00 commented 4 years ago

@rickvdbosch Currently we are not expecting the usage to hit those limits, but if we do start getting that amount of traffic we can look at refactoring this. In preparation for that, could you please describe the changes you propose to make? (I'm personally interested, as this is the first time I'm using Table Storage.)

rickvdbosch commented 4 years ago

@CrypticEngima For the way I usually work with Table Storage, you could take a look at my TableStorageRepository for reference. Might be interesting.

For the entities, I would think about the following:

| Table | Partition Key | Row Key |
| --- | --- | --- |
| Users | "Users" | Username |
| Repositories | Username | Reponame |
| PullRequests | Username + Reponame | PrId |

There's a downside here: you need multiple queries to get all the information. But with proper partitioning that shouldn't be a real issue.
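As a rough illustration of that multi-query pattern, the sketch below uses a small in-memory stand-in for the Repositories and PullRequests tables. Nothing here is the Azure SDK: `InMemoryTable`, `pr_partition_key` and the `|` separator used to combine Username and Reponame are all hypothetical choices for the example.

```python
from __future__ import annotations


class InMemoryTable:
    """Tiny stand-in for one table: rows are addressed by (PartitionKey, RowKey)."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], dict] = {}

    def upsert(self, partition_key: str, row_key: str, **props) -> None:
        self._rows[(partition_key, row_key)] = props

    def query_partition(self, partition_key: str) -> list[str]:
        """Return the sorted row keys stored in one partition."""
        return sorted(rk for (pk, rk) in self._rows if pk == partition_key)


def pr_partition_key(username: str, reponame: str) -> str:
    # Assumption: Username + Reponame are joined with "|"; the thread does not say how.
    return f"{username}|{reponame}"


def get_prs_per_repo(
    repos: InMemoryTable, prs: InMemoryTable, username: str
) -> dict[str, list[str]]:
    """One query for the user's repos, then one query per repo for its PRs."""
    result: dict[str, list[str]] = {}
    for reponame in repos.query_partition(username):
        result[reponame] = prs.query_partition(pr_partition_key(username, reponame))
    return result


if __name__ == "__main__":
    repos, prs = InMemoryTable(), InMemoryTable()
    repos.upsert("some-user", "some-repo")
    prs.upsert(pr_partition_key("some-user", "some-repo"), "19", title="Example PR")
    print(get_prs_per_repo(repos, prs, "some-user"))
```

Each lookup stays within a single partition, which is why the extra queries should stay cheap.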

CrypticEnigma00 commented 4 years ago

@rickvdbosch Thank you so much for sharing that information. I can most certainly see the benefits of this structure. I do have one overriding question about the format you suggest here, though:

Does this format not turn a key value pair storage into a basic Relational Database?

Maybe I'm misunderstanding the usage of 'NoSQL' style storage; I'm so used to using relational databases.

rickvdbosch commented 4 years ago

Well, the current structure does the same, but only by serializing data instead of having it in separate tables. 😁

Looking at this from an API perspective, there are some clear entry points to be seen.

  1. Get repos per user
  2. Get PRs for a repo (of a user)

This would validate the structure, since you're going to need to call 1 before calling 2. Our use of MVC might drive us to think we'd need all the data at once for our model.

Come to think of it, maybe the user table is not even needed. It doesn't store anything other than the username... right? So having the username as the PK of the repos table eliminates that one. And to be honest, I'm not entirely sure about the repositories table either.

That would solve the issue entirely 🤓

rickvdbosch commented 4 years ago

So I took the time to play a game of tennis, and CrypticEngima's comment and the relaxation gave me some new insights. Nothing in this comment is meant as criticism, only to get us to the best solution. So here goes:

The current solution

The proposal in my earlier comment in this thread was based on an existing model, which actually seems set up with a relational model in mind. But I think we might need to take a step back in defining the model.

Requirements

What we should do first is define what data we actually need to store. The user-table, for instance, can be removed since the only thing we store is the username. That's something we can store elsewhere.
Next we need to take a look at the levels at which we want to retrieve that data. Because if we always get all repositories and the PRs the user has for those repos, the model can be reduced to a single table. That's the cool thing about Table Storage: it's so fast and cheap that it's not bad to store things multiple times. Normalization is not that important anymore.

Proposal (beware, based on assumptions above)

|  | PartitionKey | RowKey | Column | Column |
| --- | --- | --- | --- | --- |
| Structure | Username | {owner}:{reponame}:{prId} | Url | Title |
| Example | rickvdbosch | Layla-P:HacktoberfestProject:19 | https://github.com/Layla-P/HacktoberfestProject/pull/19 | Get user from table storage based on GitHub info |

This enables us to get all information for a specific user by querying that user's entire partition. The combined RowKey is unique and can be parsed into three separate values: owner, reponame and prId.
As an aside: we can generate the URL from that information. So to be efficient we could remove Url too. But if it's simpler to keep it, then we should.
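A minimal sketch of building and parsing the combined RowKey and deriving the URL from it. It assumes that neither the owner nor the repo name contains a colon. The names `PullRequestKey`, `make_row_key`, `parse_row_key` and `pull_request_url` are hypothetical and not part of the project.

```python
from typing import NamedTuple


class PullRequestKey(NamedTuple):
    owner: str
    reponame: str
    pr_id: int


def make_row_key(owner: str, reponame: str, pr_id: int) -> str:
    """Combine the three parts into the {owner}:{reponame}:{prId} format."""
    return f"{owner}:{reponame}:{pr_id}"


def parse_row_key(row_key: str) -> PullRequestKey:
    """Split a RowKey back into its parts; raises ValueError if it is malformed."""
    owner, reponame, pr_id = row_key.split(":")
    return PullRequestKey(owner, reponame, int(pr_id))


def pull_request_url(key: PullRequestKey) -> str:
    """Derive the GitHub PR URL from the key, so Url would not need its own column."""
    return f"https://github.com/{key.owner}/{key.reponame}/pull/{key.pr_id}"


if __name__ == "__main__":
    key = parse_row_key("Layla-P:HacktoberfestProject:19")
    print(key)
    print(pull_request_url(key))
```

For the example row above, parsing `Layla-P:HacktoberfestProject:19` and passing the result to `pull_request_url` gives back the URL shown in the table.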

Input or ideas?

Any ideas @Layla-P and @CrypticEngima?

CrypticEnigma00 commented 4 years ago

@rickvdbosch Thanks, I see the way you're thinking about this now, and yes, it's a big change from the way you think about data in a relational database. I think I need to investigate the 'NoSQL' style further to understand it better, but this information has been a real eye-opener.