Closed rickvdbosch closed 4 years ago
@CrypticEngima Are you able to look into this?
@rickvdbosch Currently we are not expecting the usage to hit those limits, but if we do start getting that amount of traffic we can look at refactoring this. In preparation for that, could you please describe the changes you propose to make? (I'm personally interested, as this is the first time I'm using Table Storage.)
@CrypticEngima For the way I usually work with Table Storage, you could take a look at my TableStorageRepository for reference. Might be interesting.
For the entities, I would think about the following:

| Table | Partition Key | Row Key |
|---|---|---|
| Users | "Users" | Username |
| Repositories | Username | Reponame |
| PullRequests | Username + Reponame | PrId |
There's a downside here, since you need to do multiple queries to get all the information. But with proper partitioning that shouldn't be an actual issue.
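To make the "multiple queries" point concrete, here is a minimal sketch that models the Repositories and PullRequests tables as in-memory dictionaries keyed by (PartitionKey, RowKey). It does not use any Table Storage SDK. The function name, the sample data and the `:` separator inside the "Username + Reponame" partition key are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# In-memory stand-in for a table: {(partition_key, row_key): entity}
Table = Dict[Tuple[str, str], dict]


def pull_requests_for_user(
    username: str, repositories: Table, pull_requests: Table
) -> Dict[str, List[str]]:
    """Collect all PR ids per repository for one user."""
    result: Dict[str, List[str]] = {}
    # Query 1: every row in the user's partition of the Repositories table.
    repo_names = [row for (pk, row) in repositories if pk == username]
    for reponame in repo_names:
        # Queries 2..n: one partition query per repository.
        # Assumption: partition key is "username:reponame".
        partition = f"{username}:{reponame}"
        result[reponame] = sorted(
            row for (pk, row) in pull_requests if pk == partition
        )
    return result


# Hypothetical sample data.
repositories: Table = {
    ("rickvdbosch", "HacktoberfestProject"): {},
    ("rickvdbosch", "OtherRepo"): {},
}
pull_requests: Table = {
    ("rickvdbosch:HacktoberfestProject", "19"): {"Title": "Get user from table storage based on GitHub info"},
    ("rickvdbosch:OtherRepo", "3"): {},
    ("rickvdbosch:OtherRepo", "7"): {},
}
```

With this layout the number of partition queries grows with the number of repositories a user has, which is the downside mentioned above.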
@rickvdbosch Thank you so much for sharing that information. I can most certainly see the benefits of this structure. I have one overriding question about the format you suggest here, though:
Does this format not turn a key-value store into a basic relational database?
Maybe I'm misunderstanding the usage of 'NoSQL'-style storage; I'm so used to using relational databases.
Well, the current structure does the same, but only by serializing data instead of having it in separate tables. 😁
Looking at this from an API perspective, there are some clear entry points to be seen.
This would validate the structure, since you're going to need to call 1 before calling 2. Us using MVC might drive us to think we'd need all the data at once for our model.
Come to think about it, maybe the user table is not even needed. It doesn't store anything other than the username... right? So having username as the PK of the repos table eliminates that one. And to be honest I'm not entirely sure about the repositories table either.
That would solve the issue entirely 🤓
So I took the time to play a game of tennis, and CrypticEngima's comment and the relaxation gave me some new insights. Nothing in this comment is meant as criticism, only to get us to the best solution. So here goes:
The proposal in my earlier comment in this thread was based on an existing model, which actually seems set up with a relational model in mind. But I think we might need to take a step back in defining the model.
What we should do first is define what data we actually need to store. The user-table, for instance, can be removed since the only thing we store is the username. That's something we can store elsewhere.
Next we need to take a look at the levels at which we want to retrieve that data. Because if we always get all repositories and the PRs the user has for those repos, the model can be brought back to only one table. That's the cool thing about Table Storage: it's so fast and cheap that it's not bad to store things multiple times. Normalization is not that important anymore.
| | PartitionKey | RowKey | Column | Column |
|---|---|---|---|---|
| Structure | Username | {owner}:{reponame}:{prId} | Url | Title |
| Example | rickvdbosch | Layla-P:HacktoberfestProject:19 | https://github.com/Layla-P/HacktoberfestProject/pull/19 | Get user from table storage based on GitHub info |
This enables us to get all information for a specific user by querying the entire partition for a user. The current combined RowKey is unique and can be parsed into three different columns: `owner`, `reponame` and `PrId`.
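As a rough sketch of how the combined RowKey could be built and parsed back into its three parts: the helper names below are hypothetical, and the code assumes that neither the owner nor the repository name contains a `:`.

```python
from typing import NamedTuple


class PrKey(NamedTuple):
    owner: str
    reponame: str
    pr_id: int


def build_row_key(owner: str, reponame: str, pr_id: int) -> str:
    """Combine the parts into '{owner}:{reponame}:{prId}'."""
    for part in (owner, reponame):
        # Assumption: ':' is reserved as the separator.
        if ":" in part:
            raise ValueError(f"part must not contain ':': {part!r}")
    return f"{owner}:{reponame}:{pr_id}"


def parse_row_key(row_key: str) -> PrKey:
    """Split a combined RowKey back into owner, reponame and PR id."""
    # Unpacking raises ValueError if there are not exactly three parts.
    owner, reponame, pr_id = row_key.split(":")
    return PrKey(owner, reponame, int(pr_id))


key = build_row_key("Layla-P", "HacktoberfestProject", 19)
parts = parse_row_key(key)
```

Parsing stays unambiguous only as long as the separator cannot appear inside one of the parts, which is why the builder rejects such input.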
As a sidestep: we can generate the URL based on that information. So to be efficient we could remove Url too. But if it's simpler to keep it, then we should.
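A small sketch of that idea, assuming every PR URL follows the same pattern as the example row (`https://github.com/{owner}/{reponame}/pull/{prId}`); the function name is made up for illustration.

```python
def pr_url_from_row_key(row_key: str) -> str:
    """Derive the PR URL from a '{owner}:{reponame}:{prId}' RowKey."""
    # Assumption: the RowKey has exactly three ':'-separated parts.
    owner, reponame, pr_id = row_key.split(":")
    return f"https://github.com/{owner}/{reponame}/pull/{pr_id}"


url = pr_url_from_row_key("Layla-P:HacktoberfestProject:19")
```

If the URL is always derivable like this, the Url column is redundant; keeping it only trades a little storage for not having to rebuild it on read.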
Any ideas @Layla-P and @CrypticEngima?
@rickvdbosch Thanks, I see the way you're thinking about this now, and yes, it's a big change from the way you think about data in a relational database. I think I need to investigate 'NoSQL'-style storage further to understand it better, but this information has been a real eye-opener.
In the current entity setup, repos and the PRs of that repo a user contributed to are serialized into a JSON string and stored in one Table Storage column. The maximum length of one column in Table Storage is 64 KiB.
Because of this limit, the current structure might be insufficient for (very) active users.
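To get a feel for when the limit would be hit, here is a minimal sketch that serializes a list of repos with their PRs to JSON and compares the size against 64 KiB. The data shape and function names are hypothetical and not the project's actual entity. It also assumes the string is measured in UTF-16, so each character costs at least two bytes.

```python
import json

MAX_COLUMN_BYTES = 64 * 1024  # the 64 KiB column limit mentioned above


def serialized_size_bytes(repos: list) -> int:
    """Size of the JSON string, assuming UTF-16 storage."""
    return len(json.dumps(repos).encode("utf-16-le"))


def fits_in_one_column(repos: list) -> bool:
    return serialized_size_bytes(repos) <= MAX_COLUMN_BYTES


def sample_repos(pr_count: int) -> list:
    """Hypothetical data: one repo with pr_count pull requests."""
    prs = [
        {"id": i, "url": f"https://github.com/owner/repo/pull/{i}"}
        for i in range(1, pr_count + 1)
    ]
    return [{"owner": "owner", "name": "repo", "prs": prs}]


small = fits_in_one_column(sample_repos(5))
large = fits_in_one_column(sample_repos(1000))
```

Under these assumptions a handful of PRs fits easily, while a user with around a thousand PRs no longer fits in a single column.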
I propose to implement an alternative structure to make sure we can accommodate even the most active GitHub users. Is that OK?