Backup new created issue

SimonLab commented 6 years ago

As a user I want to save new created issue So that I can access a backup version of the Github issues.

linked to #31 (How to save a json file on S3 with Phoenix) and #5 (What kind of issue information we want to backup?)

Steps

[x] Catch the new issue webhook event:
- add a new catch case for the :issue_created type of event, see: https://github.com/dwyl/github-backup/blob/73cafc01a5ff915a39bfeebfe986d7e0f25f5d2a/lib/app_web/controllers/event_type.ex#L30 https://github.com/dwyl/github-backup/blob/73cafc01a5ff915a39bfeebfe986d7e0f25f5d2a/lib/app_web/controllers/event_controller.ex#L12
[x] Create a simple version of the issue, comment and revision schemas and tables: issue: id (unique on Github), number, html_url comment: id, issue_id, revisions: id, id_comment

The user schema/table and the reference to the user on the revision table can be added later on
[x] Save the {id_isssue}.json file on S3 with the following structure: {revision_id: "content of the comment"},...

SimonLab commented 6 years ago

I think we still need to clarify the structure of the tables especially the revisions table(s) (we mentioned this on this comment: https://github.com/dwyl/github-backup/issues/5#issuecomment-367639710

Github is doing a distinction between the "first comment" which is the issue description and the rest of the comments: github_comments

Github give a unique id to the issue and to the comments: issue:

comments (just one here):

Ideally we would have to create only one revision table which save all the history of edits on the issue (first comment) and the comments. We would have use the id of the issue or comments to reference the revisions to the correct comments. The problem is that I think the ids are not unique between issues and comments.

So I still think that we might need to create two revisions tables one for the edits on the issue (first comment) and one for the edits on the comments themselves.

Here is a the flow:

Issue is created -> save meta data in the issue table (id, title html_url...) -> create new revision in the revision_issue with (id, id_issue, timestamp) -> create file on S3 with {id_issue}.json which will have the folllowing structure:
```
{
"issue": {"revision_1": "description issue", "revision_2": "description issue edited",...},
"comments": {"revision_1": "comment", "revision_2": "comment edited",...}
}
```
Issue edited -> create new revision_issue item -> update the s3 file by adding updating the object linked to the "issue" key
Comment added or edited: add a new entry into the revision_comment table -> update the s3 file to update the object on the "comments" key

We can see that the postgres database will keep track of the "timeline of the edit" and the s3 file will only contains the edit and changes of the issue and comments.

@nelsonic @Cleop does this makes sense? So to recap I think I still prefer to create 2 different tables to store the changes of the issue and comments due to the way Github split the two on the api. But if you think another way can work or it's easier to impement let me know :+1:

nelsonic commented 6 years ago

@SimonLab that is reasonable ... 🤔

iteles commented 6 years ago

That's odd, I thought each comment (including the issue description) had a totally unique ID. If this is the case then I reckon this is a good way forward for now.

nelsonic commented 6 years ago

Easy to check ... My original comment (without edit: https://github.com/dwyl/github-backup/issues/37#issuecomment-369579067 Added Emoji: https://github.com/dwyl/github-backup/issues/37#issuecomment-369579067 Still has the same comment id ... @SimonLab perhaps we could include a timestamp in the id we store e.g:


{
  "comments": {
    "369579067_1519907431": "@SimonLab that is reasonable",
    "369579067_1519908529": "@SimonLab that is reasonable ... 🤔"
  }
}

SimonLab commented 6 years ago

@iteles from my previous screenshots we can see that the id of the comment 369565032 > 301369661 the id of the issue and that they have the same format it's difficult to know that the ids are unique and don't overlap. Github might check that the id is not already taken by an issue when a new comment is created but I couldn't find any information about the uniqueness of the ids. That's why I think the issue id might happend to be the same as a comment id on some occasion.

@nelsonic we could try to make the issue and comment ids unique by adding a timestamp but I'm not sure how it can help us.

I think at this stage I'll try to create the two revision tables and while I implement this solution I might find a better way to structure the database

Cleop commented 6 years ago

Hang on I think I'm confused...

Maybe ignore this comment---- Or what I had envisaged in my schema (https://github.com/dwyl/github-backup/issues/5#issuecomment-367417914) was:

you assign each comment an id in the db which is its unique id (like @nelsonic's two versions of the same comment, each has a unique value)
then to show that they're essentially two versions of the same comment that exists in one place on github that was the field I called 'Github-description ID' which would compromise of issue number and comment number (ie number from the top of the issue downwards) which is what would be used to link the two versions of the comments to the fact they are referring to different versions of the same comment.

If we had these two values then I think everything could be done under one 'comments' table

nelsonic commented 6 years ago

@SimonLab I have asked the question about comment id uniqueness on StackOverflow: https://stackoverflow.com/questions/49052928/are-github-issue-comment-ids-unique

For now I think it's "safe" to assume that issue id and comment id are an auto-increment INT in GitHub's MySQL DB. i.e. unique. (I would "bet a million bucks" that issue.id and comment.id are unique i.e. never re-used) however this should not matter to us because we are not going to make our issues.id or comments.is an auto-increment field.

To be clear I was suggesting using a timestamp in our JSON key (above) because the comment id does not change when someone edits their comment. e.g: my comment above has an id of 369579067 regardless of how many times I edit it ... therefore we should store it as 369579067_{timestamp} to easily keep track of the revisions. This will ensure that the older revisions are have a lower (timestamp) number and can be sorted accordingly.

Cleop commented 6 years ago

So...

a user opens an issue and the issue id (and other issue meta data) is stored in the Issues table.
The 'description' of the issue that they write (or what you could call the 1st comment on a page) is added into the Comments table along with any other meta data relating to the comment (the copy of the first comment is not added, I just mean that a row is added to symbolise the place of that comment, the copy itself is dealt with in the json file).

But wait, hold on, github doesn't give the first comment of an issue a 'comment id'? In these cases we assign the ID 'issue id_1' so we know it's the first comment of that issue. In all other cases we use the github generated comment id.

At the same time in the Versions table we create a row for that version of that comment (in this case the first version of the description of the issue). In this table we give the row the value of the comment ID so we know which comment this is a version of. We can also include meta data here which might include a timestamp or connect with data about the user that wrote that version of the comment.

Finally we want to save the copy/text of the description into a json file that has a name and structure that makes it easy to access whenever we want it. At the moment we've discussed structuring our json files by issue so that each one contains all of the comments (and all of their respective versions) relating to one issue within one file. Our json file would look like this:

The name would be: issue id.json
It would then contain keys that would correspond with the version ID which signifies which version of a comment the text is from.

This would be our basic structure. If we then wanted to organise our versions of text by issue or by comment (to display on the front end) then we could do so using the version ID and then querying our DB based on the other fields that were particular to what it is we were trying to achieve.

I hope this makes some sense! It's only covering this specific bit of flow so we can talk more about user tables etc in more detail on the call or at a later date based on priority. We can also talk through this flow here in more detail on the stand up tomorrow as sometimes spoken word is easier to understand.

Cleop commented 6 years ago

This issue, #15 and #5 are all linked and crossing over. I wouldn't say they're 100% duplicates but I think it's a bit confusing having the schema question in 3 places. Can we consolidate it to one issue? @SimonLab and @nelsonic - what do you think?

nelsonic commented 6 years ago

@Cleop agreed. 👍

dwyl / github-backup

Backup new created issue #37

Steps