Add alternative diff url for first branch commit

lgvalle commented 3 years ago

This PR identifies the situation where the action is running on a branch without a previous commit and provides an alternative URL to calculate the diff

This situation can happen, for example, on feature branches that only contain a single commit when the action is setup to be triggered on pull requests.

This might also be a fix for issue https://github.com/alstr/todo-to-issue-action/issues/63

lgvalle commented 3 years ago

Thanks for the reply @alstr, very valid points. Lets break them down:

1. Endpoint format: `commit/` vs `compare/`

I'm not sure if I'm looking in the right place, but when I compare the responses from these two calls it seems that in both cases the content of file is a dict. Am I missing something here? https://api.github.com/repos/alstr/todo-to-issue-action/commits/3fd212fda154273498916d48fdad504a09c6f67d vs https://api.github.com/repos/alstr/todo-to-issue-action/compare/63126d70260dc962111eb9b2293b0606971b0ed5...3fd212fda154273498916d48fdad504a09c6f67d

2. `patch` key content

Same for the patch content: in both cases, I can see the same content for patch

"patch": "@@ -269,7 +269,7 @@ class TodoParser(object):\n     LINE_NUMBERS_INNER_PATTERN = re.compile(r'@@[\\d\\s,\\-+]*\\s@@')\n     ADDITION_PATTERN = re.compile(r'(?<=^\\+).*')\n     DELETION_PATTERN = re.compile(r'(?<=^-).*')\n-    IDENTIFIER_PATTERN = re.compile(r'.+?(?=\\))')\n+    REF_PATTERN = re.compile(r'.+?(?=\\))')\n     LABELS_PATTERN = re.compile(r'(?<=labels:\\s).+')\n     ASSIGNEES_PATTERN = re.compile(r'(?<=assignees:\\s).+')\n     MILESTONE_PATTERN = re.compile(r'(?<=milestone:\\s).+')\n@@ -446,10 +446,10 @@ def _extract_issue_if_exists(self, comment, marker, code_block):\n             for line in lines:\n                 line_status, committed_line = self._get_line_status(line)\n                 cleaned_line = self._clean_line(committed_line, marker)\n-                line_title, identifier = self._get_title(cleaned_line)\n+                line_title, ref = self._get_title(cleaned_line)\n                 if line_title:\n-                    if identifier:\n-                        issue_title = f'[{identifier}] {line_title}'\n+                    if ref:\n+                        issue_title = f'[{ref}] {line_title}'\n                     else:\n                         issue_title = line_title\n                     issue = Issue(\n@@ -527,18 +527,23 @@ def _clean_line(comment, marker):\n         return comment.strip()\n \n     def _get_title(self, comment):\n-        \"\"\"Check the passed comment for a new issue title (and identifier, if specified).\"\"\"\n+        \"\"\"Check the passed comment for a new issue title (and reference, if specified).\"\"\"\n         title = None\n-        identifier = None\n-        title_pattern = re.compile(r'(?<=' + self.identifier + r'[(\\s:]).+')\n+        ref = None\n+        title_pattern = re.compile(r'(?<=' + self.identifier + r'[\\s:]).+')\n         title_search = title_pattern.search(comment, re.IGNORECASE)\n         if title_search:\n             title = title_search.group(0).strip()\n-            identifier_search = self.IDENTIFIER_PATTERN.search(title)\n-            if identifier_search:\n-                identifier = identifier_search.group(0)\n-                title = title.replace(identifier, '', 1).lstrip(':) ')\n-        return title, identifier\n+        else:\n+            title_ref_pattern = re.compile(r'(?<=' + self.identifier + r'\\().+')\n+            title_ref_search = title_ref_pattern.search(comment, re.IGNORECASE)\n+            if title_ref_search:\n+                title = title_ref_search.group(0).strip()\n+                ref_search = self.REF_PATTERN.search(title)\n+                if ref_search:\n+                    ref = ref_search.group(0)\n+                    title = title.replace(ref, '', 1).lstrip(':) ')\n+        return title, ref\n \n     def _get_labels(self, comment):\n         \"\"\"Check the passed comment for issue labels.\"\"\""

3. Extract the file name from `filename` key instead of parsing it from the diff

Seems like the right approach. I'll see if I can add it to this PR 👍

4. `commits/` response is paginated

This is a tricky one and I don't have a solution for it. If there are more than 300 files it'll paginate the response. Given this is a fallback for case where the action would fail anyways, if we move forward with this I'd suggest open a separate issue to deal with pagination

5. Remove `if os.getenv('INPUT_BEFORE') != '0000000000000000000000000000000000000000':`

Yep, should be done in this PR too

alstr commented 3 years ago

Thanks for the reply @alstr, very valid points. Lets break them down:

1. Endpoint format: commit/ vs compare/

I'm not sure if I'm looking in the right place, but when I compare the responses from these two calls it seems that in both cases the content of file is a dict. Am I missing something here? https://api.github.com/repos/alstr/todo-to-issue-action/commits/3fd212fda154273498916d48fdad504a09c6f67d vs https://api.github.com/repos/alstr/todo-to-issue-action/compare/63126d70260dc962111eb9b2293b0606971b0ed5...3fd212fda154273498916d48fdad504a09c6f67d

The difference is setting the header Accept: application/vnd.github.v3.diff when sending a request to the compare/ endpoint returns the raw diff file, which is what currently happens.

If you don't set that header it returns the dict as specified in the documentation.

lgvalle commented 3 years ago

The difference is setting the header Accept: application/vnd.github.v3.diff when sending a request to the compare/ endpoint returns the raw diff file, which is what currently happens.

If you don't set that header it returns the dict as specified in the documentation.

I've been thinking about this. If you set the application/vnd.github.v3.diff header, both requests return the same diff formatted response. So you can still extract the filename with the same regex in both cases:

curl --request GET \
  --url https://api.github.com/repos/alstr/todo-to-issue-action/commits/3fd212fda154273498916d48fdad504a09c6f67d \
  --header 'Accept: application/vnd.github.v3.diff'

 curl --request GET \
  --url https://api.github.com/repos/alstr/todo-to-issue-action/compare/63126d70260dc962111eb9b2293b0606971b0ed5...3fd212fda154273498916d48fdad504a09c6f67d \
  --header 'Accept: application/vnd.github.v3.diff'

It's not clear to me if what you are suggesting is to not send the header in any case from now on and deal with the json output. That would make it simpler to read the filename (it'll be a separate key from patch). It'll also make it easier to scale the action if there's interest in reading other keys in the future. But it's a big change.

If you want to go down this road I'll suggest doing that in a separate PR that only address the change of the parsing strategy without introducing any behaviour changes

alstr commented 3 years ago

I've been thinking about this. If you set the application/vnd.github.v3.diff header, both requests return the same diff formatted response.

Oh that's a cool find. I was not aware of that (doesn't help that it's not mentioned in the documentation) and it definitely makes things simpler. In theory, I can't see any reason why it shouldn't work. I would also assume that by retrieving the diff file directly the response would no longer be paginated. Seeing as the action only ever compares against the previous commit, we could maybe just swap out compare/ for commit/ completely?

lgvalle commented 3 years ago

Yes, entirely replace compare/ with commit/ makes sense, as the action only cares about the last commit which is precisely what commit/ returns. I've updated the PR to account for this.

There's still an issue that can happen depending on how the action is used. Let's take a branch with several commits and a TODO comment only on the first commit. If the action runs on push, this TODO comment on the branch first commit won't be detected. (It'll work fine if the action runs on merge for example, as the merge commit will contain everything)

alstr commented 3 years ago

Yes, that's a bit of a problem. It happens even when not on a separate branch; if I make three commits each with a TODO and push them together, only the one in the last commit will be added.

Here is a potential way of handling it:

If the before SHA is not `000...`

Perform the normal compare/ request, as that will pick up all TODOs across all commits.

If the before SHA is `000..`

Use the compare/ endpoint (without application/vnd.github.v3.diff) to get the list of commits and cycle through them all, getting the diff for each by calling commits/.

In theory a TODO could be added in the second commit and removed in the third commit, which could open then close an issue unnecessarily, but the action shouldn't process them if they appear as both an addition and deletion.

It sounds a bit cumbersome but perhaps one way around it.

lgvalle commented 3 years ago

If the before SHA is not 000...

Perform the normal compare/ request, as that will pick up all TODOs across all commits.

I think this won't work either. The problem, in this case, is determining the SHA to compare with. github.event.before only gives you the previous commit SHA which makes the use of compare/ and commits/ in this action virtually the same. (That's the reason we decided to move everything to commits/ with or without previous SHA)

To be able to deal with the case of three commits pushed together with a new TODO comment on each the action will need to, somehow, store the last analysed commit and compare with that.

Another alternative I can see is tweaking the action a bit to always run on Pull Requests so it can always compare against the base branch and get the full list of changes. But that's a big change out of this PR scope I'd say.

alstr commented 3 years ago

If the before SHA is not 000...

Perform the normal compare/ request, as that will pick up all TODOs across all commits.

I think this won't work either. The problem, in this case, is determining the SHA to compare with. github.event.before only gives you the previous commit SHA which makes the use of compare/ and commits/ in this action virtually the same. (That's the reason we decided to move everything to commits/ with or without previous SHA)

My understanding was that if I have 3 commits pushed at the same time:

abcd123 - 1 TODO added
efgh456 - 1 TODO added
ijkl789 - 1 TODO added

If we call commits/:

Any TODOs added in the first or second commit wouldn't be picked up, because it only shows the diff between efgh456 and ijkl789. Only 1 issue would be created.

If we call compare/:

The diff shows the TODOs added in all three commits. 3 issues would be created.

That is what my testing seems to show. Therefore if the ref is not 0000000, we can use compare/ to get all TODOs?

To be able to deal with the case of three commits pushed together with a new TODO comment on each the action will need to, somehow, store the last analysed commit and compare with that.

Another option could be having a basic text file and writing the last commit ref to it whenever the action runs? 🤔

lgvalle commented 3 years ago

I repeated your test in a separated forked repo and it worked, which confused me a lot 🙈 So I've been investigating the reason because it didn't make sense to me why would it work if comparing with the previous commit and using just commits/ should do the same thing.

The reason why it works is in what github.event.before means. I though it would point to the previous commit in the branch, but it actually refers to the previous commit pushed to origin.

That means we don't need to save the last commit anywhere, Github does it for you 😃

The consequence of all of this is that I should revert the lastest changes to this PR and leave the original approach:

If the before SHA is 000... then use commits/
Else use compare/

In any case, remove the check for 000... so the action always runs.

Do you see any potential issues with it?

alstr commented 3 years ago

The criteria for everything being okay is:

Scenario	Works?	Endpoint
Action retrieves TODOs when a single commit is pushed and the SHA is not `0000000`	✅	`compare/`
Action retrieves TODOs when a single commit is pushed and the SHA is `0000000`	✅	`commits/`
Action retrieves TODOs when two or more commits are pushed, the first of which does not have a SHA of `0000000`	✅	`compare/`
Action retrieves TODOs when two or more commits are pushed, the first of which has a SHA of `0000000`	❓	❓

In the last scenario, If we use commits/ it will obviously just get the last commit, which we don't want, but if we use compare/, will the diff include the contents of the first commit?

lgvalle commented 3 years ago

Action retrieves TODOs when two or more commits are pushed, the first of which has a SHA of 0000000

The problem in this last scenario is we don't have a commit SHA to compare with:

❌ compare/00000...193a6e fails because 00000 is not a valid commit SHA
⚠️ commits/ only give us the last commit, as you said.

So none of the options is 100% good enough for this scenario.

For this last case to work we'll need to find out the last commit of the source branch of the current branch. And compare against that. This should be feasible because you can use compare/ between SHAs and branch names. For example:

https://api.github.com/repos/lgvalle/todo-to-issue-action/compare/debug-branch-three...193a6e5a45cbdedb727e36dcdd1acde37582e3f7

This is comparing a branch debug-branch-three with a commit SHA from another branch.

What I don't know is how to obtain the base branch for an existing one 😕

alstr commented 3 years ago

I've been doing some further investigation and it looks like we can pass details of all commits in a push event by including the commits array from the push event payload.

If we add the following to action.yml:

  COMMITS:
    description: "Some commits"
    required: true
    default: "${{ toJSON(github.event.commits) }}"

We can then retrieve them in the action with os.getenv('INPUT_COMMITS'). These are only the commits in this push, which is very useful.

I just created a new repo, made 5 commits, then pushed them all. I was able to see details for all of those commits, so it would then be a matter:

Finding the SHA of the earliest commit in those pushed.
Is it 0000000? If so, iterate through each and get the diff. If not, just do compare/.

I did read something about it only including the last 20 commits and 3 tags pushed, but I think beyond that is getting too complicated anyway. 😆

lgvalle commented 3 years ago

@alstr using commits array is a great idea! I've updated the PR to make use of this for the last case. We can break it down into three cases now:

before SHA is not empty -> compare/
before SHA is empty and there is only one commit in the branch -> commits/
before SHA is empty and there are multiple commits in the branch -> find the oldest commit in the branch and use compare/

I've updated the PR to consider these cases:

         if self.before != '0000000000000000000000000000000000000000':
            # There is a valid before SHA to compare with
            diff_url = f'{self.repos_url}{self.repo}/compare/{self.before}...{self.sha}'
        elif len(self.commits) == 1:
            # There is only one commit
            diff_url = f'{self.repos_url}{self.repo}/commits/{self.sha}'
        else: 
            # There are several commits: compare with the oldest one
            oldest = sorted(self.commits, key=self.get_timestamp)[0]['id']
            diff_url = f'{self.repos_url}{self.repo}/compare/{oldest}...{self.sha}'

My internal tests seem to confirm this work in all three cases but please let me know what do you think

alstr commented 3 years ago

This looks great, thanks for that. It's actually simpler than I was anticipating; in the case of multiple commits pushed to a new repo, I was anticipating having to call commits/ on each SHA, but yes, getting the first SHA and last SHA and running compare/ is definitely better if that works!

alstr commented 3 years ago

I just added a check to make sure that self.commits != 0. The default workflow makes the action run on release as well. It doesn't do anything, but there are no commits to retrieve.

alstr / todo-to-issue-action