chaoss / grimoirelab-perceval

Send Sir Perceval on a quest to retrieve and gather data from software repositories.
http://perceval.readthedocs.io/
GNU General Public License v3.0

[Git] Add support for git patch-id #851

Open jgbarah opened 1 week ago

jgbarah commented 1 week ago

git patch-id provides a sort-of unique id for the information in a commit. It is basically "a sum of SHA-1 of the file diffs associated with a patch, with line numbers ignored". This means that the same commit keeps the same "patch-id" when it is, for example, cherry-picked to a different repository or branch, or rebased. This is very useful for tracking the same commit as it travels to different repos (for example, all repos in the Linux kernel hierarchy), or for keeping track of a commit when it is rebased or cherry-picked in any way.

I'm not sure which way of computing it would be better. Using git itself is likely the easiest. But it can also be computed directly from the diff, which is maybe more aligned with the way Perceval works.
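
To make the "computed directly from the diff" option concrete, here is a minimal sketch of the idea. It is deliberately NOT git's algorithm and will not produce the same ids as git patch-id. It only illustrates hashing a diff while ignoring line numbers and whitespace. The function name approx_patch_id and the normalization rules are assumptions made for this example.

```python
import hashlib
import re

# Matches the line-number part of a unified diff hunk header.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@")


def approx_patch_id(diff_text):
    """Hash a unified diff, ignoring line numbers and whitespace.

    Illustration only: the digest does NOT match `git patch-id`.
    """
    digest = hashlib.sha1()
    in_patch = False
    for line in diff_text.splitlines():
        if line.startswith("diff "):
            in_patch = True  # everything before the first diff is commit metadata
        if not in_patch:
            continue
        if line.startswith("index "):
            continue  # blob ids differ when the rest of the file differs
        if HUNK_HEADER.match(line):
            line = "@@"  # drop the line numbers and the trailing context label
        normalized = "".join(line.split())  # remove all whitespace
        if normalized:
            digest.update(normalized.encode("utf-8"))
    return digest.hexdigest()
```

Two diffs that apply the same change at different offsets hash to the same value here, which is the property that makes a patch id survive a rebase or cherry-pick. If ids have to be comparable with those from other tools, calling git patch-id itself is the safer choice.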

Another option could be to add an option to the backend to collect the diff, and then compute the patch-id in a separate step, with some other tool, after Perceval finishes its work. But that would make Perceval produce a lot of information (all the diff data), which is far more than needed if you are only interested in the patch-id, and may be impractical for large repos.

sduenas commented 5 days ago

We should run some tests and check the performance of this. We can get the diff of every commit with the option -p. So, something like git log --raw -p --full-diff should do the job. With the diff we can calculate the patch-id. However, this only works the first time the commits are fetched. For newer commits we use git show, which apparently can be combined with git patch-id.
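
A rough sketch of how the two cases could be wired up, assuming the output of git log or git show is piped straight into git patch-id, which prints one "<patch-id> <commit-id>" pair per line. The helper names iter_patch_ids and parse_patch_ids are hypothetical and are not part of Perceval. Whether git patch-id copes with the extra lines produced by --raw is an assumption that would need testing.

```python
import subprocess


def parse_patch_ids(lines):
    """Yield (commit_id, patch_id) from `git patch-id` output lines."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            patch_id, commit_id = fields
            yield commit_id, patch_id


def iter_patch_ids(repo_path, producer_args):
    """Stream `git <producer_args>` into `git patch-id --stable`.

    producer_args is e.g. ["log", "-p"] or ["show", "<sha>"].
    """
    producer = subprocess.Popen(
        ["git", "-C", repo_path, *producer_args],
        stdout=subprocess.PIPE,
    )
    consumer = subprocess.Popen(
        ["git", "-C", repo_path, "patch-id", "--stable"],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy so the producer is signalled if the consumer exits early.
    producer.stdout.close()
    yield from parse_patch_ids(consumer.stdout)
    consumer.stdout.close()
    if consumer.wait() != 0 or producer.wait() != 0:
        raise RuntimeError("git pipeline failed")
```

For the initial fetch this could be called as iter_patch_ids(path, ["log", "--raw", "-p", "--full-diff"]), and for newer commits as iter_patch_ids(path, ["show", sha]). Streaming through a pipe avoids holding the full diff in memory, but the cost of generating every diff is still there, so the performance tests mentioned above remain necessary.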