AmrDeveloper / GQL

Git Query Language (GQL) is a SQL-like language for querying .git files, with support for most SQL features such as grouping, ordering, and aggregation functions
https://amrdeveloper.github.io/GQL/
MIT License
3.29k stars 90 forks

Querying diffs is very slow on moderately large repositories #124

Open mplanchard opened 1 month ago

mplanchard commented 1 month ago

Describe the bug

Queries on diffs for even moderately large repositories are incredibly slow. Our repository at work has ~5,500 commits.

The following operation to get the diff with the most deletions took ~30 minutes:

❯ time .cargo/bin/gitql --query 'select * from diffs order by deletions desc limit 1'
╭──────────────────────────────────────────┬───────────────────┬───────────────────────┬────────────┬───────────┬───────────────┬─────────────────────────┬───────────────────────────────────╮
│ commit_id                                ┆ name              ┆ email                 ┆ insertions ┆ deletions ┆ files_changed ┆ datetime                ┆ repo                              │
╞══════════════════════════════════════════╪═══════════════════╪═══════════════════════╪════════════╪═══════════╪═══════════════╪═════════════════════════╪═══════════════════════════════════╡
│ 8b685201464c3027afe9105bb5ed9b40a1befce7 ┆ Matthew Planchard ┆ msplanchard@gmail.com ┆ 3284       ┆ 41552     ┆ 212           ┆ 2024-08-15 18:15:45.000 ┆ /home/matthew/s/spec/.git         │
╰──────────────────────────────────────────┴───────────────────┴───────────────────────┴────────────┴───────────┴───────────────┴─────────────────────────┴───────────────────────────────────╯

________________________________________________________
Executed in   27.37 mins    fish           external
   usr time   27.25 mins  569.00 micros   27.25 mins
   sys time    0.04 mins    0.00 micros    0.04 mins

During the entire time, a single thread was pretty much pegged. I can get this same result using git and awk in a fraction (1/270th, 0.37%) of the time:

❯ time git log --pretty="@%h" --shortstat | tr "\n" " " | tr "@" "\n" | awk '{if ($7 > deletions) { deletions = $7; commit = $1 }}; END { print commit; print deletions }' 
8b6852014
41720

________________________________________________________
Executed in    6.01 secs    fish           external
   usr time    5.41 secs    0.00 millis    5.41 secs
   sys time    0.63 secs    1.78 millis    0.63 secs
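For reference, the same baseline can be sketched in Python. This is only an illustrative sketch, not part of GitQL: the names `max_deletions` and `run_git_log` are hypothetical, and it assumes the same `git log --pretty="@%h" --shortstat` output used above. One difference from the awk one-liner is that it matches the "deletions(-)" text directly instead of relying on field position `$7`, because git omits the insertions or deletions part of the summary line when the count is zero, which shifts the field positions.

```python
import re
import subprocess

# Matches the optional "N deletion(s)(-)" part of a --shortstat summary line.
DELETIONS_RE = re.compile(r"(\d+) deletions?\(-\)")


def max_deletions(log_text):
    """Return (abbreviated_hash, deletions) for the commit with the most deletions.

    Expects the output of: git log --pretty="@%h" --shortstat
    Returns (None, 0) if no commit reports any deletions.
    """
    best_commit, best_deletions = None, 0
    current = None
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Start of a new commit record; remember its abbreviated hash.
            current = line[1:].strip()
            continue
        match = DELETIONS_RE.search(line)
        if match and current is not None:
            deletions = int(match.group(1))
            if deletions > best_deletions:
                best_commit, best_deletions = current, deletions
    return best_commit, best_deletions


def run_git_log(repo_path="."):
    """Run git log in repo_path (assumed to be a git work tree) and return stdout."""
    result = subprocess.run(
        ["git", "log", "--pretty=@%h", "--shortstat"],
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Calling `max_deletions(run_git_log("/path/to/repo"))` on a checkout would return the abbreviated hash and deletion count of the commit with the most deletions. Commits without a summary line, such as merges, are skipped.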

Queries on commits seem to run in a more reasonable amount of time, e.g.:

❯ time .cargo/bin/gitql --query "select count(author_name) from commits where author_name like '%matthew%'"
╭──────────╮
│ column_2 │
╞══════════╡
│ 1001     │
╰──────────╯

________________________________________________________
Executed in  357.45 millis    fish           external
   usr time  351.94 millis    0.00 micros  351.94 millis
   sys time    4.62 millis  641.00 micros    3.98 millis

To Reproduce

  1. Check out any large repo
  2. Run the example command above

Expected behavior

Speed is at least within an order of magnitude of git/awk.

GQL (please complete the following information):

GitQL version 0.28.0


AmrDeveloper commented 1 month ago

Hello @mplanchard,

I totally agree with you that the diffs table should be faster, and this can be fixed in many ways.

For now, though, I plan to work step by step on general optimizations and broader feature coverage, and only then move on to optimizing specific cases.

Once those features are in place, I think we will be able to run more customizable and faster queries.

Thank you, Amr