cli / cli

GitHub’s official command line tool
https://cli.github.com
MIT License
37.06k stars 5.71k forks source link

PR search returns duplicate results for some values of --limit #9749

Open wknapik opened 4 days ago

wknapik commented 4 days ago
% gh --version   
gh version 2.58.0 (2024-10-01)
https://github.com/cli/cli/releases/tag/v2.58.0
% gh -R brave/brave-core search prs --limit 256 --merged --merged-at ">$(date -I -d '1 month ago')" --base=master --json number -q '.[].number'|wc -l
256
% gh -R brave/brave-core search prs --limit 256 --merged --merged-at ">$(date -I -d '1 month ago')" --base=master --json number -q '.[].number'|sort -u|wc -l
200
% gh -R brave/brave-core search prs --limit 275 --merged --merged-at ">$(date -I -d '1 month ago')" --base=master --json number -q '.[].number'|wc -l
275
% gh -R brave/brave-core search prs --limit 275 --merged --merged-at ">$(date -I -d '1 month ago')" --base=master --json number -q '.[].number'|sort -u|wc -l
225
% gh -R brave/brave-core search prs --limit 300 --merged --merged-at ">$(date -I -d '1 month ago')" --base=master --json number -q '.[].number'|wc -l        
241
% gh -R brave/brave-core search prs --limit 300 --merged --merged-at ">$(date -I -d '1 month ago')" --base=master --json number -q '.[].number'|sort -u|wc -l
241
%
andyfeller commented 2 days ago

Thanks for opening this up, @wknapik, that is some weird behavior! 🤔 I'd like to share my theory and propose a few alternatives for reproducing this.

Essentially, I'm not sure wc -l or sort -u | wc -l are appropriate ways to manipulate and understand this data. It would help to preserve the output in files that we can compare and understand why they are differing.

Let's try to reproduce this

Here are some alterations I like to propose understanding what's going on:

  1. Let's pick a concrete date to look for merges

    Let's base this on the date you opened this issue: 2024-09-14

  2. Let's consider preserving the output from the command into a file so we can compare how they are different

  3. Let's use jq to count the length of the arrays rather than wc -l

    _We have to use jq rather than --jq if we want to preserve the output, should only require: ... | jq '. | length'

The result looks something like this:

# We should be able to see deviation in GitHub API responses after 10 attempts with the same arguments
for i in {1..10}; do \
    echo "Attempt $i";
    gh search prs --limit 256 --merged --merged-at ">2024-09-14" --base master  --repo brave/brave-core --json number > "cli-9749-$i.json";
done

# This is a more accurate way to look purely at the number of items
find . -name "cli-9749-*.json" -print -exec jq '. | length' {} \;

for i in {2..10}; do \
    diff <(jq 'sort' cli-9749-1.json) <(jq 'sort' "cli-9749-$i.json");
done

In theory, this will check the number of results and the actual contents.

How did this work out?

10 attempts all consistent both number of results and actual contents

andyfeller@Andys-MBP:~ $ gh version
gh version 2.58.0 (2024-10-01)
https://github.com/cli/cli/releases/tag/v2.58.0

andyfeller@Andys-MBP:~ $ for i in {1..10}; do \
    echo "Attempt $i";
    gh search prs --limit 256 --merged --merged-at ">2024-09-14" --base master  --repo brave/brave-core --json number > "cli-9749-$i.json";
done
Attempt 1
Attempt 2
Attempt 3
Attempt 4
Attempt 5
Attempt 6
Attempt 7
Attempt 8
Attempt 9
Attempt 10

andyfeller@Andys-MBP:~$ find . -name "cli-9749-*.json" -print -exec jq '. | length' {} \;

./cli-9749-2.json
256
./cli-9749-3.json
256
./cli-9749-8.json
256
./cli-9749-4.json
256
./cli-9749-5.json
256
./cli-9749-10.json
256
./cli-9749-9.json
256
./cli-9749-6.json
256
./cli-9749-7.json
256
./cli-9749-1.json
256

andyfeller@Andys-MBP:~ $ for i in {2..10}; do \
    diff <(jq 'sort' cli-9749-1.json) <(jq 'sort' "cli-9749-$i.json");
done
andyfeller@Andys-MBP:~ $

I don't believe gh search caches results of these calls, however it should be easy enough to add gh config clear-cache between gh search prs calls to ensure it clears any potential cache.

Next steps

@wknapik : I'd like you to attempt reproducing this problem while preserving the results from the gh search prs commands and sharing your thoughts on where you are seeing variants.

wknapik commented 2 days ago

Hi @andyfeller. Your code does not appear to check for duplicates, only for consistency. When I run it, I get the same results you do. There's 256 results, including duplicates, identical every time, not affected by running gh config clear-cache.

wknapik commented 2 days ago

I think we might be misunderstanding each other.

The point is that there are duplicate results. There shouldn't be any, regardless of what value is passed to --limit.

The commands I shared are meant to illustrate that for the --limit value of 256, I got 56 duplicates, for 275 I got 50 duplicates and for 300 there were no duplicates.

wknapik commented 2 days ago

It's also interesting to note that up to a point, the duplicates are added until the total number of results returned is equal to the value of --limit. Past that point, there are no duplicates anymore, regardless of the --limit value.

wknapik commented 2 days ago

cli-9749-1.json ends with

{"number":25808},{"number":25806},{"number":25805},{"number":25804},{"number":25802},{"number":25799},{"number":25798},{"number":25797},{"number":25796},{"number":25795},{"number":25794},{"number":25793},{"number":25792},{"number":25788},{"number":25785},{"number":25784},{"number":25783},{"number":25780},{"number":25779},{"number":25777},{"number":25776},{"number":25775},{"number":25774},{"number":25771},{"number":25770},{"number":25769},{"number":25768},{"number":25767},{"number":25764},{"number":25761},{"number":25760},{"number":25759},{"number":25758},{"number":25757},{"number":25752},{"number":25750},{"number":25749},{"number":25748},{"number":25747},{"number":25744},{"number":25743},{"number":25740},{"number":25738},{"number":25737},{"number":25736},{"number":25735},{"number":25734},{"number":25731},{"number":25730},{"number":25726},{"number":25721},{"number":25720},{"number":25719},{"number":25718},{"number":25717},{"number":25716}

which is a repetition of an identical sequence earlier in the file - those are the 56 duplicates