fedora-copr / copr

RPM build system - upstream for https://copr.fedorainfracloud.org/

misc: let's make copr_new_packages a lot faster #3487

Closed nikromen closed 6 days ago

nikromen commented 1 month ago

let's discuss this on planning first

FrostyX commented 1 month ago

I tried to run the original script with a profiler

python -m cProfile misc/copr_new_packages.py --since 2024-10-01
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    311/1    0.003    0.000 1632.013 1632.013 {built-in method builtins.exec}
        1    0.001    0.001 1632.013 1632.013 copr_new_packages.py:1(<module>)
        1    0.000    0.000 1631.228 1631.228 copr_new_packages.py:109(main)
        1    0.015    0.015 1610.871 1610.871 copr_new_packages.py:30(pick_project_candidates)
     3091 1588.260    0.514 1593.087    0.515 {method 'read' of '_io.BufferedReader' objects}
     1948    0.007    0.000 1589.399    0.816 copr_new_packages.py:84(is_in_fedora)
     1948    0.023    0.000 1589.392    0.816 subprocess.py:417(check_output)
     1948    0.026    0.000 1589.365    0.816 subprocess.py:506(run)
     1948    0.020    0.000 1588.328    0.815 subprocess.py:1165(communicate)
       42    0.000    0.000   41.750    0.994 helpers.py:71(wrapper)
       32    0.002    0.000   41.716    1.304 requests.py:38(send)
       32    0.000    0.000   41.695    1.303 requests.py:49(_send_request_repeatedly)

You are right that the is_in_fedora function is the biggest waste of time. A single call isn't that slow (about 0.8 s), but it is called 1948 times, which adds up to roughly 1589 s of the 1632 s total.
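For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch of collecting a profile and sorting it by cumulative time with the standard cProfile and pstats modules. The functions slow_lookup and pick_candidates are hypothetical stand-ins, not the real code from copr_new_packages.py.

```python
import cProfile
import io
import pstats
import time


def slow_lookup(name):
    # Hypothetical stand-in for a slow per-package check such as is_in_fedora.
    time.sleep(0.001)
    return name.startswith("a")


def pick_candidates(names):
    # Calls the slow check once per name, like the profiled script does.
    return [n for n in names if not slow_lookup(n)]


def profile_report(func, *args, top=5):
    """Run func under cProfile and return the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(top)
    return out.getvalue()


if __name__ == "__main__":
    print(profile_report(pick_candidates, ["abc", "xyz", "foo"] * 10))
```

The cumtime column is the one to read here: a function that is cheap per call can still dominate the run if ncalls is large.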

Instead of doing the proposed async thing and bombarding Koji with thousands of requests in parallel, I'd rather use something like this:

fedora-repoquery rawhide "*"

to get the list of all Fedora Rawhide packages at once (it takes 5-10 s), and update is_in_fedora to check whether the package is present in that list.
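A rough sketch of that idea, under a few assumptions: the command prints one NEVRA-like string per line (name-version-release.arch), and the helper names names_from_nevras and load_fedora_packages are made up for illustration. The real is_in_fedora in the script may have a different signature.

```python
import subprocess


def names_from_nevras(lines):
    """Extract package names from lines shaped like name-version-release.arch.

    Assumption: this is the output format; lines that don't fit are skipped.
    """
    names = set()
    for line in lines:
        # Split from the right so hyphens inside the package name survive.
        parts = line.strip().rsplit("-", 2)
        if len(parts) == 3:
            names.add(parts[0])
    return names


def load_fedora_packages():
    """Run the query once and return the set of Fedora Rawhide package names."""
    output = subprocess.check_output(
        ["fedora-repoquery", "rawhide", "*"], text=True
    )
    return names_from_nevras(output.splitlines())


def is_in_fedora(name, fedora_packages):
    # Set membership instead of one subprocess call per package.
    return name in fedora_packages
```

The list would be loaded once at startup and passed to every check, so the cost is one 5-10 s query plus a set lookup per package instead of one subprocess call per package.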

praiskup commented 1 month ago

TIL there's fedora-repoquery, nice.

nikromen commented 2 weeks ago

> It isn't that slow but it's just called many times.

It isn't, as long as you stick with fetching 1000 packages, but that only reaches packages that are about 10 days old at best, so the pool to choose from is really thin. If you want to cover everything from the latest Fedora Magazine to today, you need to fetch more than 1000 packages (e.g. 10k), which takes ages.

> fedora-repoquery rawhide "*"

Really nice, I didn't know about this feature! This is even better and simpler.

1000 packages, before:

time python misc/copr_new_packages.py --since 2024-03-01

.
.
.

________________________________________________________
Executed in  268.26 secs    fish           external
   usr time   57.93 secs    0.00 micros   57.93 secs
   sys time    7.46 secs  392.00 micros    7.46 secs

1000 packages, after:

time python misc/copr_new_packages.py --since 2024-03-01

.
.
.

________________________________________________________
Executed in   27.79 secs    fish           external
   usr time    2.33 secs    0.00 micros    2.33 secs
   sys time    0.26 secs  396.00 micros    0.26 secs

10k packages, before: this would take hours, so I didn't run it.

10k packages, after:

time python misc/copr_new_packages.py --since 2024-03-01 --limit 10000

.
.
.

________________________________________________________
Executed in  500.43 secs    fish           external
   usr time   21.80 secs    0.00 micros   21.80 secs
   sys time    1.44 secs  433.00 micros    1.44 secs
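To put the wall-clock numbers side by side, a quick calculation using only the "Executed in" values from the runs above (variable names are just for illustration):

```python
# Wall-clock times in seconds, copied from the runs above.
before_1000 = 268.26
after_1000 = 27.79
after_10k = 500.43

# How much faster the 1000-package run became.
speedup_1000 = before_1000 / after_1000

# How the new 10k run compares to the old 1000-package run.
ratio_10k_vs_old_1000 = after_10k / before_1000

print(f"1000 packages: {speedup_1000:.1f}x faster")
print(f"10k after vs 1000 before: {ratio_10k_vs_old_1000:.1f}x the time")
```

So the 1000-package run is roughly 9.7x faster, and the 10k run now takes a bit under twice as long as the old 1000-package run did.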

pls try :pray: