aswinnnn / pyscan

python dependency vulnerability scanner, written in Rust.
MIT License

Use batch API for OSV #5

Closed by sarimak 1 year ago

sarimak commented 1 year ago

pyscan is very slow for a repo with 429 3rd-party packages (even though speed is claimed to be the #1 feature of pyscan according to the homepage). If I am not mistaken, it makes one HTTP request to the OSV API per scanned package despite the fact that we know all scanned packages and versions in advance. OSV has a batch API which could be leveraged to make pyscan IMO much faster: https://google.github.io/osv.dev/post-v1-querybatch/
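To illustrate the idea, here is a minimal Python sketch of a single batched query, not pyscan's actual implementation. The helper names (`parse_pinned`, `build_batch_payload`, `query_osv_batch`) are made up for this example, and the parser assumes simple `name==version` pins only (no extras, hashes, or includes). The endpoint and payload shape follow the OSV querybatch documentation linked above.

```python
import json
import urllib.request

OSV_BATCH_URL = "https://api.osv.dev/v1/querybatch"


def parse_pinned(lines):
    """Return (name, version) pairs for simple `name==version` pins.

    Hypothetical helper: unpinned or otherwise complex lines are skipped.
    """
    pins = []
    for raw in lines:
        line = raw.split("#", 1)[0]   # drop comments
        line = line.split(";", 1)[0]  # drop environment markers
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins.append((name.strip(), version.strip()))
    return pins


def build_batch_payload(pins):
    """Build one querybatch body covering every pinned package."""
    return {
        "queries": [
            {"package": {"name": name, "ecosystem": "PyPI"}, "version": version}
            for name, version in pins
        ]
    }


def query_osv_batch(pins):
    """Send all queries in one POST; results come back in query order."""
    body = json.dumps(build_batch_payload(pins)).encode("utf-8")
    request = urllib.request.Request(
        OSV_BATCH_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        results = json.load(response)["results"]
    # Pair each package with the IDs of the vulnerabilities reported for it.
    return [
        (name, version, [v["id"] for v in result.get("vulns", [])])
        for (name, version), result in zip(pins, results)
    ]
```

With this shape, scanning a requirements.txt costs one round trip instead of one per line, e.g. `query_osv_batch(parse_pinned(open("requirements.txt")))`. The batch response only carries vulnerability IDs, so full details would still need a follow-up lookup per reported ID.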

BTW: What is the added value of Rust in an app that just parses a text file, makes an HTTP call and formats the results? Compiling pyscan-rs takes ages, and perhaps pure Python code could be fast enough?

aswinnnn commented 1 year ago

> pyscan is very slow for a repo with 429 3rd-party packages

pyscan has 10 dependencies/crates, as visible in the Cargo.toml. These dependencies have their own dependencies, which results in 160 crates on Linux and 167 crates on Windows. This is not uncommon for a Rust project. I'm not sure where you are picking up 429 "packages", but regardless, pyscan uses the most common crates any Rust dev would have used before. It might seem hard for you to acclimatize to, but it gets better over time.

Thanks for letting me know about the batch query, it is better, and I figured sooner or later it would have been the better choice. It is currently being developed and will be released in the next version.

> BTW: What is the added value of Rust in an app that just parses a text file, makes an HTTP call and formats the results? Compiling pyscan-rs takes ages, and perhaps pure Python code could be fast enough?

I understand the sentiment. Pyscan is in its alpha stage and hasn't been through the necessary optimizations any established project would have. The main idea behind it was having a single binary capable of executing what Pyscan claims to do, instead of having to depend on the user having a python runtime. It's useful in terms of a CI where you want to minimize the number of things you install, and I provide releases just for that and other cases.

I don't think the language has any relevance here, only the implementation. Switching to batch requests is a priority, and I appreciate the concern.

sarimak commented 1 year ago

By the 429 3rd-party packages I meant the number of lines in my project's requirements.txt -- sorry for the confusion. So the (runtime) slowness of pyscan I encountered was most likely caused by the 429 remote API calls when scanning for the vulnerabilities in my project.

sarimak commented 1 year ago

I get the point -- I just would not be that afraid of the things installed into the CI image because once a project starts using something like pre-commit, it becomes necessary to install Python and some Python packages into your CI image anyway.

At least that's where we ended up at work. Where I really appreciate Rust's speed is when running the pre-commit hooks (flake8 -> Ruff is a huge step forward - AST parsing is CPU-bound).

aswinnnn commented 1 year ago

I understand that Python is probably on every developer's system, but I don't think it's worth changing languages now, lol. This project is experimental atm, and it's my first time messing with Rust as well, so improvements are expected over time as I get better. It's nice to have input from someone who might need it for work, though. Appreciate that. And yeah, those per-package API calls should be replaced ASAP. Working on it right now. Hopefully I'll be able to improve the speed further in the future.

aswinnnn commented 1 year ago

Hey @sarimak, the batched API is now the default way of doing things. Can you test it out on your big requirements file again? I tried it out with 230+ packages, and it took about 20 seconds to complete. I figure it might take double that on yours; curious to see the result, though.