M4SS-Code / cargo-goggles

Verify that registry crates in your Cargo.lock are reproducible from the git repository
https://crates.io/crates/cargo-goggles
Apache License 2.0

crates.io top 20k crates verification process #11

Open paolobarbolini opened 3 months ago

paolobarbolini commented 3 months ago

I've scraped the top 20k crates by recent download count (I was lazy; I should have used the database dumps). I've published the list and the script at https://gist.github.com/paolobarbolini/b5101b3ad378bcb6bc5c282349edfd4c.

I'll soon be getting a server from Hetzner with 320 GB of disk to see whether I can go through the entire list without running out of disk space. I'll also use the list to fix some of the shortcomings that have been reported in other issues.

⚠️⚠️⚠️ WARNING ⚠️⚠️⚠️

Before you open issues in the projects you think are affected, investigate the reports thoroughly. This software is still v0.0.1 for a very good reason.

paolobarbolini commented 3 months ago

While analyzing crates I caught an error and got stuck trying to fetch submodules from https://github.com/quickwit-oss/tantivy/tree/dff022b30aff6bcd4df7e908f6fa2f86e551204b because it used git over SSH. I guess GIT_TERMINAL_PROMPT=0 isn't enough.
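
One possible way to keep git from waiting on SSH input, as a minimal and untested sketch: GIT_TERMINAL_PROMPT only covers git's own credential prompts, so this also sets GIT_SSH_COMMAND to run ssh with BatchMode=yes, which makes ssh fail instead of asking for a passphrase, password, or host-key confirmation. The helper name `noninteractive_git` is hypothetical and is not part of cargo-goggles.

```rust
use std::process::{Command, Stdio};

/// Hypothetical helper: build a `git` command that should fail rather than
/// wait for user input. Not taken from cargo-goggles.
fn noninteractive_git(args: &[&str]) -> Command {
    let mut cmd = Command::new("git");
    cmd.args(args)
        // stop git itself from prompting for credentials
        .env("GIT_TERMINAL_PROMPT", "0")
        // BatchMode=yes disables ssh passphrase, password and host-key prompts
        .env("GIT_SSH_COMMAND", "ssh -o BatchMode=yes")
        // detach stdin so nothing can block on terminal input
        .stdin(Stdio::null());
    cmd
}

fn main() {
    // Only prints the configured command; call `.status()` to actually run it.
    let cmd = noninteractive_git(&["submodule", "update", "--init", "--recursive"]);
    println!("{cmd:?}");
}
```

With this setup a submodule that is only reachable over SSH should produce an error that the caller can log and skip, rather than a hang.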

paolobarbolini commented 3 months ago

I've opened #12 to stop missing repositories from continuously blocking the clone process. This patch is already present on the machine I'm doing the scanning from.

paolobarbolini commented 3 months ago

The clone process was interrupted by https://github.com/ntex-rs/ntex/pull/333 :sweat_smile:. I've applied a patch locally for now and I'll see how to fix it permanently. It turns out that with a large enough pool of crates, almost every ensure! will probably get hit at some point.
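
As an illustration of the general idea (not the actual fix), a batch run can record a failed check for one crate and move on instead of aborting the whole scan. Everything in this sketch is hypothetical: `verify` stands in for whatever per-crate step contains the ensure!, and the crate names are made-up sample input.

```rust
/// Hypothetical stand-in for one crate's verification step.
fn verify(name: &str) -> Result<(), String> {
    if name.is_empty() {
        return Err("empty crate name".to_owned());
    }
    Ok(())
}

/// Run every check and collect the failures instead of stopping at the first.
fn verify_all<'a>(names: &[&'a str]) -> Vec<(&'a str, String)> {
    names
        .iter()
        .filter_map(|&name| verify(name).err().map(|err| (name, err)))
        .collect()
}

fn main() {
    // made-up sample input; the empty name triggers the failure path
    let failures = verify_all(&["crate-a", "", "crate-b"]);
    for (name, err) in &failures {
        eprintln!("skipping {name:?}: {err}");
    }
}
```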

paolobarbolini commented 3 months ago

> While analyzing crates I caught an error and got stuck trying to fetch submodules from https://github.com/quickwit-oss/tantivy/tree/dff022b30aff6bcd4df7e908f6fa2f86e551204b because it used git over SSH. I guess GIT_TERMINAL_PROMPT=0 isn't enough.

It just happened again with the investments crate.

paolobarbolini commented 3 months ago

I'm starting to analyze the logs, which I'll publish once we finish analyzing all crates. I've already found one problem, which I've reported at https://github.com/rust-db/refinery/issues/323.

paolobarbolini commented 3 months ago

I've also opened https://github.com/TimelyDataflow/timely-dataflow/issues/559

link2xt commented 3 months ago

Not sure if it is in top 20k crates, but here is a pgp crate issue: https://github.com/rpgp/rpgp/issues/327

EDIT: pgp is in top 20k

paolobarbolini commented 3 months ago

Processing got stuck at ~18.5k crates. Here's the log: output.log.gz

WARNING: I've already verified that there are a lot of false positives.

I'll merge #18, #19 and #20 locally and re-run it on all crates.

paolobarbolini commented 3 months ago

> pgp is in top 20k

Don't worry about the 20k limit. It's just a number I picked for the "official" scrape after having done a very rough 5k one over the previous few days :smiley:

link2xt commented 3 months ago

Issue for brotli, have not tested if it is reproducible: https://github.com/dropbox/rust-brotli/issues/178

async-rusqlite PR to add repository: https://github.com/jsdw/async-rusqlite/pull/2

paolobarbolini commented 3 months ago

Here are the results from the second run: output.log.gz

paolobarbolini commented 3 months ago

There are too many crates without a repository field. I'd like to start opening issues on crates that have recently released new versions, since their maintainers are the most likely to respond. I wrote this very rough scraper to find the last updated date of each crate in the list:

Cargo.toml:

[package]
name = "cargo-recent-crates"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json", "blocking"] }
chrono = { version = "0.4", features = ["serde"] }
serde = { version = "1", features = ["derive"] }

src/main.rs:

use std::{thread, time::Duration};

use chrono::{DateTime, Utc};

fn main() {
    // identify the scraper to crates.io through the User-Agent header
    let client = reqwest::blocking::Client::builder()
        .user_agent("https://github.com/M4SS-Code/cargo-goggles/issues/11 scraping crates with no repository field")
        .build()
        .unwrap();

    // names of the crates to look up
    let crates: &[&str] = &[
        // put crates here
    ];

    #[derive(Debug, serde::Deserialize)]
    struct C {
        #[serde(rename = "crate")]
        c: Cr,
    }

    #[derive(Debug, serde::Deserialize)]
    struct Cr {
        updated_at: DateTime<Utc>,
    }

    for k in crates {
        // try each crate up to three times before giving up on it
        for _ in 0..3 {
            let res = client
                .get(format!("https://crates.io/api/v1/crates/{k}"))
                .send()
                .and_then(|r| r.json::<C>());

            match res {
                Ok(j) => {
                    // one tab-separated line per crate: name, last update
                    println!("{k}\t{}", j.c.updated_at.to_rfc3339());
                    break;
                }
                Err(err) => {
                    // request or decoding failed: log, back off, retry
                    eprintln!("{err:?}");
                    thread::sleep(Duration::from_secs(5));
                }
            }
        }

        // pause between crates to stay gentle on the crates.io API
        thread::sleep(Duration::from_secs(2));
    }
}

link2xt commented 3 months ago

Maybe also make a post on Mastodon with #rust and #rustlang tags asking maintainers to add repository field? Then at least some will set it before you have to make an issue.

paolobarbolini commented 3 months ago

> Maybe also make a post on Mastodon with #rust and #rustlang tags asking maintainers to add repository field? Then at least some will set it before you have to make an issue.

Sounds like a good idea.

In the meantime here's the list (it's actually .tsv but GitHub didn't like it): crates.csv
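
For anyone who wants to pick the most recently updated crates out of a list like this, here is a rough sketch that orders `name<TAB>timestamp` rows newest first. It assumes every timestamp is RFC 3339 with the same UTC offset, so plain string comparison gives the right order. The function name `newest_first` and the sample rows are made up for illustration.

```rust
/// Parse `name<TAB>timestamp` lines and return them newest first.
/// Lines without a tab are skipped.
fn newest_first(tsv: &str) -> Vec<(&str, &str)> {
    let mut rows: Vec<(&str, &str)> = tsv
        .lines()
        .filter_map(|line| line.split_once('\t'))
        .collect();
    // assumption: all timestamps share one UTC offset, so they sort as strings
    rows.sort_by(|a, b| b.1.cmp(a.1));
    rows
}

fn main() {
    // made-up sample rows in the same shape as the scraper output
    let sample = "crate-a\t2023-11-02T10:00:00+00:00\ncrate-b\t2024-03-15T08:30:00+00:00\n";
    for (name, updated_at) in newest_first(sample) {
        println!("{name}\t{updated_at}");
    }
}
```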

paolobarbolini commented 3 months ago

I haven't posted it on Twitter or Mastodon yet, or checked whether cargo could make it more obvious when the repository field is missing, but I did open a few issues on projects I recognized from the list and I've gotten this response [^1], which is an interesting wake-up call [^2].

I'm not sure opening issues this way is doable at this point. For one, we're still just 3 people playing with our toys and figuring out what to do with them. I think I'll dedicate more time to the development side to get something much more usable than the current version and see whether this can also help others, be it in CLI or library form.

[^2]: Full disclosure: we're not going to monetize this, but there are other benefits we could enjoy as a company, like the publicity.