Open paolobarbolini opened 3 months ago
While analyzing crates I caught an error and got stuck trying to fetch submodules from https://github.com/quickwit-oss/tantivy/tree/dff022b30aff6bcd4df7e908f6fa2f86e551204b because it used git over SSH. I guess `GIT_TERMINAL_PROMPT=0` isn't enough.
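A minimal sketch of one way to keep git from waiting on SSH input, assuming the hang comes from an SSH password or host-key prompt. None of this is taken from the project: the function name is hypothetical, and pairing `GIT_TERMINAL_PROMPT=0` with `GIT_SSH_COMMAND="ssh -o BatchMode=yes"` is an assumption about what might help. `BatchMode=yes` asks OpenSSH to fail rather than prompt.

```rust
use std::path::Path;
use std::process::{Command, Stdio};

/// Builds (but does not run) a `git submodule update` command that should
/// fail instead of blocking on user input. Hypothetical helper, not project code.
pub fn non_interactive_submodule_update(repo_dir: &Path) -> Command {
    let mut cmd = Command::new("git");
    cmd.current_dir(repo_dir)
        .args(["submodule", "update", "--init", "--recursive"])
        // Stops git itself from prompting for HTTP(S) credentials.
        .env("GIT_TERMINAL_PROMPT", "0")
        // Assumption: SSH prompts are the remaining cause of the hang.
        .env("GIT_SSH_COMMAND", "ssh -o BatchMode=yes")
        // Nothing to read from, so a stray prompt cannot wait on stdin.
        .stdin(Stdio::null());
    cmd
}
```

Calling `.status()` on the returned command would then run it; a submodule that needs SSH credentials should produce a non-zero exit status instead of a stuck process.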
I've opened #12 to stop missing repositories from continuously blocking the clone process. This patch is already present on the machine I'm doing the scanning from.
The clone process was interrupted by https://github.com/ntex-rs/ntex/pull/333 :sweat_smile:. I've applied a patch locally for now and I'll see how to fix it permanently. Turns out that with a large enough pool of crates almost every `ensure!` will probably get hit at some point.
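To illustrate why a single failed check should not stop a large run, here is a small sketch that records per-crate failures and keeps going. Everything in it is an assumption made for illustration: the `ensure!` macro is a local stand-in (presumably the real one comes from an error-handling crate such as `anyhow`), and `analyze` and `analyze_all` are hypothetical names with placeholder checks.

```rust
/// Local stand-in for an `ensure!`-style macro, so the sketch has no dependencies.
macro_rules! ensure {
    ($cond:expr, $($msg:tt)+) => {
        if !$cond {
            return Err(format!($($msg)+));
        }
    };
}

/// Hypothetical per-crate check; the real analysis does far more than this.
fn analyze(name: &str) -> Result<(), String> {
    ensure!(!name.is_empty(), "crate name is empty");
    ensure!(name.is_ascii(), "crate name {name:?} is not ASCII");
    Ok(())
}

/// Runs `analyze` over every crate and collects failures instead of aborting.
pub fn analyze_all<'a>(names: &[&'a str]) -> Vec<(&'a str, String)> {
    let mut failures = Vec::new();
    for &name in names {
        if let Err(err) = analyze(name) {
            // Record the failure and move on to the next crate.
            failures.push((name, err));
        }
    }
    failures
}
```

The list of failures can be logged at the end of the run, so one unexpected crate costs a log line rather than a restart.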
The SSH submodule fetch just got stuck again, this time with the `investments` crate.
I'm starting to analyze the logs, which I'll publish once we finish analyzing all crates. I've already encountered something, which I've reported at https://github.com/rust-db/refinery/issues/323
I've also opened https://github.com/TimelyDataflow/timely-dataflow/issues/559
Not sure if it is in the top 20k crates, but here is a `pgp` crate issue: https://github.com/rpgp/rpgp/issues/327
EDIT: `pgp` is in the top 20k.
Processing got stuck at ~18.5k crates. Here's the log: output.log.gz WARNING: I've already verified that there are a lot of false positives.
I'll merge #18, #19 and #20 locally and have it re-run on all crates.
Don't worry about the 20k limit, it's just a number I've picked for doing the "official" scrape after having done a very rough 5k one in the previous days :smiley:
Issue for brotli, have not tested if it is reproducible: https://github.com/dropbox/rust-brotli/issues/178
`async-rusqlite` PR to add `repository`: https://github.com/jsdw/async-rusqlite/pull/2
Here are the results from the second run: output.log.gz
There are too many crates without a `repository` field. I'd like to start opening issues on crates that have recently released new versions, since those maintainers are the most likely to respond. I wrote this very rough scraper to find the last-updated date of each crate in the list:
```toml
# Cargo.toml
[package]
name = "cargo-recent-crates"
version = "0.1.0"
edition = "2021"

[dependencies]
reqwest = { version = "0.12", default-features = false, features = ["rustls-tls", "json", "blocking"] }
chrono = { version = "0.4", features = ["serde"] }
serde = { version = "1", features = ["derive"] }
```

```rust
// src/main.rs
use std::{thread, time::Duration};

use chrono::{DateTime, Utc};

/// Top-level shape of the `GET /api/v1/crates/{name}` response.
#[derive(Debug, serde::Deserialize)]
struct CrateResponse {
    #[serde(rename = "crate")]
    krate: CrateInfo,
}

#[derive(Debug, serde::Deserialize)]
struct CrateInfo {
    updated_at: DateTime<Utc>,
}

fn main() {
    let client = reqwest::blocking::Client::builder()
        .user_agent("https://github.com/M4SS-Code/cargo-goggles/issues/11 scraping crates with no repository field")
        .build()
        .unwrap();

    // The explicit type lets this compile even while the list is empty.
    let crates: &[&str] = &[
        // put crates here
    ];

    for name in crates {
        // Try each crate up to 3 times before giving up on it.
        for _ in 0..3 {
            let response = client
                .get(format!("https://crates.io/api/v1/crates/{name}"))
                .send()
                .and_then(|r| r.json::<CrateResponse>());

            match response {
                Ok(body) => {
                    // Tab-separated output: crate name, then last update time.
                    println!("{name}\t{}", body.krate.updated_at.to_rfc3339());
                    break;
                }
                Err(err) => {
                    eprintln!("{err:?}");
                    thread::sleep(Duration::from_secs(5));
                }
            }
        }

        // Pause between crates to go easy on the crates.io API.
        thread::sleep(Duration::from_secs(2));
    }
}
```
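The scraper prints one `name<TAB>timestamp` line per crate. As a follow-up sketch, the function below picks out the crates updated on or after a cutoff, newest first. The function name and the cutoff format are assumptions for illustration. It compares timestamps as plain strings, which only works on the assumption that every timestamp carries the same UTC offset, as `DateTime<Utc>::to_rfc3339()` output does.

```rust
/// Parses `name<TAB>timestamp` lines and returns the names whose timestamp is
/// at or after `cutoff`, newest first. Lines without a tab are skipped.
///
/// Assumption: all timestamps are RFC 3339 strings with the same offset, so
/// comparing them as strings matches chronological order.
pub fn recently_updated<'a>(tsv: &'a str, cutoff: &str) -> Vec<&'a str> {
    let mut rows: Vec<(&str, &str)> = tsv
        .lines()
        .filter_map(|line| line.split_once('\t'))
        .filter(|(_, updated_at)| *updated_at >= cutoff)
        .collect();

    // Sort by timestamp, descending.
    rows.sort_by(|a, b| b.1.cmp(a.1));
    rows.into_iter().map(|(name, _)| name).collect()
}
```

A date prefix such as `"2024-01-01"` works as a cutoff because any timestamp on or after that day compares as greater than or equal to the prefix.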
Maybe also make a post on Mastodon with `#rust` and `#rustlang` tags asking maintainers to add the `repository` field? Then at least some will set it before you have to make an issue.
Sounds like a good idea.
In the meantime here's the list (it's actually .tsv but GitHub didn't like it): crates.csv
I haven't posted it on Twitter or Mastodon yet, or seen if `cargo` could make it more obvious when the `repository` field is missing, but I did open a few issues on projects I recognized from the list and I've gotten this response ^1, which is an interesting wake-up call [^2].
I'm not sure opening issues this way is doable at this point. For one, we're still just 3 people playing with our toys and figuring out what to do with them. I think I'll dedicate more time to the development side to get something much more usable than the current version, and see if this can also help others, be it in CLI or library form.
[^2]: Full disclosure: we're not going to monetize this, but there are other benefits we could enjoy as a company, like the publicity.
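On the idea of making a missing `repository` field more obvious: a rough sketch of such a check is below. It is a naive line scan written only for illustration, not how `cargo` or this project inspects manifests; the function name is hypothetical, and a real tool would use a proper TOML parser.

```rust
/// Returns true if the `[package]` table of a Cargo.toml appears to set
/// `repository`, either directly or as `repository.workspace = true`.
///
/// Naive by design: it ignores multi-line values, inline tables and comments.
pub fn has_repository_field(manifest: &str) -> bool {
    let mut in_package = false;
    for line in manifest.lines() {
        let line = line.trim();
        if line.starts_with('[') {
            // Track whether the current table is `[package]`.
            in_package = line == "[package]";
            continue;
        }
        if in_package {
            if let Some(rest) = line.strip_prefix("repository") {
                // Accept `repository = ...` and `repository.workspace = ...`,
                // but not longer keys that merely start with "repository".
                if rest.trim_start().starts_with('=') || rest.starts_with('.') {
                    return true;
                }
            }
        }
    }
    false
}
```

Run against the scraper's own manifest shown earlier in the thread, this would report the field as missing, since that manifest only sets `name`, `version` and `edition`.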
I've scraped (I'm lazy, I should have used the database dumps) the top 20k crates by recent downloads count. I've published the list and the script at https://gist.github.com/paolobarbolini/b5101b3ad378bcb6bc5c282349edfd4c.
I'll soon be getting a server from Hetzner with 320 GB of disk, and I'll see if I can go through the entire list without running out of disk space. I'll also use the list as a way of fixing some of the shortcomings which have been reported in other issues.
⚠️⚠️⚠️ WARNING ⚠️⚠️⚠️
Before you open issues in the projects you think are affected, investigate the reports thoroughly. This software is still v0.0.1 for a very good reason.