Having worked on the scraper for a while now, it is pretty stable, extends out to GitLab repositories, and is a bit more stable about the people attached to a project. You can find the current version here (not published to npm).
Technology stuff:
I fear the output data format doesn't match the specification quite yet: 2022-06-21T201258_912Z.zip
Sweet :-) Will check in detail later. Sounds great. I guess it would be cool if we can run it from time to time and commit the different timestamped versions we get to a public repo, so static pages can pull the data from there to do interesting things with it.
Thank you so much :smiley:
Ok, checked the code. I guess it is fine. It seems to work differently than what I expected, but there are many ways to write things. To be honest, I was hoping it would work without any API tokens, so anyone could just clone and run the thing if they wanted, without signing up to GitHub and creating an API token, but at least we have it now, that's cool :-)
Also, it seems the zip file produces a `packages.json` and a `repos.json`, but it is not clear what is or can be in those?

I checked your example output, which seems to include `packages.json` and `repos.json`, which to me feels like the same thing.

`packages.json` has per package:

`repos.json`:
- `git url` (hopefully a unique identifier - would be cool to use it to map to corresponding `packages` entries)
- `package` ...I guess additional information about the package? The `key` in the map (object) would uniquely identify it. `package` is also an array ... which one is that? Is that because of multirepo support? Does anyone do that?
- `dependencies` :+1:
- `dependents` :+1: ...by the way, here they are listed as more canonical URLs, which I like
- `contributors` ...is that the same as people, or different? Now we have it twice?

I think the reason I did not include person/people is because it was unclear how to deal with them and it seemed like lots of work. Now we have `people` and `contributors`, and it's not standardized. Ideally we would of course, similar to organisations, have a `people.json` file where we have exactly one entry per person, and we use their identifier to link them to the projects, similar to how it is done in the code snippet below with the organisations, which seem easier, because for now an organisation is just its URL and a list of project URLs.
The code snippet from the task description imagines the following output:
```js
const valuenetwork = {}   // => valuenetwork.json
const projects = {}       // => projects.json
const organisations = {}  // => organisations.json

function add (package_json_url, package_json, dependencies, dependents) {
  const url = package_json_url // e.g. https://github.com/hypercore-protocol/hypercore
  // @INFO: what we are interested in:
  const {
    name, version, description, author, contributors = [], homepage,
    keywords = [], bugs, license, funding, repository = {}
  } = package_json
  const pkg = { name, version, description, author, contributors, homepage, keywords, bugs, license, funding, repository }
  // `dependents` is an array of github repository urls
  const customers = dependents
  // `dependencies` is an array of github repository urls
  const suppliers = dependencies
  // e.g. https://github.com/hypercore-protocol
  const org = url.split('/').slice(0, -1).join('/')
  const project = {
    name: pkg.name,
    version: pkg.version,
    description: pkg.description,
    keywords: pkg.keywords,
    homepage: pkg.homepage,
    bugs: pkg.bugs,
    license: pkg.license,
    people: [pkg.author, ...pkg.contributors],
    funding: pkg.funding,
    repository: pkg.repository,
  }
  valuenetwork[url] = { url, customers, suppliers }
  projects[url] = { url, org, blessed: true || false, project } // blessed: true means it's in `blessed.json`
  // collect all projects of an organisation instead of overwriting the entry
  const entry = organisations[org] || { url: org, projects: [] }
  entry.projects.push(url)
  organisations[org] = entry
}
```
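For illustration, calling the sketch above with made-up inputs (the URLs and package fields below are invented, not real scrape results) would populate the three maps:

```js
// Hypothetical example call - all values invented for illustration.
add(
  'https://github.com/hypercore-protocol/hypercore',
  { name: 'hypercore', version: '10.0.0', license: 'MIT', keywords: ['dat'] },
  ['https://github.com/sodium-friends/sodium-universal'], // dependencies => suppliers
  ['https://github.com/hypercore-protocol/hyperbee']      // dependents => customers
)
console.log(valuenetwork['https://github.com/hypercore-protocol/hypercore'])
// => { url: '...', customers: ['...hyperbee'], suppliers: ['...sodium-universal'] }
```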
I was thinking the `url` would be a unique identifier, but some people might install from npm, some might install from GitHub, and some from elsewhere. The `package.json` also has options for not using the npm name but something else, so maybe this isn't as trivial as I thought, but I hoped we could just roll with some sort of standardized convention to end up with a unique URL.
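One possible convention - just a sketch of the idea, not something the scraper is confirmed to do - would be to normalize every repository reference we encounter into a single canonical https URL before using it as a key:

```js
// Sketch of a normalization convention (assumption: we treat the plain
// https form, without `git+` prefix or `.git` suffix, as canonical).
function canonicalUrl (ref) {
  let url = ref.trim()
  url = url.replace(/^git\+/, '')                    // git+https://... -> https://...
  url = url.replace(/^git@([^:]+):/, 'https://$1/')  // git@github.com:a/b -> https://github.com/a/b
  url = url.replace(/\.git$/, '')                    // strip trailing .git
  url = url.replace(/\/+$/, '')                      // strip trailing slashes
  return url
}

canonicalUrl('git+https://github.com/hypercore-protocol/hypercore.git')
// => 'https://github.com/hypercore-protocol/hypercore'
```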
The `valuenetwork.json` is otherwise just supposed to be an object (a map) that maps the package/repo/module/etc. - we could just always call it a "project" - via the project's unique identifier (e.g. a canonical URL) to an array of dependents (= customers) and dependencies (= suppliers). These are the edges of the graph which forms the "value network". An example entry is sketched below.
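For example (values made up, not actual scraper output):

```json
{
  "https://github.com/hypercore-protocol/hypercore": {
    "url": "https://github.com/hypercore-protocol/hypercore",
    "customers": ["https://github.com/hypercore-protocol/hyperbee"],
    "suppliers": ["https://github.com/sodium-friends/sodium-universal"]
  }
}
```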
The `projects.json` does not have any kind of information about the relationships between projects, so dependencies and dependents are cut out from this, and it only includes the specific information (if available) that is shown in the code snippet above, because that seems to be the standard stuff we might be interested in. We might change or add to that in the future, for example social media accounts or whatever else we think makes sense to have.
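A single made-up entry following the snippet (field values invented for illustration):

```json
{
  "https://github.com/hypercore-protocol/hypercore": {
    "url": "https://github.com/hypercore-protocol/hypercore",
    "org": "https://github.com/hypercore-protocol",
    "blessed": true,
    "project": {
      "name": "hypercore",
      "version": "10.0.0",
      "description": "...",
      "keywords": ["dat"],
      "license": "MIT",
      "people": [],
      "repository": {
        "type": "git",
        "url": "git+https://github.com/hypercore-protocol/hypercore.git"
      }
    }
  }
}
```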
The `organisations.json` file: additionally, the repo URL usually - almost always - gives away the org name or username the repo belongs to, and that gives us the option to grab some additional information from there.
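Per the code snippet, an entry would for now just be the org URL plus the list of its project URLs, e.g.:

```json
{
  "https://github.com/hypercore-protocol": {
    "url": "https://github.com/hypercore-protocol",
    "projects": ["https://github.com/hypercore-protocol/hypercore"]
  }
}
```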
...even though it is not listed, having also a `people.json` file would be amazing, but I just skipped it in the first iteration because I thought it gets quite involved to also scrape the contributors from commits and other things.
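If we do add it later, a `people.json` could follow the same one-entry-per-identifier pattern - purely imagined here, the scraper does not produce this yet:

```json
{
  "https://github.com/example-user": {
    "url": "https://github.com/example-user",
    "projects": ["https://github.com/hypercore-protocol/hypercore"]
  }
}
```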
@martinheidegger if you don't mind, it would be really cool if we could standardize the data format of the output and document it, essentially by giving a good example entry for each of the output files instead of a "type definition". I think that is what we need so that we can then base any frontend we might make on that output and know it won't change.
@martinheidegger do you think you can refine the scraper/crawler soon based on the comment above? It is urgently needed :-)
Also, could you update the `blessed.json` before you run the scraper for the first real data set? Below is the original `blessed.json` I started out with when I was playing with the scraper.

We can then make an agenda item for the consortium to check what the blessed file should include in the future. I know there are a bunch more important modules, like hyperbee, hyperdrive or autobase, but they are all dependents of hypercore anyway right now, so they will be included.
```json
[
  { "npm": "hypercore", "version": "*" },
  { "repoURL": "git+https://github.com/hypercore-protocol/hypercore-next" },
  { "npm": "@hyperswarm/dht", "version": "*" },
  { "npm": "hyperswarm", "version": "*" },
  { "npm": "@hyperswarm/dht-relay", "version": "*" },
  { "npm": "@hyperswarm/secret-stream", "version": "*" },
  { "npm": "hypercore-strong-link", "version": "*" },
  { "npm": "hyperdrive", "version": "*" },
  { "npm": "hyperbee", "version": "*" },
  { "npm": "autobase", "version": "*" }
]
```
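Since the list mixes `npm` names and direct `repoURL` entries, a crawler has to resolve the npm entries to repository URLs first. A minimal sketch of that step, assuming the public npm registry and Node 18+ global `fetch` (the real scraper may do this differently):

```js
// Resolve one blessed.json entry to a repository URL.
// Assumptions: the registry manifest exposes `repository.url`, and we only
// look at the `latest` dist-tag instead of honoring the version range.
async function resolveBlessedEntry (entry) {
  if (entry.repoURL) return entry.repoURL
  const res = await fetch(`https://registry.npmjs.org/${encodeURIComponent(entry.npm)}/latest`)
  if (!res.ok) throw new Error(`npm lookup failed for ${entry.npm}`)
  const manifest = await res.json()
  return manifest.repository && manifest.repository.url
}
```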
Once that is done, it would be great to run it once to produce the first data set with the fixed output format (people don't need to be included yet this time around) and publish that to a new GitHub repository.
Then we can close the task :-)
After a lot of experimentation and trying to figure out bugs in the data set, I am thoroughly exhausted by this work. Anyway, there is a lot to write about why this data structure is the way it is, but I need some sleep. I will write some more about this once I am a bit healthier, but see this zip for the output data:
Hm, I quickly checked, and I am not entirely sure which fields will be included and which won't in all cases, but I expected to see `valuenetwork.json`, `projects.json` and `organisations.json` structured in the way shown in the code snippet from the previous comments, and people skipped for now - or rather, even if the people are scraped, that the output doesn't yet include people.

Now if on top of the above we also already have a `people.json`, I guess that's ok, and if we also have an `errors.json`, etc., that's fine, but the above files are missing and the content of the current files doesn't look at all like the expected output.

Hmmm... that's just a bit confusing.
Following our conversation I added documentation to the scraper and cleaned & changed the output data.
https://github.com/dat-ecosystem/dat-garden-rake#dat-garden-rake
Currently there is a GitHub action running with a cleared cache that hopefully - once finished - will publish the data through GitHub pages. https://github.com/dat-ecosystem/dat-garden-rake/actions/runs/2597447898
This is the output of a recent, local execution:
Finally I managed to get the scraper to complete on github actions. The gh-pages branch contains the latest data (which means it also keeps previous run-results in storage). You will find the published version here: https://dat-ecosystem.org/dat-garden-rake/index.json
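That should also make the earlier "static pages can pull it" idea straightforward - a quick sketch, assuming only that `index.json` describes the available runs (its exact shape isn't documented in this thread):

```js
// Sketch: pull the published data from gh-pages and inspect it.
const res = await fetch('https://dat-ecosystem.org/dat-garden-rake/index.json')
const index = await res.json()
console.log(index) // see which timestamped data sets exist
```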
With the scraper now running weekly and producing versioned data I am considering my work on this finished. Can we close this issue?
@martinheidegger Thanks for the work on this task. Much appreciated :)
@todo
- @input :package: https://npm.org
- @input :package: https://github.com
- @output :package: (see ##info section below)
- @output :package: (see ##info section below)
- @output :package: screencast video about scraper
- @input :package: (see ##info section below)
- @input :package: screencast video about scraper
- @input :package: ./data/blessed.json
- @output :package: scraper/crawler code
- @input :package: ./data/blessed.json (with [ 'https://github.com/hypercore-protocol/hypercore' ])
- @input :package: [scraper/crawler code]
- @input :package: scraper/crawler code
- @output :package: ./<timestamp>/valuenetwork.json
- @output :package: ./<timestamp>/packages.json
- @output :package: ./<timestamp>/organisations.json
- @output :package: ./<timestamp>/index.json
- @output :package: ./index.json
@info
- estimated duration: 2 days
- estimated budget: 640 USD
concept
- The scraper can be executed locally to scrape `package.json` data from npm and GitHub and crawl for all `dependents` and `dependencies`, and repeat the process, starting from a `blessed.json` list of initial GitHub repositories, until all dependents and dependents of dependents, but also all dependencies and dependencies of dependencies, have been found and saved as timestamped JSON files to disk, so they can be committed and pushed to a GitHub repository with the results. (See the crawl sketch at the end of this section.)
- Deal with rate limits: of course, a task can't be resumed if it's not the same day anymore, because that would produce a different timestamped JSON and therefore needs a fresh run anyway that wipes the database before trying to scrape everything from scratch.
basic prototype
- what to store in the files mentioned in the tasks above
- how to query for dependents on github and npm
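To make the concept concrete, here is a minimal sketch of the crawl loop described above. `fetchPackageJson`, `fetchDependencies` and `fetchDependents` are hypothetical placeholder helpers (the real scraper works differently), `resolveBlessedEntry` is the sketch from earlier in the thread, and `add` is the function from the task's code snippet:

```js
// Breadth-first crawl over the dependents/dependencies graph, starting from
// the blessed list, until no new repositories are found.
async function crawl (blessed) {
  const queue = []
  for (const entry of blessed) queue.push(await resolveBlessedEntry(entry))
  const seen = new Set()
  while (queue.length > 0) {
    const url = queue.shift()
    if (!url || seen.has(url)) continue
    seen.add(url)
    const package_json = await fetchPackageJson(url)  // hypothetical helper
    const dependencies = await fetchDependencies(url) // hypothetical helper
    const dependents = await fetchDependents(url)     // hypothetical helper
    add(url, package_json, dependencies, dependents)
    queue.push(...dependencies, ...dependents) // crawl both directions
  }
  // afterwards: write valuenetwork.json, projects.json, organisations.json
  // to a timestamped folder so the run can be committed and pushed
}
```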