dat-ecosystem / organization

Organizational documentation for the dat-ecosystem
https://dat-ecosystem.org/organization/

write crawler/scraper to get the data #34

Closed - serapath closed this issue 2 years ago

serapath commented 2 years ago

@todo


@info

estimated duration:

2 days

estimated budget:

640 usd

concept

the scraper can be executed locally to scrape package.json data from npm and github and to crawl for all dependents and dependencies, repeating the process recursively: starting from a blessed.json list of initial github repositories, it continues until all dependents (and dependents of dependents) as well as all dependencies (and dependencies of dependencies) have been found and saved as timestamped json files to disk, so they can be committed and pushed to a github repository with the results.

deal with rate limits:

  1. instead of executing a scraping task directly, we would create a task object and store it in the "task database"
  2. the task database would then pull tasks from the front of the queue, execute them and store the results
  3. if the process crashes, we can restart it and it would just check whether there are tasks in the database and continue scraping
  4. executing work as tasks means a task might create more tasks, in case the data it finds identifies more data to be crawled 🙂
  5. the idea is to run it locally, have it spit out json files with a timestamp in the filename and then wipe the database

of course, a task can't be resumed if it's not the same day anymore, because that would produce a differently timestamped json; such a run needs a fresh start anyway, which wipes the database before trying to scrape everything from scratch
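
for illustration, a task stored in the task database could look roughly like this (the `id` and `status` fields are just an assumption about how the queue might track progress):

// illustration only - the exact fields are up to the implementer
const example_task = {
  id: 42,                                                  // identity of the task in the task database
  type: 'scrape-dependents',                               // which task handler should process it
  data: 'https://github.com/hypercore-protocol/hypercore', // input for that handler
  status: 'pending'                                        // e.g. pending | running | done | failed
}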

basic prototype

// 0. helper
const path = require('path')
const fs = require("fs")
const fs_extra = require('fs-extra') // has a .move() function
const rimraf = require('rimraf')
const mkdirp = require('mkdirp')

// 1. for storing intermediate results
const hypercore = require('hypercore')
const Hyperbee = require('hyperbee')
const ram = require('random-access-memory')

// 2. for doing the scraping and crawling from npm and github
const puppeteer = require('puppeteer')
const fetch = require('node-fetch')

// 3. @TODO: implement task executor to manage progress 
const task_executor = require('./task-executor')

// 4. paths to store data
const cwd = process.cwd()
const timestamp = new Date().toISOString().substr(0, 10)
const target = path.join(cwd, `data/${timestamp}`)
const blessed = path.join(cwd, `./data/blessed.json`) // includes a list of blessed github repo links
const cached = path.join(cwd, `./temp`)
// when it finishes, the results should be written to disk as follows:
// 1. make ./temp/projects.json
// 2. make ./temp/valuenetwork.json
// 3. move ./temp to ./data/<timestamp>
// 4. update ./data/versions.js to include new timestamp
// e.g.
const VERSIONS = { // versions.json
  latest: '2021.9.21',
  list: [
    '2021.9.21',
  ],
}

// 5. initialize databases
const dbOpts = { keyEncoding: 'utf-8', valueEncoding: 'json' }
const feed = hypercore(ram) // temporary database, but persist by passing a folderpath instead of `ram`
const db = new Hyperbee(feed, dbOpts)
await db.ready()
const DB = { // @NOTE: the exact databases in use can be adapted by the implementer
  db,
  // to store intermediate results while scraping and crawling and executing tasks:
  tasksDB      : db.sub('tasks', dbOpts),
  cache_git    : db.sub('cache_git', dbOpts),
  cache_npm    : db.sub('cache_npm', dbOpts),
  projects     : db.sub('projects', dbOpts),
  valuenetwork : db.sub('valuenetwork', dbOpts),
}
const target_dir = (await db.get('target_dir'))?.value
if (target_dir && target_dir !== target) throw new Error('work in progress is outdated') // a new day needs a fresh run that wipes the database first
if (!target_dir) await db.put('target_dir', target) // remember which run this work in progress belongs to

// 6. initialize puppeteer
const opts = { headless: true }
const browser = await puppeteer.launch(opts)

// 7. define scraping tasks
const task_handlers = { // @NOTE: the exact task types can be adapted by the implementer
  'blessed': async (task, api) => {
    const { add, get, db } = api
    const { data: url } = task
    // use `browser` and `fetch` to:
    // @TODO: scrape package.json for `url`
    // @TODO: scrape all dependencies for `url`
    // @TODO: scrape all dependents for `url`
  },
  'customer': async (task, api) => {
    const { add, get, db } = api // add more tasks
    const { data } = task
    const url = data
    // @TODO: get all dependents for `url`
    await new Promise(ok => setTimeout(ok, 1000))
    return result // task is considered done once handler returns
  },
  'supplier': async (task, api) => {
    const { add, get, db } = api // add more tasks
    const { data } = task
    // @TODO: scrape package.json for `url`
    // @TODO: get all dependencies for `url`
    await new Promise(ok => setTimeout(ok, 1000))
    // task is considered done once handler returns task result or throws an error
    if (Math.random() > 0.5) return result
    else throw new Error('failed to execute task')
  },
  'fetch-package': async (task, api) => {
    const { add, get, db } = api // add more tasks
    const { data } = task
    // @TODO: scrape package.json for `url`
    // @TODO: get all dependencies for `url`
    await new Promise(ok => setTimeout(ok, 1000))
    // task is considered done once handler returns task result or throws an error
    if (Math.random() > 0.5) return result
    else throw new Error('failed to execute task')
  },
  'scrape-dependents': async (task, api) => {
    const { add, get, db } = api // add more tasks
    const { data } = task
    // ...
    await new Promise(ok => setTimeout(ok, 1000))
    // task is considered done once handler returns task result or throws an error
    if (Math.random() > 0.5) return result
    else throw new Error('failed to execute task')
  },
  'scrape-dependencies': async (task, api) => {
    const { add, get, db } = api
    const { data } = task
    // ...
    await new Promise(ok => setTimeout(ok, 1000))
    // task is considered done once handler returns task result or throws an error
    if (Math.random() > 0.5) return result
    else throw new Error('failed to execute task')
  },
  'store-in-files': async (task, api) => { // maybe at the end
    const { add, get, db } = api // add more tasks
    const { data } = task
     // ....
    await new Promise(ok => setTimeout(ok, 1000))
    // task is considered done once handler returns task result or throws an error
    if (Math.random() > 0.5) return result
    else throw new Error('failed to execute task')
  }
}
const executor_opts = {
  on: task_handlers,
  retry (task, error) {
    return Math.random() > 0.5 ? task : error
    // return task => retry task execution
    // return error => abort all task execution and fail with error
  },
  MAX_WIP: 5, // max work in progress tasks
  TIMEOUT: 5000 // max time per task
}

// a task is just some sort of json object with a type field and a data field
// to describe what should happen and then we store it in the task database and execute it
// when the result comes in, we store the result in the database and delete the task from the task database.
// ...now the process of executing the tasks itself (e.g. scraping a module from npm/github)
// might have found additional dependents or dependencies and thus created new tasks,
// which are then executed and so forth...
// ...until no more tasks are created and all tasks are done/removed from the task database

const tasks = [{ type: 'blessed', data: 'https://github.com/hypercore-protocol/hypercore' }]
await task_executor(tasks, executor_opts)

// if task_executor finishes without error, all scraped data should be in the database
// and can be taken from there to save it in json files
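
The `./task-executor` module is only required above, not implemented. As an orientation, here is a minimal in-memory sketch that matches the interface used in the prototype (handlers passed via `on`, a `retry` callback, `MAX_WIP`, `TIMEOUT`, and an `api` with `add`/`get`/`db`). The `results` bookkeeping, the `task.id` field and the optional `opts.db` pass-through are assumptions of this sketch; a real implementation would persist the queue to `DB.tasksDB` so a crashed run can resume.

// task-executor.js - minimal in-memory sketch (a real version would persist tasks to the task database)
module.exports = async function task_executor (initial_tasks, opts) {
  const { on: handlers, retry, MAX_WIP = 5, TIMEOUT = 5000, db = null } = opts
  const queue = [...initial_tasks] // pending tasks (would live in the task database)
  const results = {}               // task id -> handler result
  const wip = new Set()            // promises of currently executing tasks
  let abort_error = null
  let id_counter = 0
  const api = {
    add: task => { queue.push(task) }, // handlers can enqueue follow-up tasks
    get: id => results[id],            // handlers can look up earlier results
    db                                 // a real version would pass the hyperbee sub-databases here
  }
  while ((queue.length || wip.size) && !abort_error) {
    while (queue.length && wip.size < MAX_WIP) { // start tasks up to the work-in-progress limit
      const task = queue.shift()
      task.id = task.id || ++id_counter
      const running = execute(task).finally(() => wip.delete(running))
      wip.add(running)
    }
    if (wip.size) await Promise.race(wip) // wait until at least one in-flight task settles
  }
  await Promise.all(wip) // drain whatever is still running
  if (abort_error) throw abort_error
  return results

  async function execute (task) {
    let timer
    const timeout = new Promise((_, fail) => {
      timer = setTimeout(() => fail(new Error(`task "${task.type}" timed out`)), TIMEOUT)
    })
    try {
      const handler = handlers[task.type]
      if (!handler) throw new Error(`no handler for task type "${task.type}"`)
      results[task.id] = await Promise.race([handler(task, api), timeout])
    } catch (error) {
      const decision = retry ? retry(task, error) : error
      if (decision === task) queue.push(task) // retry: put the task back into the queue
      else abort_error = error                // abort: stop scheduling and fail the whole run
    } finally {
      clearTimeout(timer)
    }
  }
}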

what to store in the files mentioned in the tasks above

const valuenetwork = {}
const projects = {}
const organisations = {}

function add (url, package, dependencies, dependents) {
  // `url` is a github package json url
  // e.g. https://github.com/hypercore-protocol/hypercore
  // `package` is the content of the package json
  // `dependents` is an array of github repository urls
  const customers = dependents
  // `dependencies` is an array of github repository urls
  const suppliers = dependencies
  const org = url.split('/').slice(0, -1).join('/')
  const project = {
    name: package.name,
    version: package.version,
    description: package.description,
    keywords: package.keywords,
    homepage: package.homepage,
    bugs: package.bugs,
    license: package.license,
    people: [package.author, ...(package.contributors || [])],
    funding: package.funding,
    repository: package.repository,
  }
  // e.g. https://github.com/hypercore-protocol
  valuenetwork[url] = { url, customers, suppliers }
  projects[url] = { url, org, blessed: true || false, project } // blessed true means its in `blessed.json`
  organisations[org] = { url: org, projects: [url] }
}
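
as a rough usage illustration (the urls and package fields below are made up):

// hypothetical call - urls and package fields are made up for illustration
add(
  'https://github.com/hypercore-protocol/hypercore',
  { name: 'hypercore', version: '1.0.0', description: '...', license: 'MIT', author: { name: 'someone' }, contributors: [] },
  ['https://github.com/some-org/a-dependency'], // dependencies -> suppliers
  ['https://github.com/other-org/a-dependent']  // dependents   -> customers
)
// afterwards (roughly):
// valuenetwork['https://github.com/hypercore-protocol/hypercore']
//   => { url, customers: ['https://github.com/other-org/a-dependent'], suppliers: ['https://github.com/some-org/a-dependency'] }
// projects['https://github.com/hypercore-protocol/hypercore']
//   => { url, org: 'https://github.com/hypercore-protocol', blessed: true, project: { name: 'hypercore', ... } }
// organisations['https://github.com/hypercore-protocol']
//   => { url: 'https://github.com/hypercore-protocol', projects: [url] }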

how to query for dependents on github and npm

// ------------------------------------------------------------------------
// HELPER
// ------------------------------------------------------------------------
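// `LOG`, `CACHE_GIT` and `CACHE_NPM` are used below but not defined in this snippet -
// minimal stand-ins could look like this (just an assumption; the prototype above would
// use the hyperbee subs `cache_git`/`cache_npm` instead of plain objects):
const LOG = {
  TODO: (...args) => console.warn('[TODO]', ...args),
  ERROR: (...args) => console.error('[ERROR]', ...args),
}
const CACHE_GIT = {}
const CACHE_NPM = {}
// ------------------------------------------------------------------------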
function get_rawgithub_url (url) {
  LOG.TODO('LATER: support non-"master" branch repos + support non-github urls + non-npm registries')
  const rawgithub_packagejson_url = `https://${path.join('raw.githubusercontent.com/', url.split('/').slice(3).join('/'), 'master/package.json')}`
  return rawgithub_packagejson_url
}
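// e.g. get_rawgithub_url('https://github.com/hypercore-protocol/hypercore')
// => 'https://raw.githubusercontent.com/hypercore-protocol/hypercore/master/package.json'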
// ------------------------------------------------------------------------
function get_github_dependents_page_url (repourl) {
  LOG.TODO('LATER: support non-published-package repository customers')
  console.log('[SCRAPER]: scrape customers for:', repourl)
  // const repourl_repos = `${repourl}/network/dependents?dependent_type=REPOSITORY`
  const github_dependents_page_url = `${repourl}/network/dependents?dependent_type=PACKAGE`
  return github_dependents_page_url
}
// ----------------------------------------------------------------------------
// scrapes "npm" (through unpkg.com) for repos package.json content
// ----------------------------------------------------------------------------
async function get_package_json_from_rawgithub (url) {
  console.log('[SCRAPER]: download package.json with suppliers/dependencies for:', url)
  const package = await fetch(url).then(async response => {
    if (!response.ok) throw new Error(`No 2xx response for: ${url}`)
    else return response.json()
  })
  LOG.TODO`LATER: maybe also include devDependencies and/or other dependencies too`
  const        { name, version, description, author, homepage, keywords = [], license, repository = {}, dependencies = {} } = package
  const meta = { name, version, description, author, homepage, keywords,      license, repository }
  const deps = Object.entries(dependencies)
  console.log({deps}) /* npm names */
  throw new Error(`
    @TODO: convert suppliers into urls
  `)
  return [meta, deps]
}
// ----------------------------------------------------------------------------
// scrape names and urls of dependents from github
// ----------------------------------------------------------------------------
async function scrape_githubsite_for_customers (type, browser, repourl) {
  const repourl_pkgs = get_github_dependents_page_url(repourl) // github dependents page for `repourl`
  console.log(`[SCRAPER:${type}]: open fresh page for "${repourl_pkgs}"`)
  const page = await browser.newPage()
  console.log(`[SCRAPER:${type}]: goto "${repourl_pkgs}"`)
  await page.goto(repourl_pkgs)
  console.log(`[SCRAPER:${type}]: wait for loading "${repourl_pkgs}"`)
  await page.waitForSelector('#dependents')
  await new Promise(r => setTimeout(r, 500))
  console.log(`[SCRAPER:${type}]: execute query on  "${repourl_pkgs}"`)
  const customers = await page.evaluate(query)
  await page.close()
  const pages = await browser.pages()
  const remaining = pages.length - 1
  console.log(`[SCRAPER:${type}]: processing complete for: "${repourl_pkgs}"`)
  console.log(`[SCRAPER:${type}]: currently remaining: ${remaining}`)
  const github = [... new Set(customers.github)]
  const npm = [... new Set(customers.npm)]
  console.log({ github, npm }) /* github urls (+ npm names) */
  throw new Error(`
    @TODO: convert customers into urls
  `)
  return { github, npm }
}
// ----------------------------------------------------------------------------
// scrapes "github" and "npm" dependents from github repo page
// ----------------------------------------------------------------------------
function query () {
  const [el] = document.querySelectorAll('#dependents')
  const dependents = [...el.querySelectorAll(`.Box-row`)]
  var customers = dependents.reduce((_customers, el) => {
    const span = el.children[1]
    const is_github = span.getAttribute('data-repository-hovercards-enabled') === ''
    if (is_github) {
      const anchor = [...el.querySelectorAll('a')].pop()
      const href = new URL(anchor.getAttribute('href'),'https://github.com').href
      _customers.github.push(href)
    } else {
      _customers.npm.push(span.textContent.trim())
    }
    return _customers
  }, { github: [], npm: [] })
  return customers
}

// ----------------------------------------------------------------------------
// ADD SUPPLIERS - recursively fetches dependencies package information from unpkg.com
// ----------------------------------------------------------------------------
var countA = 0
async function addAllSuppliers (git_parenturl, dependencies) {
  var head = countA++
  if (CACHE_GIT[git_parenturl]) return
  console.log('[SCRAPER:SUPPLIERS]: addAllSuppliers dependencies for:', head, git_parenturl)
  try {
    const pkgs = []
    for (var i1 = 0, len1 = dependencies.length; i1 < len1; i1++) {
      const [name, version] = dependencies[i1]
      const npm_packageurl = `https://unpkg.com/${name}@${version || 'latest'}/package.json`
      if (CACHE_NPM[npm_packageurl]) continue // pkgs.push(CACHE_NPM[npm_packageurl].package)
      else {
        pkgs.push(new Promise(async (resolve, reject) => {
          const package = await get_package('SUPPLIERS', npm_packageurl)
          const info = { npm_packageurl, name, version, package  }
          CACHE_NPM[npm_packageurl] = info
          resolve(info)
        }))
      }
    }
    const projects = await Promise.all(pkgs)
    const suppliers = []
    for (var i2 = 0, len2 = projects.length; i2 < len2; i2++) {
      const { npm_packageurl, name, version, package: { dependencies: deps, meta: pkg } } = projects[i2]
      var currenturl = ''
      if (pkg.repository) {
        if (typeof pkg.repository === 'string') currenturl = pkg.repository
        else currenturl = pkg.repository.url
      } else {
        LOG.TODO('@FIXME: if no `repository` defined, set a @todo flag in the data')
        continue
      }
      if (currenturl.endsWith('.git')) currenturl = currenturl.slice(0, -4)
      if (currenturl.startsWith('git+')) currenturl = currenturl.slice(4)
      if (currenturl.startsWith('git://')) currenturl = currenturl.replace('git://', 'https://')
      if (!currenturl) {
        currenturl = `@FIXME:${npm_packageurl}`
        LOG.TODO(`found a repo with no url defined: "${currenturl}"`)
        return LOG.ERROR(`found a repo with no url defined: "${currenturl}"`)
      }
      currenturl = currenturl + `#${version}`
      updatePackage(currenturl, { meta: pkg, blessed: false })
      updateNetwork(currenturl, { customers: git_parenturl })
      updateNetwork(git_parenturl, { suppliers: currenturl })
      suppliers.push(addAllSuppliers(currenturl, deps))
    }
    await Promise.all(suppliers)
    console.log('[SCRAPER:SUPPLIERS]: FINISH addAllSuppliers', head)
  } catch (err) {
    LOG.ERROR(`failed to add suppliers of "${git_parenturl}"`, err)
  }
}
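
`updatePackage` and `updateNetwork` are called above but not shown - a minimal sketch, assuming they simply merge entries into the `projects` and `valuenetwork` maps from the snippet further up:

function updatePackage (url, { meta, blessed }) {
  const previous = projects[url] || { url, org: url.split('/').slice(0, -1).join('/'), blessed: false, project: {} }
  projects[url] = { ...previous, blessed: previous.blessed || blessed, project: { ...previous.project, ...meta } }
}
function updateNetwork (url, { customers, suppliers }) {
  const entry = valuenetwork[url] || { url, customers: [], suppliers: [] }
  if (customers) entry.customers = [...new Set(entry.customers.concat(customers))]
  if (suppliers) entry.suppliers = [...new Set(entry.suppliers.concat(suppliers))]
  valuenetwork[url] = entry
}
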
martinheidegger commented 2 years ago

Having worked on the scraper for a while now, it is pretty stable, extends out to gitlab repositories and is a bit more stable about people attached to a project. You can find the current version here (not published to npm).

Technology stuff:

I fear the output data format doesn't match the specification quite yet: 2022-06-21T201258_912Z.zip

serapath commented 2 years ago

sweet :-) will check in detail later. sounds great. i guess it would be cool if we can run it from time to time and commit the different timestamped versions we get to a public repo, so static pages can pull the data from there and do interesting things with it.

thank you so much :smiley:


ok, checked the code. i guess it is fine. it seems to work differently than what i expected, but there are many ways to write things. to be honest, i was hoping it would work without any API tokens, so anyone could just clone and run the thing if they wanted, without signing up to github and creating an API token, but at least we have it now, that's cool :-)

also, it seems the zip file contains a packages.json and a repos.json, but it is not clear what is or can be in those.

I checked your example output, which seems to include packages.json and repos.json, which to me feel like the same thing.

  1. I saw that the packages.json has per package:
    1. it has a unique npm-based package name with version, i guess
    2. package info fields (including timestamp and size) - did you download and measure for each package?
    3. it has people (usually 1-2 people) - is that representative and does it include all the commit contributors?
    4. it has dependencies
    5. ...but i don't see dependents
  2. And then i saw repos.json
    1. it has a git url (hopefully a unique identifier - would be cool to use it to map to corresponding packages entries)
    2. it has package ...i guess additional information about the package?
      • ideally i was hoping the url or whatever is used as a key in the map (object) would uniquely identify it
      • but package is also an array ... why is that? is that because of multirepo support? does anyone do that?
    3. it has dependencies :+1:
    4. it has dependents :+1: ...by the way, here they are listed as more canonical urls, which i like
    5. it can have contributors ...is that the same as people or different? now we have it twice?

I think the reason i did not include person/people is that it was unclear how to deal with them and it seemed like lots of work. now we have people and contributors and it's not standardized. ideally we would of course have, similar to organisations, a people.json file where we have exactly one entry per person and use their identifier to link them to the projects, similar to how it is done in the above code snippets with the organisations. organisations seem easier, because for now an organisation is just

  1. an organisation url
  2. a list of projects

The code snippet from the task description imagines the following output:

const valuenetwork = {}       // => valuenetwork.json
const projects = {}           // => projects.json
const organisations = {}      // => organisations.json

function add (package_json_url, package_json, dependencies, dependents) {
  const url = package_json_url // e.g. https://github.com/hypercore-protocol/hypercore
  // @INFO: what we are interested in:
  const { name, version, description, author, contributors = [], homepage, keywords = [], license, bugs, funding, repository = {} } = package_json
  const package = { name, version, description, author, contributors, homepage, keywords, license, bugs, funding, repository }
  // `dependents` is an array of github repository urls
  const customers = dependents
  // `dependencies` is an array of github repository urls
  const suppliers = dependencies
  const org = url.split('/').slice(0, -1).join('/')
  const project = {
    name: package.name,
    version: package.version,
    description: package.description,
    keywords: package.keywords,
    homepage: package.homepage,
    bugs: package.bugs,
    license: package.license,
    people: [package.author, ...package.contributors],
    funding: package.funding,
    repository: package.repository,
  }
  // e.g. https://github.com/hypercore-protocol
  valuenetwork[url] = { url, customers, suppliers }
  projects[url] = { url, org, blessed: true || false, project } // blessed true means its in `blessed.json`
  organisations[org] = { url: org, projects: [url] }
}
  1. I was thinking the url would be a unique identifier, but some people might install from npm, some might install from github and some from elsewhere. The package.json also has options for not using the npm name but something else, so maybe this isn't as trivial as i thought, but i hoped we could just roll with some sort of standardized convention to end up with a unique url.

  2. the valuenetwork.json is otherwise just supposed to be an object (a map) that maps each package/repo/module/etc... (we could just always call it a "project"), i.e. the project's unique identifier (e.g. a canonical URL), to an array of dependents (=customers) and dependencies (=suppliers). These are the edges of the graph which forms the "value network"

  3. The projects.json does not have any kind of information about the relationships between projects, so dependencies and dependents are cut out from this; it only includes the specific information (if available) that is shown in the code snippet above, because that seems to be standard stuff we might be interested in. we might change or add to that in the future, for example social media accounts or whatever we think makes sense to have

  4. The organisations.json file: additionally, the repo url usually - almost always - gives away the org name or username the repo belongs to, and that gives us the option to grab some additional information from there

  5. ...even though it is not listed, having a people.json file as well would be amazing, but i just skipped it in the first iteration because i thought it would get quite involved to also scrape the contributors from commits and other things

@martinheidegger if you don't mind, it would be really cool if we could standardize the data format of the output and document it, essentially by giving a good example entry for each of the output files instead of a "type definition". i think that is what we need so that we can then base any frontend we might make on that output and know it won't change.
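
for example, i imagine entries roughly like this (all values below are made up, just to sketch the shape based on the snippet above):

a valuenetwork.json entry:

"https://github.com/example-org/example-project": {
  "url": "https://github.com/example-org/example-project",
  "customers": ["https://github.com/other-org/a-dependent"],
  "suppliers": ["https://github.com/some-org/a-dependency"]
}

a projects.json entry:

"https://github.com/example-org/example-project": {
  "url": "https://github.com/example-org/example-project",
  "org": "https://github.com/example-org",
  "blessed": true,
  "project": {
    "name": "example-project",
    "version": "1.0.0",
    "description": "an example project",
    "keywords": ["example"],
    "homepage": "https://example.org",
    "bugs": "https://github.com/example-org/example-project/issues",
    "license": "MIT",
    "people": [{ "name": "example person" }],
    "funding": null,
    "repository": { "type": "git", "url": "https://github.com/example-org/example-project" }
  }
}

an organisations.json entry:

"https://github.com/example-org": {
  "url": "https://github.com/example-org",
  "projects": ["https://github.com/example-org/example-project"]
}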

serapath commented 2 years ago

@martinheidegger do you think you can refine the scraper/crawler soon based on the comment above? It is urgently needed :-)

Also, could you update the blessed.json before you run the scraper for the first real data set? Below is the original blessed.json i started out with when i was playing with the scraper. We can then make an agenda item for the consortium to check what the blessed file should include in the future.

I know there are a bunch more important modules, like hyperbee, hyperdrive or autobase, but they are all dependents of hypercore anyway right now, so they will be included

[
  { "npm": "hypercore", "version": "*" },
  { "repoURL": "git+https://github.com/hypercore-protocol/hypercore-next" },
  { "npm": "@hyperswarm/dht", "version": "*" },
  { "npm": "hyperswarm", "version": "*" },
  { "npm": "@hyperswarm/dht-relay", "version": "*" },
  { "npm": "@hyperswarm/secret-stream", "version": "*" },
  { "npm": "hypercore-strong-link", "version": "*" },
  { "npm": "hyperdrive", "version": "*" },
  { "npm": "hyperbee", "version": "*" },
  { "npm": "autobase", "version": "*" }
]

Once that is done, it would be great to run it once to produce the first data set with the fixed output format (people don't need to be included yet this time around) and publish that to a new github repository.

Then we can close the task :-)

martinheidegger commented 2 years ago

After a lot of experimentation and trying to figure out bugs in the data set, I am thoroughly exhausted by this work. Anyways, there is a lot to write about why this data structure is the way it is, but I need some sleep. I will write more about this once I am a bit healthier, but see this zip for the output data:

2022-06-30T171009_063Z.zip

serapath commented 2 years ago

hm, i quickly checked, and i am not entirely sure which fields will be included and which won't in all cases, but i expected to see valuenetwork.json, projects.json and organisations.json

structured in the way shown in the previous comment's code snippet, and to skip people for now - or rather, even if the people are scraped, the output doesn't yet include people.

now if on top of the above we also already have a people.json, i guess that's ok, and if we also have an errors.json, etc., that's fine, but the above files are missing and the content of the current files doesn't look at all like the expected output.

hmmm... that's just a bit confusing

martinheidegger commented 2 years ago

Following our conversation I added documentation to the scraper, cleaned & changed the output data.

https://github.com/dat-ecosystem/dat-garden-rake#dat-garden-rake

Currently there is a github action running with a clear cache that hopefully - once finished - will publish the data through github pages. https://github.com/dat-ecosystem/dat-garden-rake/actions/runs/2597447898

This is the output of a recent, local execution:

2022-07-01T140650_717Z.zip

martinheidegger commented 2 years ago

Finally I managed to get the scraper to complete on github actions. The gh-pages branch contains the latest data (which means it also keeps previous run-results in storage). You will find the published version here: https://dat-ecosystem.org/dat-garden-rake/index.json

martinheidegger commented 2 years ago

With the scraper now running weekly and producing versioned data I am considering my work on this finished. Can we close this issue?

ninabreznik commented 2 years ago

@martinheidegger Thanks for the work on this task. Much appreciated :)