bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Crawling curated list of sites: Outgoing links from Arabic Wikipedia #300

Closed yjernite closed 2 years ago

yjernite commented 2 years ago

We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.

This issue tracks domain names for links that are used as references in Arabic Wikipedia.

We can then compare this mostly-automated approach to the more curated ones to see whether it can complement them.

The steps to follow are:

  1. filter the CommonCrawl (or another archive) for all WARC records with one of the given domain names (see the sketch after this list)
    • filtering all dumps from the last two years
  2. obtain overall metrics and metrics per domain name
    • page counts, content languages, content types, etc.
  3. upload all of the relevant WARC records for each domain name to an HF dataset in the BigScience Catalogue Data Organization
    • minimal filtering of WARC records to include human-readable pages AND pages that reference links to objects we want to download (e.g. PDFs)
    • Extract the HTML tags corresponding to all URLs in the WARC entries
    • optional: post-process the above list to identify outgoing links, extract their domain name, and content type
    • optional: run text extraction
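As a rough sketch of step 1, per-domain records can be looked up through the public CommonCrawl index API and then fetched with HTTP range requests. The snippet below assumes the `requests` and `warcio` libraries; the crawl ID and domain are placeholders, and a real run would loop over all dumps from the last two years and paginate the index responses.

```python
# Minimal sketch: look up one domain in one CommonCrawl index and fetch records.
# CRAWL_ID and the domain passed in __main__ are illustrative placeholders.
import io
import json

import requests
from warcio.archiveiterator import ArchiveIterator

CRAWL_ID = "CC-MAIN-2023-50"   # one dump; repeat over all dumps of interest
INDEX_URL = f"https://index.commoncrawl.org/{CRAWL_ID}-index"
DATA_URL = "https://data.commoncrawl.org/"

def index_records(domain):
    """Yield index entries for all captures of a domain (incl. subdomains)."""
    # Note: the index API paginates; a full run should iterate over pages.
    params = {"url": domain, "matchType": "domain", "output": "json"}
    resp = requests.get(INDEX_URL, params=params, timeout=60)
    resp.raise_for_status()
    for line in resp.text.splitlines():
        yield json.loads(line)

def fetch_warc_record(entry):
    """Fetch a single WARC record via an HTTP range request."""
    offset, length = int(entry["offset"]), int(entry["length"])
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    resp = requests.get(DATA_URL + entry["filename"], headers=headers, timeout=60)
    resp.raise_for_status()
    # The returned bytes are a self-contained gzipped WARC record.
    return next(ArchiveIterator(io.BytesIO(resp.content)))

if __name__ == "__main__":
    for entry in index_records("example.org"):   # placeholder domain
        record = fetch_warc_record(entry)
        print(entry["url"], entry.get("mime"), entry.get("languages"))
```

The index entries already carry the MIME type, status, and (for recent crawls) detected languages, so the step-2 metrics can be aggregated from the index alone before any WARC bytes are downloaded.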

In particular, the list of domain names mentioned in outgoing links can be used to obtain a "depth 1 pseudo-crawl" by running the same process again.
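For that optional outgoing-link step, here is a minimal sketch (assuming `warcio` and `beautifulsoup4`; the WARC file name is a placeholder) that tallies the domains referenced by HTML responses in a downloaded WARC file:

```python
# Sketch: count outgoing-link domains over all HTML responses in a WARC file.
from collections import Counter
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

def outgoing_domains(warc_path):
    """Count the domains of href/src targets in every HTML response record."""
    counts = Counter()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type", "")
            if "text/html" not in ctype:
                continue
            base_url = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for tag in soup.find_all(["a", "link", "img", "script", "iframe"]):
                target = tag.get("href") or tag.get("src")
                if not target:
                    continue
                domain = urlparse(urljoin(base_url, target)).netloc.lower()
                if domain:
                    counts[domain] += 1
    return counts

if __name__ == "__main__":
    for domain, n in outgoing_domains("records.warc.gz").most_common(20):
        print(f"{n:8d}  {domain}")
```

The same tag walk also surfaces links to non-HTML objects (e.g. PDFs), which is what the step-3 filtering needs to keep.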

cc @sebastian-nagel

yjernite commented 2 years ago

Note: if we want to run a similar analysis in other languages, @hadyelsahar provided the following script to obtain the list: https://github.com/scribe-wikimedia/research/blob/master/references/whitelist-creation/create-whitelist-from-dump.py
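As a lighter-weight illustration of what that script produces, the sketch below queries the Arabic Wikipedia MediaWiki API for the external links of a couple of pages and tallies their domains. The page titles are placeholders and API continuation is not handled, so the dump-based script linked above remains the way to build the full list.

```python
# Sketch: tally external-link domains for a few Arabic Wikipedia pages
# via the MediaWiki API (prop=extlinks). Titles are placeholders.
from collections import Counter
from urllib.parse import urlparse

import requests

API_URL = "https://ar.wikipedia.org/w/api.php"

def external_link_domains(titles):
    """Count the domains of external links on the given page titles."""
    params = {
        "action": "query",
        "prop": "extlinks",
        "titles": "|".join(titles),
        "ellimit": "max",
        "format": "json",
        "formatversion": "2",
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    counts = Counter()
    for page in resp.json()["query"]["pages"]:
        for link in page.get("extlinks", []):
            domain = urlparse(link["url"]).netloc.lower()
            if domain:
                counts[domain] += 1
    return counts

if __name__ == "__main__":
    print(external_link_domains(["مصر", "القاهرة"]).most_common(10))
```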

yjernite commented 2 years ago

No bandwidth to analyze the results; we might revisit this one later.