yjernite closed this issue 2 years ago
Note: if we want to run a similar analysis in other languages, @hadyelsahar provided the following script to obtain the list: https://github.com/scribe-wikimedia/research/blob/master/references/whitelist-creation/create-whitelist-from-dump.py
We don't have the bandwidth to analyze the results right now; we might revisit this one later.
We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.
This issue tracks domain names for links that are used as references in Arabic Wikipedia.
We can then compare this mostly-automated approach to the more curated ones to see whether it can complement them.
The steps to follow are:
In particular, the list of domain names mentioned in outgoing links may be used to obtain a "depth 1 pseudo-crawl" by running the same process again.
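As a rough illustration of that "depth 1" step, here is a minimal Python sketch that turns a list of outgoing links into a counted list of new domain names to feed into the next pass. It is not the script linked in this thread. The function names (`domain_of`, `next_seed_domains`), the `www.` stripping, and the example URLs are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host of a URL, or None if it has none."""
    try:
        host = urlsplit(url).hostname
    except ValueError:
        # urlsplit rejects some malformed URLs (e.g. a broken IPv6 literal)
        return None
    if not host:
        return None
    # Assumption: treat "www.example.org" and "example.org" as the same domain
    return host[4:] if host.startswith("www.") else host


def next_seed_domains(outgoing_links, known_domains):
    """Count domains seen in outgoing links that are not already in the seed list."""
    counts = Counter()
    for link in outgoing_links:
        domain = domain_of(link)
        if domain and domain not in known_domains:
            counts[domain] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only
    links = [
        "https://www.example.org/news/1",
        "http://example.org/other",
        "https://example.com/a",
        "not a url",
        "https://ar.wikipedia.org/wiki/X",
    ]
    seeds = {"ar.wikipedia.org"}
    for domain, n in next_seed_domains(links, seeds).most_common():
        print(domain, n)
```

Counting occurrences rather than only collecting a set keeps a simple frequency signal, which could help when comparing this list with the curated ones.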
cc @sebastian-nagel