Global deduplication for specific URLs

JustAnotherArchivist commented 4 years ago

While global deduplication for everything in ArchiveBot is not feasible, we should consider adding something for certain URLs that waste a lot of disk space, probably shouldn't be ignored entirely, but are regrabbed needlessly and repeatedly. Two examples come to mind:

CBC radio recordings/podcasts: ignores ^https?://mp3\.cbc\.ca/ and ^https?://podcast-a\.akamaihd\.net/mp3/ (pending further investigation whether the latter also has non-CBC content)
Fast Company videos: ignore ^https?://content\.jwplatform\.com/videos/

Currently, these ignores are typically manually added when someone sees it. I know we've grabbed some of those URLs thousands of times, but others were never covered before. Because the contents on these hosts don't change with time, ignoring them if they've ever been grabbed before by some AB job should be fine. However, job starting URLs should not be checked against the dedupe list so that they can be saved again if needed – specifically, this means that URL table entries with level = 0 would always be retrieved.

An implementation would probably keep the dedupe DB and the list of URL patterns to be checked against it on the control node. The latter is pushed to the pipelines (and updated if it changes), then the pipeline queries the DB on encountering a matching URL. TBD is whether the pipeline should be able to directly add entries to the DB or whether they should come from the CDXs in the AB collection. The latter is more trustworthy (and also covers the unfortunate case when archives are lost between retrieval and IA upload) but adds a delay which can still lead to repeated retrieval. Alternatively, pipelines could add a temporary entry which gets dropped after a few days if it isn't confirmed by the CDXs.

JustAnotherArchivist commented 4 years ago

New York Times videos: ^https?://vp\.nyt\.com/ and ^https?://video1\.nytimes\.com/
Videos hosted on JW Player: ^https?://cdn\.jwplayer\.com/videos/ (Note, these are different from the content.jwplatform.com one above. cdn.jwplayer.com serves various things by customers, content.jwplatform.com seems to only have FastCo videos.)

An alternative solution would be to dedupe based on data type or size, but that would require a new download every time and might slow down some crawls massively. If we go down this road, we should write revisit records for those; wpull already has support for that, it would just have to be activated and the remote calls implemented through a custom URLTable.

JustAnotherArchivist commented 3 years ago

Washington Post videos: ^https?://d21rhj7n383afu\.cloudfront\.net/washpost-production/ and ^https?://videos\.posttv\.com/washpost-production/
USA Today: ^https?://videos\.usatoday\.net/Brightcove2/
AnyClip: ^https?://cdn([1-9]|1\d|20)\.anyclip\.com/.*\.mp4$
NYT, rarer than the above: ^https?://int\.nyt\.com/data/videotape/finished/.*\.mp4$ (e.g. job 2s9go7u96uf6eyhtlpemdpouu)
Wall Street Journal: ^https?://m\.wsj\.net/video/.*\.mp4$
ESPN: ^https?://media\.video-cdn\.espn\.com/.*\.mp4$ and ^https?://media\.video-origin\.espn\.com/.*\.mp4$
IGN: ^https?://assets\d+\.ign\.com/videos/zencoder/.*\.mp4$ and ^https?://s3\.amazonaws\.com/o\.videoarchive\.ign\.com/.*\.mp4$
MLB: ^https?://(cuts\.diamond|mlb-cuts-diamond)\.mlb\.com/FORGE/.*\.mp4$

JustAnotherArchivist commented 3 years ago

Alternative for a proper global dedupe (which would likely require changes in wpull because the URLTable methods aren't async): special ignores that send the URL to a logger. Then we regularly dedupe what the logger receives and run those URLs separately in !ao < jobs, e.g. weekly or monthly (automated).

ArchiveTeam / ArchiveBot

Global deduplication for specific URLs #443