ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Stop adding `data:` and `mailto:` URIs to the database #483

Open JustAnotherArchivist opened 5 months ago

JustAnotherArchivist commented 5 months ago

As of wpull 2.0.3, `data:` and `mailto:` URIs get added to the database, although neither serves any purpose there. Not only are these schemes unsupported, there is also nothing to retrieve for them anyway. `tel:` URIs (currently entirely unsupported and treated as relative paths instead) should likely be handled the same way.
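To make the request concrete, here is a minimal sketch of the kind of check that could run before a discovered URI is queued. It is not wpull's actual code: `SKIPPED_PREFIXES` and `should_queue` are hypothetical names, and the prefix list is just the three schemes mentioned in this report.

```python
# Hypothetical helper, not part of wpull's API.
# Assumption: only the three schemes named in this report are skipped.
SKIPPED_PREFIXES = ("data:", "mailto:", "tel:")


def should_queue(url: str) -> bool:
    """Return False for URIs that have nothing to retrieve."""
    # Scheme names are case-insensitive, so compare in lower case.
    return not url.lstrip().lower().startswith(SKIPPED_PREFIXES)


if __name__ == "__main__":
    for candidate in ("http://example.com/", "data:text/plain,hi", "tel:5550100"):
        print(candidate, should_queue(candidate))
```

Checking the raw string prefix means a `tel:` link is rejected before it can be resolved as a relative path.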

As an extreme example of the real-world impact: an ArchiveBot job's database grew to 106 GB over the past couple of days due to `data:` URIs embedded in every page. After purging these URIs with the following commands (likely not the most efficient approach),

sqlite3 wpull.db 'SELECT id FROM url_strings WHERE url LIKE "data:%"' | sed 's,^.*$,UPDATE url_strings SET url = "data:<removed-&>" WHERE id = &\;,' >cmds
sqlite3 wpull.db <cmds
sqlite3 wpull.db VACUUM

the database size dropped to 860 MB.
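For anyone needing the same cleanup, the purge can probably be done in one statement instead of generating one `UPDATE` per row. The sketch below assumes only what the commands above show: a `url_strings` table with `id` and `url` columns. `purge_data_uris` is a hypothetical name, and this has not been tested against a real wpull database, so try it on a copy first.

```python
import sqlite3


def purge_data_uris(conn: sqlite3.Connection) -> int:
    """Replace every data: URI in url_strings with a short placeholder.

    Returns the number of rows rewritten.
    """
    # Same placeholder as the sed-generated statements: data:<removed-ID>.
    # Embedding the row id keeps each replaced URL distinct.
    cur = conn.execute(
        "UPDATE url_strings "
        "SET url = 'data:<removed-' || id || '>' "
        "WHERE url LIKE 'data:%'"
    )
    changed = cur.rowcount
    conn.commit()
    # VACUUM cannot run inside a transaction, so it comes after the commit.
    conn.execute("VACUUM")
    return changed


if __name__ == "__main__":
    connection = sqlite3.connect("wpull.db")
    print(purge_data_uris(connection), "rows rewritten")
    connection.close()
```

As with the shell version, the space is only returned to the filesystem by the final `VACUUM`.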