Open Mr0grog opened 6 years ago
We currently have a few FTP directories we monitor, but we don’t actually handle them very well.
When we have snapshots from Wayback, we compare them poorly: https://monitoring.envirodatagov.org/page/a0ba1338-2d04-4ba4-9487-c6ff9be0383b/97490127-9a81-474b-a3dc-29afa943edfd..33a00df8-665b-40e8-806d-9033b57f1588
And from Versionista, we fail to display anything useful at all: https://monitoring.envirodatagov.org/page/a0ba1338-2d04-4ba4-9487-c6ff9be0383b/ae7f6a24-bc8e-4cdd-860a-0eaa354f00ac..33a00df8-665b-40e8-806d-9033b57f1588
The real issues under the hood:
- When we get FTP listings out of Versionista, we wind up storing them as `application/octet-stream`, which means we wind up treating them like binary data later on. We could store them as `text/plain` (see Wayback below) or we could make something more specific. (See also edgi-govdata-archiving/web-monitoring-versionista-scraper#166)
- When we get FTP listings out of Wayback, we wind up storing them as `text/plain`, which at least makes them displayable and diffable, but we don’t diff them in a particularly useful way:
  - In the UI, we are parsing mime types poorly and we read Wayback’s `text/plain` as `text/html`, so we don’t give it the most friendly visualization (edgi-govdata-archiving/web-monitoring-ui#322).
  - We could diff these as plain text, which is moderately useful.
  - But it might be nice to have a fancier diff in this case, like we do for links. I think we’d probably need a new mime-type for this (e.g. `text/x-wm-ftp-directory` or `text/ftp-dir-listing` [this is what Versionista appears to be sending, which is non-standard but also used by some other tools]), though we could possibly also detect that it’s an FTP listing by checking `version.capture_url.startsWith('ftp://')` in the UI.
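To make the detection and "fancier diff" ideas a bit more concrete, here is a rough sketch, not a description of how the UI or API works today. It assumes a version object with a `content_type` field (hypothetical name; only `capture_url` is mentioned above), treats the two candidate media types named above as FTP listings, and assumes the listings look like Unix `ls -l` output, which may not hold for every server. The helper names `baseMediaType`, `isFtpListing`, `parseListing`, and `diffListings` are made up for illustration.

```javascript
// Candidate media types for FTP listings, taken from the suggestions in this issue.
const FTP_LISTING_TYPES = new Set(['text/x-wm-ftp-directory', 'text/ftp-dir-listing']);

// Drop parameters such as "; charset=utf-8" and normalize case, so that
// "text/plain; charset=utf-8" is compared as "text/plain".
function baseMediaType(contentType) {
  return (contentType || '').split(';')[0].trim().toLowerCase();
}

// `content_type` is an assumed field name; `capture_url` comes from the issue text.
function isFtpListing(version) {
  if (FTP_LISTING_TYPES.has(baseMediaType(version.content_type))) return true;
  return (version.capture_url || '').toLowerCase().startsWith('ftp://');
}

// ASSUMPTION: lines look like `ls -l` output, e.g.
//   -rw-r--r-- 1 ftp ftp 1024 Jan 01 2017 readme.txt
// Captures: mode, size, three-token date, and the name (which may contain spaces).
const LS_LINE = /^(\S+)\s+\d+\s+\S+\s+\S+\s+(\d+)\s+(\S+\s+\S+\s+\S+)\s+(.+)$/;

// Parse a listing into a Map keyed by entry name. Unrecognized lines are skipped.
function parseListing(text) {
  const entries = new Map();
  for (const line of text.split(/\r?\n/)) {
    const match = LS_LINE.exec(line.trim());
    if (match) {
      const [, mode, size, modified, name] = match;
      entries.set(name, { mode, size: Number(size), modified });
    }
  }
  return entries;
}

// Compare two listings by entry name rather than line by line.
function diffListings(oldText, newText) {
  const before = parseListing(oldText);
  const after = parseListing(newText);
  const result = { added: [], removed: [], changed: [] };
  for (const [name, entry] of after) {
    const previous = before.get(name);
    if (!previous) {
      result.added.push(name);
    } else if (
      previous.size !== entry.size ||
      previous.modified !== entry.modified ||
      previous.mode !== entry.mode
    ) {
      result.changed.push(name);
    }
  }
  for (const name of before.keys()) {
    if (!after.has(name)) result.removed.push(name);
  }
  return result;
}
```

Keying the diff on entry names means a file whose size or timestamp changed shows up as one "changed" entry instead of an unrelated removed line plus an added line, which is roughly the kind of view we already give for links. A real implementation would need to cope with other listing formats (or fall back to the plain text diff when parsing fails).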