WARC File Comparison - Githubissues

ibnesayeed commented 4 years ago

Compare and contrast the resulting WARC files on the https://odu.edu/compsci URI generated by any two of the following tools:

Neyo-odu commented 4 years ago

WARCreate:

This Web Archiver places all pertinent information at the top of the file in plain text for quick access. It records each link found on the page and places them in the outlink section, then gives you the HTTP request information and the source code. From there you can use the WAIL software to open the .warc file.

webrecorder.io:

This Web Archiver is very simple to use. You input the URL you want to archive and it will open it in its own browser. You have to scroll through the web page (you can also use the auto-pilot tool) and it will gather any information it can. From there your archive is saved and you can go back to it any time. From what I can tell it's very in-depth and grabs just about everything. The web page will look exactly like its real counterpart. The images and media content is saved to their server and the WARC file isn't legible by a text-reader but it contains everything essential. The source code is the same except everything is linked through the webrecorder.io server.

ibnesayeed commented 4 years ago

> The images and media content is saved to their server

Are you suggesting that the embedded resources like images and media are not packaged in the WARC file, but hosted on their server separately and referenced from the WARC file?

> ... and the WARC file isn't legible by a text-reader but it contains everything essential

This is perhaps because they use the .warc extension for files that should actually be .warc.gz. They do it to avoid the automatic extraction done by macOS (I like neither Apple's behavior here nor Webrecorder's misleading workaround). If you append .gz to the end of the downloaded .warc file and then unzip it, you will find it just as legible in a text editor as other WARC files. Please explore it and report your findings back.
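As a rough illustration of the check described above, the following sketch looks at the first two bytes of a file to decide whether it is gzip-compressed, regardless of its extension, and opens it accordingly. The function names `open_warc` and `first_line` are made up for this example, and it assumes the download is either plain WARC or standard gzip.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip member


def open_warc(path):
    """Open a WARC file as an uncompressed binary stream, ignoring its extension."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    # gzip.open reads files made of several concatenated gzip members transparently
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")


def first_line(path):
    """Return the first line of the (decompressed) file as text."""
    with open_warc(path) as stream:
        return stream.readline().decode("utf-8", errors="replace").rstrip()
```

If `first_line` returns a WARC version line such as `WARC/1.0` for a file that looked like binary in a text editor, the file was most likely just compressed.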

> The source code is the same except everything is linked through the webrecorder.io server.

Please elaborate on this, what do you mean by, "everything is linked through the webrecorder.io server"?

Neyo-odu commented 4 years ago

> Are you suggesting that the embedded resources like images and media are not packaged in the WARC file, but hosted on their server separately and referenced from the WARC file?

I couldn't determine whether the information was stored in the file or simply linked to their server. The source code changes the URL as follows:

Source: //fonts.googleapis.com/css?family=Open+Sans:400,600,800,700|Open+Sans+Condensed:300

WARC: //content.webrecorder.io/[USERNAME]/default-collection/20191212043929cs///fonts.googleapis.com/css?family=Open+Sans:400,600,800,700|Open+Sans+Condensed:300_

This could indicate that the content is being pulled from their server, but upon further review it is much more likely that it is stored in the .warc file.

> This perhaps is because they use .warc extension for files that should actually be .warc.gz. They do it to avoid automatic extraction done by MacOS (neither do I like Apple's behavior here nor the misleading workaround of Webrecorder). If you append .gz at the end of the downloaded .warc file and then unzip it, you will find it equally as legible in a text editor as other WARC files. Please explore it and report your findings back.

This did make the file partially legible. There are portions that contain image data and other forms of data that wouldn't be legible in a text reader, though. With this working, I can say with some assurance that the content is actually stored within the .warc file. The data that can be understood is HTTP response information and some WARC metadata.

> Please elaborate on this, what do you mean by, "everything is linked through the webrecorder.io server"?

For example:

<link rel="stylesheet" href="/etc/designs/odu/clientlibs/libs/slick.min.css" type="text/css">
<link rel="stylesheet" href="/etc/designs/odu/clientlibs.min.css" type="text/css">
<script type="text/javascript" src="/etc/designs/odu/clientlibs/libs/slick.min.js"></script>
<script type="text/javascript" src="/etc/designs/odu/clientlibs.min.js"></script>
<link href="/etc/designs/odu.css" rel="stylesheet" type="text/css">

is changed to:

<link rel="stylesheet" href="https://webrecorder.io/[USERNAME]/default-collection/20191212043929cs_/https://odu.edu/etc/designs/odu/clientlibs/libs/slick.min.css" type="text/css">
<link rel="stylesheet" href="https://webrecorder.io/[USERNAME]/default-collection/20191212043929cs_/https://odu.edu/etc/designs/odu/clientlibs.min.css" type="text/css">
<script type="text/javascript" src="https://webrecorder.io/[USERNAME]/default-collection/20191212043929js_/https://odu.edu/etc/designs/odu/clientlibs/libs/slick.min.js"></script>
<script type="text/javascript" src="https://webrecorder.io/[USERNAME]/default-collection/20191212043929js_/https://odu.edu/etc/designs/odu/clientlibs.min.js"></script>
<link href="https://webrecorder.io/[USERNAME]/default-collection/20191212043929cs_/https://odu.edu/etc/designs/odu.css" rel="stylesheet" type="text/css">

So it appears the information is hosted on their server, but I cannot confirm this is how it works.

[USERNAME] is my redacted username.

ibnesayeed commented 4 years ago

> I couldn't surmise if the information was stored in the file or simply linked to their server the source code would change the URL as such:
>
> Source: //fonts.googleapis.com/css?family=Open+Sans:400,600,800,700|Open+Sans+Condensed:300
>
> WARC: //content.webrecorder.io/[USERNAME]/default-collection/20191212043929cs///fonts.googleapis.com/css?family=Open+Sans:400,600,800,700|Open+Sans+Condensed:300_
>
> which could indicate the content is being pulled into the local server but it's much more likely it is stored in the .warc file upon further review.

This is called URL rewriting, and it is necessary for proper replay of archived resources to avoid live leakage (we call these zombies). The rewriting is done on the fly at replay time, not in the WARC itself (if you inspect the WARC file and find otherwise, that would be worth reporting as a bug). If you were to replay the same WARC locally, those references would change accordingly.
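To make the idea concrete, here is a minimal sketch of replay-time rewriting that reproduces the URL pattern quoted earlier in this thread. It is not Webrecorder's actual implementation; `rewrite_url` and its parameters are hypothetical names, and the local prefix in the usage note is only an assumed example.

```python
from urllib.parse import urljoin


def rewrite_url(url, page_url, replay_prefix, timestamp, modifier=""):
    """Point a reference found in an archived page at the replay system.

    url           -- the reference as it appears in the stored page (may be relative)
    page_url      -- the original URL of the page that contains the reference
    replay_prefix -- where the replay system serves the collection (assumed layout)
    timestamp     -- 14-digit capture time, e.g. "20191212043929"
    modifier      -- resource-type hint such as "cs_" or "js_"
    """
    absolute = urljoin(page_url, url)  # resolve relative and scheme-relative references
    return f"{replay_prefix}/{timestamp}{modifier}/{absolute}"
```

Calling `rewrite_url("/etc/designs/odu.css", "https://odu.edu/compsci", "https://webrecorder.io/[USERNAME]/default-collection", "20191212043929", "cs_")` yields the rewritten stylesheet link shown above. Passing a different prefix, for instance a hypothetical `http://localhost:8080/my-collection`, changes every reference while the stored record stays the same.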

> There are portions that contain image / other forms of data that wouldn't be legible in a text reader though.

This should be no different for WARC files created by other tools. If you attended the relevant lecture, I did mention that the payload could be binary, but both HTTP and WARC headers are text-based. Did you not find binary data in the WARC created by the other tool you are comparing it against?

Also, can you summarize the number of WARC records of each type (such as request, response, and metadata) produced by the two tools? This would help estimate which tool is more effective at discovering the most resources. You can use a WARC processing tool (such as warcio) to analyze this.
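For a quick tally without extra dependencies, a sketch like the one below walks an uncompressed WARC stream and counts the `WARC-Type` header of each record. It is a simplified reader written for this comparison (the name `count_record_types` is made up), it assumes well-formed records with a `Content-Length` header, and a dedicated library such as warcio is the better choice for real analysis.

```python
from collections import Counter


def count_record_types(stream):
    """Count WARC records by WARC-Type in an uncompressed binary stream."""
    counts = Counter()
    while True:
        line = stream.readline()
        if not line:
            break  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        warc_type, length = "unknown", 0
        # Read the WARC headers, which end at the first empty line
        for header in iter(stream.readline, b""):
            if header in (b"\r\n", b"\n"):
                break
            name, _, value = header.partition(b":")
            name = name.strip().lower()
            if name == b"warc-type":
                warc_type = value.strip().decode("ascii", "replace")
            elif name == b"content-length":
                length = int(value.strip())
        counts[warc_type] += 1
        stream.read(length)  # skip the record block, which may be binary
    return counts
```

For a compressed file, pass `gzip.open(path, "rb")` as the stream; for a plain one, `open(path, "rb")`. Comparing the resulting counters for the two tools gives the per-type summary asked for above.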

cs531-f19 / discussions

WARC File Comparison #85