birros / web-archives

A web archives reader
https://flathub.org/apps/details/com.github.birros.WebArchives
GNU General Public License v3.0

WARC file support? #27

Open anarcat opened 2 years ago

anarcat commented 2 years ago

Hi!

When I found out about this project, its name made me think it was a tool to read WARC files, which stands for... Web ARChives!

Is support for WARC planned? It would be pretty interesting because it would allow the reader to use archive.org extracts, which are typically in the WARC file format. Other web crawlers (e.g. wget, but also web browsers) can also output WARC files...

birros commented 2 years ago

Hi @anarcat, thanks for your interest.

Unfortunately, as mentioned in https://github.com/birros/web-archives/issues/12#issuecomment-596253036, I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve personal data management for this application (and others).

But your request is relevant. I discovered WARC after the ZIM format, and both are useful in different cases. I plan to implement WARC / WACZ support when I restart this project, but don't expect it in the next few months; it won't happen for several years (my private project is really complex).

In the meantime, you can try converting your WARC file to a ZIM file using this tool from the openzim team (I haven't tested it): openzim/warc2zim. Also check out kiwix/kiwix-desktop, from the same team, as an actively maintained ZIM reader.

I will keep this issue open for the future.


You can also open a WARC file with webrecorder/replayweb.page, which is free, self-hostable software that works offline.

I recommend converting the WARC file to WACZ, which adds page indexing, for use with replayweb. Use this tool for that: webrecorder/py-wacz

Example:

$ wget "https://en.wikipedia.org/wiki/Linux" \
    --page-requisites \
    --execute robots=off \
    --no-warc-keep-log \
    --span-hosts \
    --no-warc-compression \
    --delete-after \
    --domains en.wikipedia.org,upload.wikimedia.org \
    --warc-file="wikipedia-linux"
$ wacz create wikipedia-linux.warc \
    --detect-pages \
    --output wikipedia-linux.wacz
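
Because the wget command above uses --no-warc-compression, the resulting wikipedia-linux.warc is a plain, uncompressed file, so it is easy to check what was captured before converting it. The following is only a minimal sketch using the Python standard library: it assumes the usual WARC/1.0 record layout (a version line, header lines, a blank line, then Content-Length bytes of payload) and does not handle folded header lines or gzip-compressed archives. The function names `iter_warc_records` and `summarize_warc` are made up for this illustration; a dedicated library such as warcio is likely a better choice for real work.

```python
from typing import BinaryIO, Dict, Iterator, List, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record of an uncompressed WARC stream.

    Header names are lower-cased so lookups are case-insensitive.
    Assumption: no folded (multi-line) header values.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", errors="replace").partition(":")
            headers[name.strip().lower()] = value.strip()

        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def summarize_warc(path: str) -> List[Tuple[str, str, int]]:
    """Return (record type, target URI or "-", payload size) for each record."""
    with open(path, "rb") as fh:
        return [
            (h.get("warc-type", "?"), h.get("warc-target-uri", "-"), len(body))
            for h, body in iter_warc_records(fh)
        ]
```

Calling `summarize_warc("wikipedia-linux.warc")` should then return one tuple per record, which makes it easy to see whether the page and its requisites were actually fetched before running wacz create.
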

anarcat commented 2 years ago

On 2022-08-23 04:56:07, Julien Muret wrote:

> Unfortunately, as mentioned in this https://github.com/birros/web-archives/issues/12#issuecomment-596253036 I don't want to spend time on this project before I finish another one (private for now, but open in the future) that will improve the personal data management for this application (and others).

That, of course, makes perfect sense. :) Take all the time you need!

> But your request is relevant, I discovered warc after zim format, both are relevant in different cases. I plan to implement warc / wacz support when I restart this project, but don't expect it to happen in the next few months, it won't happen for several years (my private project is really complex).
>
> In the meantime, you can try to convert your warc file to a zim file using this tool from the openzim team (I haven't tested it): openzim/warc2zim. Also check kiwix/kiwix-desktop from the same team as an actively maintained zim reader.

Oh that's really neat, thanks! I was aware of kiwix, but not warc2zim, that makes a lot of sense...

> I will keep this issue opened for the future.

Thanks!

-- People in glass houses shouldn't throw stones. People in glass cities shouldn't fire missiles.