PromyLOPh / crocoite

Web archiving using Google Chrome
https://6xq.net/crocoite/
MIT License

Also download M3U8 content #22

Open eSoares opened 5 years ago

eSoares commented 5 years ago

On some websites that embed videos, the video is loaded using HTTP Live Streaming (HLS).

From the website creator's point of view the process is quite simple: they create an M3U8 playlist that describes the segments of the video and use that playlist as the source of the HTML video element. The playlist is usually generated automatically by the encoder; FFmpeg supports this.

There are some characteristics to take into account, such as whether the M3U8 playlist describes a live stream or VOD, and the different quality levels it offers for adaptive playback.

As a minimum, and by default, downloading only the highest quality should be an acceptable implementation. A better approach would be an option to choose between downloading all qualities, downloading the highest only, or not downloading the media referenced by the M3U8 playlist at all.
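To make the three choices concrete, here is a minimal sketch of picking variant streams from an HLS master playlist by their `BANDWIDTH` attribute. The function names `parse_variants` and `select_variants` and the mode strings are hypothetical, not part of crocoite, and the parser only looks at `#EXT-X-STREAM-INF` entries.

```python
import re
from urllib.parse import urljoin

# Attribute lists look like: BANDWIDTH=800000,CODECS="avc1.4d401f,mp4a.40.2"
_ATTR = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')

def parse_variants(master_text, base_url):
    """Return a list of (bandwidth, absolute_url) for each variant stream."""
    variants = []
    pending = None  # bandwidth of the STREAM-INF tag waiting for its URI line
    for raw in master_text.splitlines():
        line = raw.strip()
        if line.startswith('#EXT-X-STREAM-INF:'):
            attrs = dict(_ATTR.findall(line.split(':', 1)[1]))
            pending = int(attrs.get('BANDWIDTH', 0))
        elif line and not line.startswith('#') and pending is not None:
            variants.append((pending, urljoin(base_url, line)))
            pending = None
    return variants

def select_variants(variants, mode='highest'):
    """mode is one of 'highest' (default), 'all' or 'none' (hypothetical names)."""
    if mode == 'none' or not variants:
        return []
    if mode == 'all':
        return [url for _, url in variants]
    if mode == 'highest':
        return [max(variants, key=lambda v: v[0])[1]]
    raise ValueError(f'unknown mode: {mode}')
```

Each selected URL would then point at a media playlist whose segments still have to be fetched.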

This would add support for all websites that distribute content using HLS; Twitter posts with videos are one example.

eSoares commented 5 years ago

I have been looking at the different implementation possibilities and have some notes that may help someone who wants to implement this. Unfortunately, at the moment I have neither the skills nor the time to implement a good solution in the current project.

Some notes on possible implementations:

  1. One possible approach is to implement a behaviour, similar to ExtractLinks, that extracts M3U8 media segments. The JS in the browser could use something like [this parser](https://github.com/globocom/m3u8).

  2. An alternative approach is to hook into the processing of content downloaded by the browser: if the content is an M3U8 playlist, parse it and download the media it references.

  3. Add an extra step/tool that reads WARC files and, for each M3U8 playlist found there, downloads the media content and appends it to the WARC (or generates a new WARC with the same requests as the original WARC + the media content).

In my opinion, solution 1 is the cleanest. Solution 3 is the dirtiest, since it downloads related content at two different points in time.
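For approach 2, a rough sketch of the check that could run on each response. `looks_like_playlist` and `referenced_urls` are hypothetical helpers, not existing crocoite functions; the MIME type list is an assumption based on commonly seen values, and URIs inside tag attributes (for example `#EXT-X-KEY`) are ignored here.

```python
from urllib.parse import urljoin, urlsplit

# Content types commonly used for HLS playlists (assumed list, not exhaustive)
HLS_TYPES = {
    'application/vnd.apple.mpegurl',
    'application/x-mpegurl',
    'audio/mpegurl',
    'audio/x-mpegurl',
}

def looks_like_playlist(url, content_type, body):
    """Guess whether a response is an M3U8 playlist from type, extension and body."""
    mime = (content_type or '').split(';', 1)[0].strip().lower()
    by_type = mime in HLS_TYPES
    by_ext = urlsplit(url).path.lower().endswith(('.m3u8', '.m3u'))
    # Every playlist must start with the #EXTM3U tag
    return (by_type or by_ext) and body.lstrip().startswith('#EXTM3U')

def referenced_urls(url, body):
    """Resolve every URI line (segments or variant playlists) against the playlist URL."""
    urls = []
    for raw in body.splitlines():
        line = raw.strip()
        if line and not line.startswith('#'):
            urls.append(urljoin(url, line))
    return urls
```

The returned URLs could then be queued for download like any other extracted link.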

PromyLOPh commented 5 years ago

I wouldn’t consider option 3 “dirty”. In fact, it’s pretty clean and you can easily add a conversion record to the WARC containing the full video downloaded by, say “youtube-dl”, and referencing the original M3U8. Another option would be to click all play buttons for <video> and <audio> tags, wait until every one of those finishes playing, limit the network speed, rinse and repeat.
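As an illustration of the conversion record idea, here is a hand-rolled sketch of serializing an uncompressed WARC/1.0 `conversion` record whose `WARC-Refers-To` points at the record of the original M3U8. The function name and its parameters are made up for this example; in practice a WARC library would be the more robust choice.

```python
import uuid
from datetime import datetime, timezone

def conversion_record(target_uri, refers_to_id, payload, content_type='video/mp4'):
    """Serialize one uncompressed WARC conversion record as bytes.

    refers_to_id is the WARC-Record-ID of the original playlist record,
    including the angle brackets. payload is the converted media as bytes.
    """
    headers = [
        ('WARC-Type', 'conversion'),
        ('WARC-Record-ID', f'<urn:uuid:{uuid.uuid4()}>'),
        ('WARC-Date', datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')),
        ('WARC-Target-URI', target_uri),
        ('WARC-Refers-To', refers_to_id),
        ('Content-Type', content_type),
        ('Content-Length', str(len(payload))),
    ]
    head = 'WARC/1.0\r\n' + ''.join(f'{k}: {v}\r\n' for k, v in headers) + '\r\n'
    # A record is: header block, blank line, content block, two CRLFs
    return head.encode('utf-8') + payload + b'\r\n\r\n'
```

Appending the returned bytes to an uncompressed WARC file would add the record; a gzip-compressed WARC would need the record compressed as its own gzip member first.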

eSoares commented 5 years ago

Option 3 can easily be implemented using FFmpeg, which handles both downloading the individual segments and producing the complete media file. The only missing piece is appending that content to the WARC file.
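A minimal sketch of the FFmpeg part, assuming `ffmpeg` is on the `PATH`. The wrapper functions and the `max_seconds` parameter are hypothetical; `-c copy` remuxes the segments without re-encoding, and `-t` caps the output duration.

```python
import subprocess

def ffmpeg_hls_command(playlist_url, output_path, max_seconds=None):
    """Build the ffmpeg argument list for fetching an HLS stream into one file."""
    cmd = ['ffmpeg', '-nostdin', '-i', playlist_url, '-c', 'copy']
    if max_seconds is not None:
        # Stop writing output after this many seconds of media
        cmd += ['-t', str(max_seconds)]
    cmd.append(output_path)
    return cmd

def download_hls(playlist_url, output_path, max_seconds=None):
    """Run ffmpeg and raise CalledProcessError if it exits with a non-zero status."""
    subprocess.run(ffmpeg_hls_command(playlist_url, output_path, max_seconds), check=True)
```

For example, `download_hls('https://example.com/v/index.m3u8', 'video.mp4', max_seconds=3600)` would stop after one hour of media; the URL is a placeholder.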

The last option, "clicking all play buttons", is the closest to normal web browsing. But it is also the slowest, since the browser downloads media on demand, so the crawl would take as long as the longest media on the page. On the upside, it would be generic across all media types.

Of course, all options need to be careful about encountering live streams that never end.
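One possible guard, sketched under the assumption that the media playlist text is already available: a playlist that carries `#EXT-X-ENDLIST` or is declared `#EXT-X-PLAYLIST-TYPE:VOD` will not grow, so anything else is treated as potentially endless. `is_finite` is a hypothetical helper name.

```python
def is_finite(media_playlist_text):
    """Return True if the media playlist is marked as complete (not a live stream)."""
    tags = {line.strip() for line in media_playlist_text.splitlines()}
    return '#EXT-X-ENDLIST' in tags or '#EXT-X-PLAYLIST-TYPE:VOD' in tags
```

When this returns False, a downloader would need some other limit, such as a maximum duration or size.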