emilsvennesson / script.module.inputstreamhelper

A simple Kodi module that makes life easier for add-on developers relying on InputStream based add-ons and DRM playback.
MIT License

Seeking remote file with HTTP Range header #479

Open floyd-fuh opened 2 years ago

floyd-fuh commented 2 years ago

Hi there,

Awesome project! I was nerd-sniped when my disk didn't have enough space to download the ChromeOS image.

I was wondering if we could work around the necessity of downloading the ChromeOS image to disk or to memory and rather only download the parts we need in each "read".

Technical feasibility

As I see it, the main mechanism used in inputstreamhelper is feeding a file-like object into ZipFile:

https://github.com/emilsvennesson/script.module.inputstreamhelper/blob/b21b228c22309ea62ec90627c983fa42ce7c7d4d/lib/inputstreamhelper/widevine/arm_chromeos.py#L322

ZipFile will only read some metadata at this point, such as the end-of-central-directory record, using seek/tell/read:

https://github.com/python/cpython/blob/ffa505b580464d9d90c29e69bd4db8c52275280a/Lib/zipfile.py#L1343

You then call the open() function on the ZipFile object, which returns an object of type ZipExtFile. Again, ZipExtFile only reads some metadata from the file at this point.

Your code then calls seek/read/close etc. on the ZipExtFile object; ZipExtFile does the zip-specific work, but in turn it only calls seek/tell/read on the originally supplied file-like object when asked.

Summarised: Any file-like object should work with the current ZipFile approach.
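
In other words, ZipFile never needs a real file on disk. A minimal sketch (my own, not inputstreamhelper code) with io.BytesIO standing in for any custom seekable file-like object:

from io import BytesIO
from zipfile import ZipFile

# BytesIO stands in for any object implementing read/seek/tell.
buf = BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('hello.txt', b'hello')

with ZipFile(buf) as zf:                  # only metadata is read here
    with zf.open('hello.txt') as member:  # ZipExtFile, reads on demand
        print(member.read())              # b'hello'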

Proposal

Create a file-like HttpFile class that implements seek/tell/read etc. and uses the HTTP Range feature to fetch only the parts of the zip file we need from the Google servers. The class will need to be clever about the chunks it caches (e.g. always keep a 100MB chunk in memory), so that not every read() call results in an HTTP request to the Google servers. Instead of downloading the ChromeOS image to disk, pass the HttpFile object into ZipFile.

I just checked, and the Google servers from which the ChromeOS images are downloaded do support HTTP Range.
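
For reference, this is roughly how I checked; a quick Python 3 sketch using the same recovery-image URL as in the script further below:

from urllib.request import Request, urlopen

URL = ('https://dl.google.com/dl/edgedl/chromeos/recovery/'
       'chromeos_14324.62.0_bob_recovery_stable-channel_mp.bin.zip')

# Ask for the first 100 bytes only; a server that supports Range
# answers 206 Partial Content and includes a Content-Range header.
resp = urlopen(Request(URL, headers={'Range': 'bytes=0-99'}))
print(resp.getcode())                     # 206 if Range is supported
print(resp.headers.get('Content-Range'))  # e.g. 'bytes 0-99/<filesize>'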

Obviously this would need some testing (e.g. with a proxy, to see how many HTTP requests go out and what a good cache chunk size is).

Pros/Cons

Pros:

Cons:

Alternatively, it would also be possible to use this approach only when less than the necessary disk space is available.
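
Checking the available disk space for that decision is straightforward; a sketch (the helper name is mine) using os.statvfs, which is available on the Linux systems this targets:

import os

def free_disk_bytes(path='.'):
    """Bytes available to the current user on the filesystem containing path."""
    stat = os.statvfs(path)
    return stat.f_bavail * stat.f_frsize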

Was something like this approached before? What do you think?

floyd-fuh commented 2 years ago

So I just went ahead and tried it. The script below extracts libwidevinecdm.so directly in memory using HTTP Range requests.

It allows trading off two metrics: the number of HTTP requests to Google and memory consumption. With minor modifications it would also allow caching on disk instead of in memory.

Running the code with different cache sizes gives the following results (all with a 0MB free disk space requirement, except of course for the final libwidevinecdm.so). Peak memory usage was measured with fil-profile (https://pythonspeed.com/fil/docs/fil/trying.html):

Cache size 3MB: 365 HTTP Range requests to Google in 26 seconds, 86MB peak memory usage
Cache size 50MB: 45 HTTP Range requests to Google in 20 seconds, 181MB peak memory usage
Cache size 100MB: 22 HTTP Range requests to Google in 19 seconds, 333MB peak memory usage
Cache size 200MB: 11 HTTP Range requests to Google in 19 seconds, 624MB peak memory usage
Cache size 300MB: 8 HTTP Range requests to Google in 19 seconds, 924MB peak memory usage

I have very fast Internet though.

I'm not entirely sure that making more HTTP requests to Google is really an issue, because the TCP response is just as large when requesting one big file; the overhead of the HTTP requests themselves is negligible compared to the download size. Additionally, with TLS session resumption (which I hoped was used; I just checked, and it isn't, thank you Python) there are enough optimizations to make this efficient. I guess implementing TLS session resumption would be a good idea, but it is not absolutely necessary.
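
As an aside, simply keeping one connection open would avoid repeated TLS handshakes entirely; a Python 3 sketch with http.client (the Range values here are arbitrary):

import http.client

PATH = ('/dl/edgedl/chromeos/recovery/'
        'chromeos_14324.62.0_bob_recovery_stable-channel_mp.bin.zip')

# One persistent HTTPS connection: the TLS handshake happens once and all
# Range requests reuse the same socket (as long as the server keeps it alive).
conn = http.client.HTTPSConnection('dl.google.com', timeout=40)
for start, end in ((0, 99), (1000, 1099)):
    conn.request('GET', PATH, headers={'Range': 'bytes={}-{}'.format(start, end)})
    resp = conn.getresponse()
    chunk = resp.read()  # the response must be drained before the next request
    print(resp.status, len(chunk))
conn.close()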

I would say this is at least worth a try for users who don't have 1GB of disk space left, but you could also consider it for all users. I think it would really be worth it, because we search for an 8MB file in a 1GB remote zip. I wonder if we could change the ext2 parsing code to further optimize the reads on the file (and therefore the HTTP requests). It currently looks like the file is read twice, so the optimal approach would be to additionally cache the chunks on disk (if there is space; if there is no space, just proceed with downloading chunks via Range); a sketch of such a cache follows below.
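
For the disk-backed variant, something like the following could work (a sketch of mine, assuming chunk offsets are aligned to multiples of the chunk size): persist fetched ranges in a sparse file so a second pass over the image re-reads from disk instead of re-downloading.

class DiskChunkCache:
    """Sketch of an on-disk chunk cache backed by a sparse file."""

    def __init__(self, path, chunk_size):
        self.path = path
        self.chunk_size = chunk_size
        self.have = set()         # indices of chunks already stored
        open(path, 'ab').close()  # make sure the backing file exists

    def put(self, index, data):
        with open(self.path, 'r+b') as backing:
            backing.seek(index * self.chunk_size)  # seeking past EOF leaves a hole
            backing.write(data)
        self.have.add(index)

    def get(self, index):
        if index not in self.have:
            return None  # caller falls back to an HTTP Range request
        with open(self.path, 'rb') as backing:
            backing.seek(index * self.chunk_size)
            return backing.read(self.chunk_size)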

Note that the main point of the script is the HTTPFile class; the rest is more or less glue code I borrowed from your project to demonstrate how it works (I changed a couple of things so it runs standalone, because I don't have a proper InputStreamHelper dev environment):

from __future__ import absolute_import, division, unicode_literals
import os
from struct import calcsize, unpack
from zipfile import ZipFile
from io import UnsupportedOperation
import ssl

#ctx = ssl.create_default_context()
#ctx.check_hostname = False
#ctx.verify_mode = ssl.CERT_NONE

try:  # Python 3
    from urllib.error import HTTPError, URLError
    from urllib.request import Request, urlopen
except ImportError:  # Python 2
    from urllib2 import HTTPError, Request, URLError, urlopen

def http_file_size(url):
    """Determine the remote file size via an HTTP HEAD request"""
    req = Request(url)
    req.get_method = lambda: 'HEAD'
    #req.set_proxy("localhost:8080", 'https')
    try:
        resp = urlopen(req)#, context=ctx)
        return int(resp.info().get('Content-Length'))
    except (HTTPError, URLError):
        # constructing HTTPError by hand needs five arguments, so raise IOError instead
        raise IOError('Could not determine Content-Length of ' + url)

def http_range(url, from_range, to_range, time_out=40):
    """Fetch an inclusive byte range of url via an HTTP Range request"""
    headers = {'Range': 'bytes={}-{}'.format(from_range, to_range)}
    request = Request(url, headers=headers)
    #request.set_proxy("localhost:8080", 'https')
    try:
        # urlopen already raises HTTPError for 4xx/5xx responses
        req = urlopen(request, timeout=time_out)#, context=ctx)
    except (HTTPError, URLError) as err:
        print("Error occurred:")
        print(err)
        raise  # no point continuing without the data
    chunk = req.read()
    req.close()
    return chunk

class HTTPFile:
    def __init__(self, url, cache_size):
        self.url = url
        self.position = 0
        self.filesize = http_file_size(url)
        self.cache_size = cache_size
        self.cache_start = 0
        self.cache_end = 0
        self.cache = b''
        self.debug_number_of_requests = 0
        print("New HTTPFile created with filesize {} and URL {}".format(self.filesize, url))

    def seekable(self):
        return True

    def seek(self, pos, from_what=0):
        """io-style seek: whence 0=absolute, 1=relative to current, 2=relative to end"""
        if from_what == 1:
            pos = self.position + pos
        elif from_what == 2:
            pos = self.filesize + pos
        self.position = max(0, min(pos, self.filesize))
        print("Seek to {}".format(self.position))
        return self.position  # io.IOBase.seek returns the new absolute position

    def tell(self):
        return self.position

    def close(self):
        pass

    def read(self, size=-1):
        if size <= -1:
            # approximate "read everything" with 1MB; ZipFile only issues
            # unbounded reads near the end of the file, so this suffices here
            size = 1024 * 1024
            print("Full read requested")
        elif size == 0:
            print("Zero byte read requested!")
            return b''
        print("Reading at position {} exactly {} Bytes".format(self.position, size))
        end = min(self.position + size, self.filesize)
        if self.cache_start <= self.position and end <= self.cache_end:
            print("Answering from cache from index {} to index {}".format(self.position, end))
            start_offset = self.position - self.cache_start
            end_offset = end - self.cache_start
            val = self.cache[start_offset:end_offset]
        else:
            before_caching = 1024  # keep 1KB before the read position, because zip files are parsed backwards from the end
            cache_start = self.position - before_caching
            if cache_start < 0:
                cache_start = 0
                before_caching = self.position
            cache_end = cache_start + max(self.cache_size, size + before_caching)
            self.debug_number_of_requests += 1
            print("+ {} + Requesting HTTP range {} {}".format(self.debug_number_of_requests, cache_start, cache_end))
            chunk = http_range(self.url, cache_start, cache_end)
            self.cache_start = cache_start
            self.cache_end = cache_start + len(chunk)  # may be shorter than requested near the end of the file
            self.cache = chunk
            val = chunk[before_caching:before_caching + size]
        self.position = end
        print("Returning length {} and value {}".format(len(val), repr(val[:50])))
        return val

class ChromeOSImage:
    """
    The main class handling a Chrome OS image
    Information related to ext2 is sourced from here: https://www.nongnu.org/ext2-doc/ext2.html
    """

    def __init__(self, url, cache_size):
        """Prepares the image"""
        self.url = url
        self.cache_size = cache_size
        self.bstream = self.get_bstream(url, cache_size)
        self.part_offset = None
        self.sb_dict = None
        self.blocksize = None
        self.blk_groups = None

    def gpt_header(self):
        """Returns the needed parts of the GPT header, can be easily expanded if necessary"""
        header_fmt = '<8s4sII4x4Q16sQ3I'
        header_size = calcsize(header_fmt)
        lba_size = 512  # assuming LBA size
        self.seek_stream(lba_size)

        # GPT Header entries: signature, revision, header_size, header_crc32, (reserved 4x skipped,) current_lba, backup_lba,
        #                     first_usable_lba, last_usable_lba, disk_guid, start_lba_part_entries, num_part_entries,
        #                     size_part_entry, crc32_part_entries
        _, _, _, _, _, _, _, _, _, start_lba_part_entries, num_part_entries, size_part_entry, _ = unpack(header_fmt, self.read_stream(header_size))

        return (start_lba_part_entries, num_part_entries, size_part_entry)

    def chromeos_offset(self):
        """Calculate the Chrome OS losetup start offset"""
        part_format = '<16s16sQQQ72s'
        entries_start, entries_num, entry_size = self.gpt_header()  # assuming partition table is GPT
        lba_size = 512  # assuming LBA size
        self.seek_stream(entries_start * lba_size)

        if not calcsize(part_format) == entry_size:
            print('Partition table entries are not 128 bytes long')
            return 0

        offset = 0  # fallback if no ROOT-A partition entry is found
        for index in range(1, entries_num + 1):  # pylint: disable=unused-variable
            # Entry: type_guid, unique_guid, first_lba, last_lba, attr_flags, part_name
            _, _, first_lba, _, _, part_name = unpack(part_format, self.read_stream(entry_size))
            part_name = part_name.decode('utf-16').strip('\x00')
            if part_name == 'ROOT-A':  # assuming partition name is ROOT-A
                offset = first_lba * lba_size
                break

        if not offset:
            print('Failed to calculate losetup offset.')
            return 0

        return offset

    def extract_file(self, filename, extract_path):
        """Extracts the file from the image"""
        self.part_offset = self.chromeos_offset()
        self.sb_dict = self.superblock()
        self.blk_groups = self.block_groups()

        bin_filename = filename.encode('ascii')
        chunksize = 4 * 1024**2
        percent8 = 40  # leftover progress counter from the add-on's progress dialog; unused here
        chunk1 = self.read_stream(chunksize)
        while True:
            chunk2 = self.read_stream(chunksize)
            if not chunk2:
                print('File {filename} not found in the ChromeOS image'.format(filename=filename))
                return False

            chunk = chunk1 + chunk2
            if bin_filename in chunk:
                i_index_pos = chunk.index(bin_filename) - 8
                dir_dict = self.dir_entry(chunk[i_index_pos:i_index_pos + len(filename) + 8])
                if dir_dict['inode'] < self.sb_dict['s_inodes_count'] and dir_dict['name_len'] == len(filename):
                    break
            chunk1 = chunk2
            if percent8 < 240:
                percent8 += 1

        blk_group_num = (dir_dict['inode'] - 1) // self.sb_dict['s_inodes_per_group']
        blk_group = self.blk_groups[blk_group_num]
        i_index_in_group = (dir_dict['inode'] - 1) % self.sb_dict['s_inodes_per_group']

        inode_pos = self.part_offset + self.blocksize * blk_group['bg_inode_table'] + self.sb_dict['s_inode_size'] * i_index_in_group
        inode_dict, _ = self.inode_table(inode_pos)

        return self.write_file(inode_dict, os.path.join(extract_path, filename))

    def superblock(self):
        """Get relevant info from the superblock, assert it's an ext2 fs"""
        names = ('s_inodes_count', 's_blocks_count', 's_r_blocks_count', 's_free_blocks_count', 's_free_inodes_count', 's_first_data_block',
                 's_log_block_size', 's_log_frag_size', 's_blocks_per_group', 's_frags_per_group', 's_inodes_per_group', 's_mtime', 's_wtime',
                 's_mnt_count', 's_max_mnt_count', 's_magic', 's_state', 's_errors', 's_minor_rev_level', 's_lastcheck', 's_checkinterval',
                 's_creator_os', 's_rev_level', 's_def_resuid', 's_def_resgid', 's_first_ino', 's_inode_size', 's_block_group_nr',
                 's_feature_compat', 's_feature_incompat', 's_feature_ro_compat', 's_uuid', 's_volume_name', 's_last_mounted',
                 's_algorithm_usage_bitmap', 's_prealloc_block', 's_prealloc_dir_blocks')
        fmt = '<13I6H4I2HI2H3I16s16s64sI2B818x'
        fmt_len = calcsize(fmt)

        self.seek_stream(self.part_offset + 1024)  # superblock starts after 1024 byte
        pack = self.read_stream(fmt_len)
        sb_dict = dict(list(zip(names, unpack(fmt, pack))))

        sb_dict['s_magic'] = hex(sb_dict['s_magic'])
        assert sb_dict['s_magic'] == '0xef53'  # assuming/checking this is an ext2 fs

        block_groups_count1 = sb_dict['s_blocks_count'] / sb_dict['s_blocks_per_group']
        block_groups_count1 = int(block_groups_count1) if float(int(block_groups_count1)) == block_groups_count1 else int(block_groups_count1) + 1
        block_groups_count2 = sb_dict['s_inodes_count'] / sb_dict['s_inodes_per_group']
        block_groups_count2 = int(block_groups_count2) if float(int(block_groups_count2)) == block_groups_count2 else int(block_groups_count2) + 1
        assert block_groups_count1 == block_groups_count2
        sb_dict['block_groups_count'] = block_groups_count1

        self.blocksize = 1024 << sb_dict['s_log_block_size']

        return sb_dict

    def block_group(self):
        """Get info about a block group"""
        names = ('bg_block_bitmap', 'bg_inode_bitmap', 'bg_inode_table', 'bg_free_blocks_count', 'bg_free_inodes_count', 'bg_used_dirs_count', 'bg_pad')
        fmt = '<3I4H12x'
        fmt_len = calcsize(fmt)

        pack = self.read_stream(fmt_len)
        blk = unpack(fmt, pack)

        blk_dict = dict(list(zip(names, blk)))

        return blk_dict

    def block_groups(self):
        """Get info about all block groups"""
        if self.blocksize == 1024:
            self.seek_stream(self.part_offset + 2 * self.blocksize)
        else:
            self.seek_stream(self.part_offset + self.blocksize)

        blk_groups = []
        for i in range(self.sb_dict['block_groups_count']):  # pylint: disable=unused-variable
            blk_group = self.block_group()
            blk_groups.append(blk_group)

        return blk_groups

    def inode_table(self, inode_pos):
        """Reads and returns an inode table and inode size"""
        names = ('i_mode', 'i_uid', 'i_size', 'i_atime', 'i_ctime', 'i_mtime', 'i_dtime', 'i_gid', 'i_links_count', 'i_blocks', 'i_flags',
                 'i_osd1', 'i_block0', 'i_block1', 'i_block2', 'i_block3', 'i_block4', 'i_block5', 'i_block6', 'i_block7', 'i_block8',
                 'i_block9', 'i_block10', 'i_block11', 'i_blocki', 'i_blockii', 'i_blockiii', 'i_generation', 'i_file_acl', 'i_dir_acl', 'i_faddr')
        fmt = '<2Hi4I2H3I15I4I12x'
        fmt_len = calcsize(fmt)
        inode_size = self.sb_dict['s_inode_size']

        self.seek_stream(inode_pos)
        pack = self.read_stream(fmt_len)
        inode = unpack(fmt, pack)

        inode_dict = dict(list(zip(names, inode)))
        inode_dict['i_mode'] = hex(inode_dict['i_mode'])

        blocks = inode_dict['i_size'] / self.blocksize
        inode_dict['blocks'] = int(blocks) if float(int(blocks)) == blocks else int(blocks) + 1

        self.read_stream(inode_size - fmt_len)
        return inode_dict, inode_size

    @staticmethod
    def dir_entry(chunk):
        """Returns the directory entry found in chunk"""
        dir_names = ('inode', 'rec_len', 'name_len', 'file_type', 'name')
        dir_fmt = '<IHBB' + str(len(chunk) - 8) + 's'

        dir_dict = dict(list(zip(dir_names, unpack(dir_fmt, chunk))))

        return dir_dict

    def iblock_ids(self, blk_id, ids_to_read):
        """Reads the block indices/IDs from an indirect block"""
        seek_pos = self.part_offset + self.blocksize * blk_id
        self.seek_stream(seek_pos)
        fmt = '<' + str(int(self.blocksize / 4)) + 'I'
        ids = list(unpack(fmt, self.read_stream(self.blocksize)))
        ids_to_read -= len(ids)

        return ids, ids_to_read

    def iiblock_ids(self, blk_id, ids_to_read):
        """Reads the block indices/IDs from a doubly-indirect block"""
        seek_pos = self.part_offset + self.blocksize * blk_id
        self.seek_stream(seek_pos)
        fmt = '<' + str(int(self.blocksize / 4)) + 'I'
        iids = unpack(fmt, self.read_stream(self.blocksize))

        ids = []
        for iid in iids:
            if ids_to_read <= 0:
                break
            ind_block_ids, ids_to_read = self.iblock_ids(iid, ids_to_read)
            ids += ind_block_ids

        return ids, ids_to_read

    def seek_stream(self, seek_pos):
        """Move position of bstream to seek_pos"""
        try:
            self.bstream[0].seek(seek_pos)
            self.bstream[1] = seek_pos
            return

        except UnsupportedOperation:
            chunksize = 4 * 1024**2

            if seek_pos >= self.bstream[1]:
                while seek_pos - self.bstream[1] > chunksize:
                    self.read_stream(chunksize)
                self.read_stream(seek_pos - self.bstream[1])
                return

            self.bstream[0].close()
            self.bstream[1] = 0
            self.bstream = self.get_bstream(self.url, self.cache_size)

            while seek_pos - self.bstream[1] > chunksize:
                self.read_stream(chunksize)
            self.read_stream(seek_pos - self.bstream[1])

            return

    def read_stream(self, num_of_bytes):
        """Read and return a chunk of the bytestream"""
        self.bstream[1] += num_of_bytes

        return self.bstream[0].read(num_of_bytes)

    def get_block_ids(self, inode_dict):
        """Get all block indices/IDs of an inode"""
        ids_to_read = inode_dict['blocks']
        block_ids = [inode_dict['i_block' + str(i)] for i in range(12)]
        ids_to_read -= 12

        if not inode_dict['i_blocki'] == 0:
            iblocks, ids_to_read = self.iblock_ids(inode_dict['i_blocki'], ids_to_read)
            block_ids += iblocks
        if not inode_dict['i_blockii'] == 0:
            iiblocks, ids_to_read = self.iiblock_ids(inode_dict['i_blockii'], ids_to_read)
            block_ids += iiblocks

        return block_ids[:inode_dict['blocks']]

    def read_file(self, block_ids):
        """Read blocks specified by IDs into a dict"""
        block_dict = {}
        for block_id in block_ids:
            percent = int(35 + 60 * block_ids.index(block_id) / len(block_ids))  # pylint: disable=unused-variable  # leftover progress calculation
            seek_pos = self.part_offset + self.blocksize * block_id
            self.seek_stream(seek_pos)
            block_dict[block_id] = self.read_stream(self.blocksize)

        return block_dict

    @staticmethod
    def write_file_chunk(opened_file, chunk, bytes_to_write):
        """Writes bytes to file in chunks"""
        if len(chunk) > bytes_to_write:
            opened_file.write(chunk[:bytes_to_write])
            return 0

        opened_file.write(chunk)
        return bytes_to_write - len(chunk)

    def write_file(self, inode_dict, filepath):
        """Writes file specified by its inode to filepath"""
        bytes_to_write = inode_dict['i_size']
        block_ids = self.get_block_ids(inode_dict)

        block_ids_sorted = block_ids[:]
        block_ids_sorted.sort()
        block_dict = self.read_file(block_ids_sorted)

        write_dir = os.path.join(os.path.dirname(filepath), '')

        with open(filepath, 'wb') as opened_file:
            for block_id in block_ids:
                bytes_to_write = self.write_file_chunk(opened_file, block_dict[block_id], bytes_to_write)
                if bytes_to_write == 0:
                    return True

        return False

    @staticmethod
    def get_bstream(url, cache_size):
        """Get a bytestream of the image"""
        if url.endswith('.zip'):
            # cut off the '.zip' suffix to get the inner image name
            # (str.strip('.zip') removes characters, not the suffix, and only worked by accident)
            inner_name = os.path.basename(url)[:-len('.zip')]
            bstream = ZipFile(HTTPFile(url, cache_size), 'r').open(inner_name, 'r')  # pylint: disable=consider-using-with
        else:
            bstream = open(url, 'rb')  # pylint: disable=consider-using-with  # non-zip: treat url as a local file path

        return [bstream, 0]

if __name__ == "__main__":
    link = "https://dl.google.com/dl/edgedl/chromeos/recovery/chromeos_14324.62.0_bob_recovery_stable-channel_mp.bin.zip"
    cache_size = 1024*1024*104  # cache size in bytes (~100MB)
    os_image = ChromeOSImage(link, cache_size)
    extracted = os_image.extract_file(filename="libwidevinecdm.so", extract_path=".")
mediaminister commented 2 years ago

Thanks for coming up with this interesting proof of concept!

However, I see some problems with making this the main approach for getting the Widevine CDM on ARM devices in our add-on:

Feel free to come up with a PR implementing this as an option for "expert users". After some more testing, I guess this can be merged. When I find some time, I'll take another look at this.

floyd-fuh commented 2 years ago

Thanks for considering.

It is currently slower than downloading a single 1GB image only because I haven't implemented caching on disk or multi-chunk caching (it currently caches just one chunk). Therefore it downloads large parts of the zip file twice, which is of course not optimal.

I have a different view on it: since the implementation lets us decide what happens (use memory, disk, or more connections), we can make the default behave just the same as now. How about:

  1. If more than 1.5GB of memory is free and available: use 1 HTTP request, store chunks in memory
  2. If more than 1.5GB of disk is available: 1 HTTP request, store chunks on disk
  3. Otherwise: use 50% of available memory as the chunk size and only cache 1 chunk, making as many HTTP requests as necessary

That would probably make things faster than now for people who have enough memory (e.g. a Raspberry Pi 4 with 4 or 8GB RAM). I guess it should be no issue to resort to the next strategy as a fallback if something goes wrong in the chosen approach; a sketch of the selection logic is below.
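
A sketch of that selection logic (free_memory_bytes() and free_disk_bytes() are hypothetical helpers, as discussed in this thread):

REQUIRED_BYTES = 1536 * 1024**2  # ~1.5GB head-room for the ChromeOS image

def pick_strategy():
    if free_memory_bytes() > REQUIRED_BYTES:
        return 'memory'  # 1 HTTP request, cache all chunks in RAM
    if free_disk_bytes() > REQUIRED_BYTES:
        return 'disk'    # 1 HTTP request, cache all chunks on disk
    return 'ranged'      # cache 1 chunk of 50% of free memory, many Range requests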

I have a couple of questions:

  • Can we determine how much memory is free/available to us?
  • What is the easiest way to set up a development environment with inputstreamhelper?
  • Any IDE you are using?

I'm currently still thinking about how I could visualize which chunks of the zip file are necessary at all.

mediaminister commented 2 years ago

Can we determine how much memory is free/available to us?

This is not implemented, but it should be possible using a standard Linux command executed with run_cmd() in utils.py.
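
For illustration, even without run_cmd() this could be read straight from /proc/meminfo, which standard Linux kernels expose; a sketch:

def free_memory_bytes():
    """Available memory in bytes as reported by the Linux kernel."""
    with open('/proc/meminfo') as meminfo:
        for line in meminfo:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1]) * 1024  # /proc/meminfo values are in kB
    return 0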

  • What is the easiest way to set up a development environment with inputstreamhelper?

There is no easy development environment. This repo has CI testing and a Makefile to test the ARM code on a Linux system. There are VM images for LibreELEC that work out of the box in VMware Workstation Player (check for LibreELEC-Generic.x86_64-10.0.1.ova on https://wiki.libreelec.tv/project/mirrors). To execute the ARM code in VMware, you can change the arch() function in utils.py to always return 'arm':

# Disable cache
# arch.cached = sys_arch
# Hardcode arm arch
sys_arch = 'arm'
return sys_arch

Any IDE you are using?

No, I use a text editor and a symlink from a local git repo to a real Kodi installation:

ln -s ~/script.module.inputstreamhelper/ ~/.kodi/addons/

And I enabled debug logging in advancedsettings.xml in ~/.kodi/userdata/:

<advancedsettings>
  <loglevel>1</loglevel>
</advancedsettings>

To speed up testing on a real Kodi installation, you can automatically execute add-on functions on startup (https://kodi.wiki/view/Autoexec_Service). You can auto-execute the scripts from api.py; for example:
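
A minimal service script as a sketch (the add-on boilerplate from the wiki page is assumed, and the RunScript target here is illustrative):

# service.py of a minimal autoexec service add-on
import xbmc

if __name__ == '__main__':
    xbmc.log('autoexec: triggering test run', xbmc.LOGDEBUG)
    # illustrative: invoke the add-on function under test from api.py
    xbmc.executebuiltin('RunScript(script.module.inputstreamhelper)')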