cclgroupltd / ccl_chromium_reader

(Sometimes partial) Python re-implementations of the technologies involved in reading various data sources in Chrome-esque applications.
MIT License

Feat: improve speed of metadata collection by caching #19

Closed lxndrblz closed 8 months ago

lxndrblz commented 8 months ago

Hi Alex,

I have noticed that metadata collection for IndexedDB is still very slow, as the database gets iterated up to 5 times.

In this PR I propose a quicker way of doing things, by caching the records of the raw database.

Furthermore, I have removed some redundant code from the _fetch_meta_data function, which now calls the individual functions responsible for fetching the metadata for the global scope, the database, and the individual object stores.
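The caching idea can be sketched roughly as below. Note this is an illustrative sketch only: the class and method names here are invented for the example and are not the actual API of this PR or the library.

```python
from functools import cached_property


class CachedRecordStore:
    """Illustrative sketch of the caching idea: iterate the raw
    database once, then serve every later metadata pass from the
    cached list instead of re-reading the database."""

    def __init__(self, iterate_records):
        # iterate_records is any zero-argument callable that yields
        # raw records (hypothetical stand-in for the real iterator)
        self._iterate_records = iterate_records

    @cached_property
    def records(self):
        # The expensive iteration runs only on first access;
        # subsequent accesses return the same cached list.
        return list(self._iterate_records())
```

The trade-off is memory: the full record list is held in RAM for the lifetime of the object, which is usually fine but worth watching on very large databases.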

Words of Caution:

Let me know what you think.

cclgroupltd commented 8 months ago

@lxndrblz thanks loads for that, I'll test it out, but after a skim-read I can't see anything controversial (although it would be interesting to see what the memory usage looks like on big databases - probably fine).

I noticed you've got a test in there for dbid_no being None in the Database IDs. I'm trying to work out (a) how that would happen (I think it could only happen for an empty record, as that's the only thing in the record value it's built from) and (b) whether it would make more sense to omit them at source rather than later, as you're doing there?

As I say though - I'm not sure how it happens. I would have assumed that it would only happen with deleted records and I'm pretty sure when it's populated it's only getting live metadata records?

Do you have examples of data where it happens?

lxndrblz commented 8 months ago

@cclgroupltd If you want, we can remove the None checks for now and address that in a separate PR.

Tbh, I am not exactly sure how that happens, but I saw it for the first time last weekend when I was trying to process a Teams database from a customer. Unfortunately, for confidentiality reasons I am unable to share a copy of the database.

I have a collection of other DBs that were given to me over time, I will check if the sighting can be reproduced with one of these.

cclgroupltd commented 8 months ago

One thing you could try on that confidential dataset is popping a trace in GlobalMetadata.__init__:

class GlobalMetadata:
    def __init__(self, raw_meta_dict: dict):
        # TODO: more of these meta types if required
        self.backing_store_schema_version = None
        if raw_schema_version := raw_meta_dict.get(b"\x00\x00\x00\x00\x00"):
            self.backing_store_schema_version = le_varint_from_bytes(raw_schema_version)

        self.max_allocated_db_id = None
        if raw_max_db_id := raw_meta_dict.get(b"\x00\x00\x00\x00\x01"):
            self.max_allocated_db_id = le_varint_from_bytes(raw_max_db_id)

        # keys are bytes throughout, matching the startswith check below
        database_ids_raw = (raw_meta_dict[x] for x in raw_meta_dict
                            if x.startswith(b"\x00\x00\x00\x00\xc9"))

        dbids = []
        for dbid_rec in database_ids_raw:
            with io.BytesIO(dbid_rec.key[5:]) as buff:
                origin_length = read_le_varint(buff)
                origin = buff.read(origin_length * 2).decode("utf-16-be")
                db_name_length = read_le_varint(buff)
                db_name = buff.read(db_name_length * 2).decode("utf-16-be")

            db_id_no = le_varint_from_bytes(dbid_rec.value)

            # ADD TRACE HERE e.g.:
            if db_id_no is None:
                print(f"TRACE: {dbid_rec.value.hex()}")

            dbids.append(DatabaseId(db_id_no, origin, db_name))

        self.db_ids = tuple(dbids)

I strongly suspect that will show an empty string (in which case it's still somewhat of a mystery), but if it isn't empty, then that's where it gets interesting...

lxndrblz commented 8 months ago

@cclgroupltd Hope this helps.

TRACE: 89
TRACE: c4
TRACE: 85
TRACE: 81
TRACE: 83
TRACE: 80
TRACE: 82
TRACE: 84
TRACE: 86
TRACE: 87
DatabaseId(dbid_no=None, origin='https_teams.microsoft.com_0@1', name='topics-sdk')

cclgroupltd commented 8 months ago

@lxndrblz thank you so much for that - it's highlighted something that I had overlooked which is that the value is not a "proper varint" (which I've now confirmed by generating my own test data and then referring back to the chromium source).

In this case None was being returned because the varint function ran out of data to read: a proper varint reader would treat every one of the bytes you shared as the start of a multi-byte value, which they clearly aren't.

Real varints are encoded using the msb of each byte to indicate whether the next byte should be read. These values are "variable length" integers only in the sense that they can be of variable length: they are just int64s with any unused high bytes trimmed (they're little endian, so that's safe to do).

"Luckily" the None previously crashing things was protecting from data being "missed" as the module would just crash, but this needs fixing before we do anything else. It's not a complicated fix so I'll get it done and ping you when it's done.

This affects my .NET library too, so doubly thanks for bringing it to my attention.

cclgroupltd commented 8 months ago

30aa895 fixes the incorrect reading of those records, which should completely remove the requirement for that None check. Could you give that a go with your data and let me know if it works for you (and if so, could you remove that check from this PR please?).

lxndrblz commented 8 months ago

@cclgroupltd Thanks for your quick fix. I will give it a go today and report back. Once confirmed, I'll also remove the null checks from the PR.

lxndrblz commented 8 months ago

@cclgroupltd I can confirm that the None checks are no longer necessary for the databases I have. Therefore, I have removed the relevant pieces of code from the PR.

cclgroupltd commented 8 months ago

@lxndrblz great, thank you for all your work on this.

I should be able to run it through with some of my test data later today. As I said at the start, it all looks good at a glance, but I will sleep better knowing I've tested it myself!

cclgroupltd commented 8 months ago

Pinging you @lxndrblz to let you know I've had a test of it, and merged it (via another testing branch where I tidied some stuff up).

Thanks again for all your work!