cclgroupltd / ccl_chromium_reader

(Sometimes partial) Python re-implementations of the technologies involved in reading various data sources in Chrome-esque applications.
MIT License

ValueError: Didn't get version tag in the header #1

Closed joliveira98 closed 1 year ago

joliveira98 commented 3 years ago

Hello,

I've been trying to retrieve the key/value information from a specific leveldb on Chrome IndexedDB and I keep getting a ValueError exception.

ValueError: Didn't get version tag in the header

This happens on the records iteration and it crashes on the following verification:

def _read_header(self) -> int:
    tag = self._read_tag()
    if tag != Constants.token_kVersion:
        raise ValueError("Didn't get version tag in the header")
    version = self._read_le_varint()[0]
    return version

Apparently my version tag value is 0x01 and it should be 0xff.

This only happens with one specific leveldb. Do you know why the version tag has this value? Shouldn't it be 0xff instead of 0x01?
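
For readers following along, the check quoted above can be reproduced in isolation. This is a minimal sketch, not the library's code: `VERSION_TAG`, `read_le_varint` and `read_header` are hypothetical stand-ins, and the only facts taken from the thread are that the expected tag is 0xff and that a little-endian varint version follows it.

```python
import io

# Value the thread reports for Constants.token_kVersion.
VERSION_TAG = b"\xff"


def read_le_varint(stream):
    """Read an unsigned little-endian base-128 varint; return (value, raw_bytes)."""
    result = 0
    shift = 0
    raw = bytearray()
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise ValueError("Ran out of data while reading a varint")
        byte = chunk[0]
        raw.append(byte)
        result |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        shift += 7
        if not byte & 0x80:  # high bit clear marks the last byte
            return result, bytes(raw)


def read_header(stream):
    """Hypothetical stand-in for _read_header: one tag byte, then a varint version."""
    tag = stream.read(1)
    if tag != VERSION_TAG:
        raise ValueError(f"Expected version tag {VERSION_TAG!r}, got {tag!r}")
    version, _ = read_le_varint(stream)
    return version
```

A value that starts with 0x01 rather than 0xff fails this check immediately, which matches the exception reported here.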

cclgroupltd commented 3 years ago

Hi there,

I'll need to dive back into the Chrome source to work out what may be happening/if it's allowed to happen. If it's not sensitive, is it possible to share the blob of data which causes the error? Is it every value in the leveldb or do you get through some first? Is the database representative of recent data, or is it a little older?

joliveira98 commented 3 years ago

Hello,

"If it's not sensitive, is it possible to share the blob of data which causes the error?"
I'd prefer not to share the blob: the data comes from a Zoom meeting in the browser, so I'm not sure what information about my Zoom account might be stored in it. To reproduce the leveldb and blob, you just need to host a Zoom meeting in the browser and both will be created. I believe this behaviour will occur with every leveldb and blob related to Zoom web meetings.

"Is it every value in the leveldb or do you get through some first?"
Every value in the leveldb for Zoom raises this exception.

"Is the database representative of recent data, or is it a little older?"
It is recent. The data was created on the day I tested it, 2 days ago.

I've tested the script with Twitter and Google Drive related blobs and it worked fine. Those blobs are also recent (December 10).

cclgroupltd commented 3 years ago

Great, thank you for the details. I will get back to you when I've had a chance to have a look at some data.

docelic commented 2 years ago

Same issue. With print added in _read_header:

    def _read_header(self) -> int:
        tag = self._read_tag()
        print('tag/token_kVersion', tag, Constants.token_kVersion)
        if tag != Constants.token_kVersion:
            raise ValueError("Didn't get version tag in the header")
        version = self._read_le_varint()[0]
        return version

The output ends up being:

<WrappedDatabase: id=1; name=somedb; origin=https_someweb_0@1>
<WrappedObjectStore: object_store_id=1; name=somestore>

tag/token_kVersion b'\xff' b'\xff'
tag/token_kVersion b'\xff' b'\xff'
tag/token_kVersion b'\xff' b'\xff'
tag/token_kVersion b'\xff' b'\xff'
tag/token_kVersion b'\x01' b'\xff'
Traceback (most recent call last):
    for record in store.iterate_records():
    yield from self._raw_db.iterate_records(
    deserializer = ccl_v8_value_deserializer.Deserializer(
    self.version = self._read_header()
    raise ValueError("Didn't get version tag in the header")
ValueError: Didn't get version tag in the header

cclgroupltd commented 2 years ago

Hi,

Are you able to share the data that raised this error? If you comment out the if/raise lines in there, does the code proceed as expected?

intelligentpotato commented 1 year ago

Hi,

Thank you for putting so much effort into developing such a tool and open sourcing it.

I am working on recovering Proton Mail messages from cache and ran into the same issue.

If I comment out the raise ValueError("Didn't get version tag in the header") line, another exception is raised at line 600 of ccl_v8_value_deserializer.py:

if func is None:
  raise ValueError(f"Unknown tag {tag}")

Traceback (most recent call last):
  File "/root/ccl_chrome_indexeddb/raw.py", line 31, in <module>
    for record in db.iterate_records(db_id_meta.dbid_no, obj_store_id):
  File "/root/ccl_chrome_indexeddb/ccl_chromium_indexeddb.py", line 564, in iterate_records
    value = deserializer.read()
  File "/root/ccl_chrome_indexeddb/ccl_v8_value_deserializer.py", line 627, in read
    return self._read_object()
  File "/root/ccl_chrome_indexeddb/ccl_v8_value_deserializer.py", line 611, in _read_object
    tag, o = self._read_object_internal()
  File "/root/ccl_chrome_indexeddb/ccl_v8_value_deserializer.py", line 600, in _read_object_internal
    raise ValueError(f"Unknown tag {tag}")
ValueError: Unknown tag b'\xff'

I am willing to share the dataset with you privately if it helps, just let me know the email where I can send it.

cclgroupltd commented 1 year ago

Hi there,

So yes, if you'd like to share the data that might help and you can do so on alex[dot]caithness[at]cclsolutionsgroup[dot]com - it may be best to share it to dropbox or similar so that it doesn't get eaten by filters.

There may be things already in the code to help though - if you check out the code here: https://github.com/cclgroupltd/ccl_chrome_indexeddb#wrapper-api and in particular:

for record in obj_store.iterate_records(
        errors_to_stdout=True,
        bad_deserializer_data_handler=lambda k, v: print(f"error: {k}, {v}")):
    print(record.user_key)
    print(record.value)

There is a way of calling the iterate_records function which can include a function callback to handle errors (or print them to stdout instead of raising them) - if the record that is causing the error is malformed, this would be the way to deal with it.
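
If printing is not enough, the same callback hook could collect the failing records for later inspection. A minimal sketch, assuming only what the snippet above shows (the handler is called with a key and a value); `make_error_collector` is a hypothetical helper, not part of the library.

```python
def make_error_collector():
    """Return (failures, handler); handler appends each failing (key, value) pair."""
    failures = []

    def handler(key, raw_value):
        # Keep the raw pair so the bytes can be examined after iteration ends.
        failures.append((key, raw_value))

    return failures, handler


# Intended use, following the call shown in the thread (not executed here):
#   failures, handler = make_error_collector()
#   for record in obj_store.iterate_records(bad_deserializer_data_handler=handler):
#       ...
#   print(len(failures), "records could not be deserialized")
```

Collecting the pairs makes it easy to check whether every failing value starts with the same bytes.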

Let me know how you get on.

dg-data commented 1 year ago

Hello,

The record structure changed in newer Blink versions. First, the IDB value wrapping; see https://chromium.googlesource.com/chromium/src/+/refs/heads/main/third_party/blink/renderer/modules/indexeddb/idb_value_wrapping.cc

The wrapping detection logic in IDBValueUnwrapper::IsWrapped() must be able to distinguish between SSV byte sequences produced and byte sequences expressing the fact that an IDBValue has been wrapped and requires post-processing. SSV processing command replacing the SSV data bytes with a Blob's contents:

1) 0xFF - kVersionTag
2) 0x11 - kRequiresProcessingSSVPseudoVersion
3) 0x01 - kReplaceWithBlob
4) varint - Blob size
5) varint - the offset of the SSV-wrapping Blob in the IDBValue list of Blobs

The Python code expects a version tag in position 3, but in this case that byte is 0x01 (kReplaceWithBlob).
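
The five-field layout listed above can be checked against a raw value with a few lines of Python. This is a sketch under stated assumptions, not the library's implementation: `WrappedValueInfo`, `le_varint_from_bytes` and `parse_wrapped_header` are hypothetical names, and the byte values are the ones quoted from idb_value_wrapping.cc in this comment.

```python
from typing import NamedTuple, Optional, Tuple

VERSION_TAG = 0xFF          # kVersionTag
REQUIRES_PROCESSING = 0x11  # kRequiresProcessingSSVPseudoVersion
REPLACE_WITH_BLOB = 0x01    # kReplaceWithBlob


class WrappedValueInfo(NamedTuple):
    blob_size: int      # size of the Blob holding the real serialized value
    blob_offset: int    # position of that Blob in the IDBValue's list of Blobs
    header_length: int  # number of bytes consumed by the wrapping header


def le_varint_from_bytes(data: bytes, start: int = 0) -> Tuple[int, int]:
    """Decode an unsigned little-endian base-128 varint; return (value, bytes_used)."""
    result = 0
    shift = 0
    idx = start
    while True:
        if idx >= len(data):
            raise ValueError("Ran out of data while reading a varint")
        byte = data[idx]
        idx += 1
        result |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            return result, idx - start


def parse_wrapped_header(data: bytes) -> Optional[WrappedValueInfo]:
    """Return the wrapping details, or None if the value is not wrapped."""
    marker = bytes([VERSION_TAG, REQUIRES_PROCESSING, REPLACE_WITH_BLOB])
    if data[:3] != marker:
        return None
    idx = 3
    blob_size, used = le_varint_from_bytes(data, idx)
    idx += used
    blob_offset, used = le_varint_from_bytes(data, idx)
    idx += used
    return WrappedValueInfo(blob_size, blob_offset, idx)
```

Returning None for unwrapped values lets a caller fall through to the ordinary deserialization path unchanged.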

The other change is in the Blink envelope; see https://github.com/chromium/chromium/blob/main/third_party/blink/renderer/bindings/core/v8/serialization/v8_script_value_deserializer.cc

// These versions expect a trailer offset in the envelope.
if (version >= TrailerReader::kMinWireFormatVersion) {
      static constexpr size_t kTrailerOffsetDataSize = 1 + sizeof(uint64_t) + sizeof(uint32_t);

So iterate_records should now do something like this:


                # A 0x01 byte here (kReplaceWithBlob) means the value was wrapped in a Blob
                require_processing = record.value[val_idx]
                if require_processing == 0x01:
                    val_idx += 1
                    # varint: Blob size
                    blob_size, varint_raw = _le_varint_from_bytes(record.value[val_idx:])
                    val_idx += len(varint_raw)
                    # varint: offset of the Blob in the IDBValue's list of Blobs
                    blob_offset, varint_raw = _le_varint_from_bytes(record.value[val_idx:])
                    val_idx += len(varint_raw)

                # trailer offset
                if blink_version >= 21:
                    val_idx += 1 + 8 + 4  # 1 + sizeof(uint64_t) + sizeof(uint32_t)
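
The trailer-offset step can also be written as a small standalone helper. A minimal sketch, assuming the version threshold of 21 used in the snippet above; `skip_trailer_offset` and the constant names are hypothetical, and the struct format string is used only to compute the 1 + 8 + 4 byte size, not to interpret the fields.

```python
import struct

# Threshold taken from the snippet in this comment (assumed to correspond to
# TrailerReader::kMinWireFormatVersion).
TRAILER_MIN_BLINK_VERSION = 21

# 1 byte + uint64_t + uint32_t with no padding, matching kTrailerOffsetDataSize.
TRAILER_OFFSET_DATA_SIZE = struct.calcsize("<BQI")


def skip_trailer_offset(value: bytes, val_idx: int, blink_version: int) -> int:
    """Return the index just past the trailer offset data, if this version has it."""
    if blink_version < TRAILER_MIN_BLINK_VERSION:
        return val_idx  # older envelopes carry no trailer offset
    end = val_idx + TRAILER_OFFSET_DATA_SIZE
    if end > len(value):
        raise ValueError("Value is too short to hold the trailer offset data")
    return end
```

The bounds check is an addition for safety; the snippet above advances the index unconditionally.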

cclgroupltd commented 1 year ago

@dg-data thanks for highlighting this - it has highlighted some other changes in recent versions which I also need to address, so I'll be jumping on that as soon as I have a chance.

cclgroupltd commented 1 year ago

@dg-data a little more context in case you're interested - on the blob wrapping side of things, it looks like this happens if the serialized data exceeds kIDBWrapThreshold which is 65536 (at the moment). In that case the serialized data is referenced in a blob entry for that key - functionally I think that means that it's in a separate file on disk, but I'm putting together test data at the moment to check exactly what is going on...

Edit: confirmed, that's exactly what's going on. That complicates things a bit, but a lot of the groundwork is already in the code so it's not too bad.

dg-data commented 1 year ago

@cclgroupltd Hi! Thanks for your quick reaction. I think the basic logic stayed untouched in Blink; class IndexedDBExternalObject and the blob handling look good. The route to the data comes from records with a key where the index ID is 3 – the “external object table” and the blob info in it. Deserialization of that blob should work.

cclgroupltd commented 1 year ago

@dg-data yep, the logic for looking up the external data is already in our module because it's how "File objects" are accessed. It shouldn't be too tricky to plumb that all in.

cclgroupltd commented 1 year ago

@dg-data could you give the most recent commit a go if you have suitable test data? It is working with my test data, but if you have real-world data to run it against that would be useful!

dg-data commented 1 year ago

@cclgroupltd Thanks for the update. I checked the most recent version and found no mistakes. I tested various records (Blink v.17, 20, 21) including the ones with externally serialized objects. As far as I see it looks pretty good, at least solved the issue I had. Great work, saved my data!

cclgroupltd commented 1 year ago

Fantastic. I'll close this issue now then, thanks for your help!