cclgroupltd / ccl_chromium_reader

(Sometimes partial) Python re-implementations of the technologies involved in reading various data sources in Chrome-esque applications.
MIT License

ValueError: Blink type tag not present #2

Closed etikatech closed 5 months ago

etikatech commented 3 years ago

Hi,

Thanks for your efforts in developing this code! Much appreciated.

I have a problem with the code stopping during the iterate_records function within ccl_chromium_indexeddb.py.

The code stops here:

                if blink_type_tag != 0xff:
                    # TODO: probably don't want to fail hard here long term...
                    raise ValueError("Blink type tag not present")

I'm finding that some records in my Skype IndexedDB have a blink_type_tag of 16, which causes the code to stop at the above location.

I've implemented a temporary workaround to allow the code to print a message and continue rather than stopping at this point:

                if blink_type_tag != 0xff:
                    # TODO: probably don't want to fail hard here long term...
                    #raise ValueError("Blink type tag not present")
                    print("***** Skipping record with unknown blink_type_tag: " + str(blink_type_tag) + " *****")
                    continue

After running the above modified code, I found that for my Skype IndexedDB, I get approximately 5,000 affected records with a blink_type_tag of 16, out of a total of 20,000 records.

On a potentially related matter, I'm also seeing 'ccl_v8_value_deserializer._Undefined object at 0xXXX' errors when printing out the values (this might be a direct consequence of my code modification above, but I thought you should know):

print(f"key: {record.value}")

key: {'_serverMessages': [{'id': '1525335581949', 'originalarrivaltime': '2018-05-03T08:19:38.846Z', 'messagetype': 'RichText', 'version': '1525335581949', 'composetime': '2018-05-03T08:19:38.846Z', 'clientmessageid': '10187620813155729536', 'conversationLink': 'https://hk2-2-client-s.gateway.messenger.live.com/v1/users/ME/conversations/removed@thread.skype', 'content': 'OK. I&apos;ll provide you the photos that we could do for the embossed texture. ', 'type': 'Message', 'conversationid': 'removed@thread.skype', 'from': 'https://hk2-2-client-s.gateway.messenger.live.com/v1/users/ME/contacts/8:removed'}], 'cuid': '10187620813155729536', 'conversationId': 'removed@thread.skype', 'createdTime': 1525335581949.0, 'creator': '8:removed', 'content': 'OK. I&apos;ll provide you the photos that we could do for the embossed texture. ', 'messagetype': 'RichText', 'contenttype': <ccl_v8_value_deserializer._Undefined object at 0x7eff208b68b0>, 'properties': <ccl_v8_value_deserializer._Undefined object at 0x7eff208b68b0>, '_isEphemeral': False, '_fileEncryptionKeys': <ccl_v8_value_deserializer._Undefined object at 0x7eff208b68b0>, '_countsType': 1, '_isMyMessage': 0}

Any assistance would be greatly appreciated!

cclgroupltd commented 3 years ago

Hi there, thanks for the report.

I'll do the easy one first: the undefined thing you're seeing isn't an error. It represents the JavaScript "undefined" value, which is different from null (or None in Python). The problem is that there's no nice __str__ or __repr__ defined for that class. That's an easy fix, and I'll get it done shortly.
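
To illustrate why the values printed as `<... object at 0x...>`, here is a minimal sketch of a sentinel class for JavaScript's "undefined" that defines `__repr__`. The class name `JsUndefined` and the `<Undefined>` text are assumptions made for this example; this is not the library's actual `_Undefined` class or the fix that was pushed.

```python
class JsUndefined:
    """Hypothetical stand-in for JavaScript's `undefined` (not the library's class)."""

    _instance = None

    def __new__(cls):
        # Keep a single shared instance so identity checks (`is`) work.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __bool__(self):
        # `undefined` is falsy in JavaScript, so mirror that here.
        return False

    def __repr__(self):
        # Without this, Python falls back to "<... object at 0x...>".
        return "<Undefined>"


# Containers use repr() for their items, so the dict now prints cleanly.
record = {"contenttype": JsUndefined(), "properties": None}
print(record)
```

Keeping the sentinel distinct from `None` preserves the difference between a JavaScript property that is `undefined` and one that is `null`.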

cclgroupltd commented 3 years ago

So the version tag issue: I'll go back into the Chrome source and have a look at what is permissible there - I vaguely recall reading something in there about the version tag being able to deviate, but I wasn't able to generate any test data which did so. If it's not sensitive, are you able to share one of the blobs from the leveldb which exhibits this so I can cross reference it with the source?

etikatech commented 3 years ago

Thanks for the reply. I assume you are referring to the "Blink Type tag" issue, rather than "version tag issue"? Or are these related?

I can try and find a non-sensitive blob for you. Can you provide guidance on how I could extract one for you from the leveldb? And what format to supply it in? (Or could I achieve this by inserting a print 'something' command as part of my workaround code above?)

cclgroupltd commented 3 years ago

Hi, first: I've pushed an update that'll stop the Undefined values from looking weird!

To export the blob, maybe temporarily alter your update to:

if blink_type_tag != 0xff:
    # TODO: probably don't want to fail hard here long term...
    #raise ValueError("Blink type tag not present")
    with open(f"example_data_blink_tag_{record.seq}.bin", "wb") as temp_out:
        temp_out.write(record.value)
    continue

That should dump the contents of each nonconforming record to a file in your working directory... that is going to create 5000ish files of course based on your example above, so you may want to add a counter or just break the script part way through!
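
The "add a counter" idea could look something like the sketch below. It is a standalone helper rather than a patch to the library: the function name `dump_bad_records`, the `limit` parameter and the `(seq, raw_bytes)` input shape are all assumptions for illustration.

```python
import os
import tempfile


def dump_bad_records(records, out_dir, limit=10):
    """Write at most `limit` raw record values to files in `out_dir`.

    `records` is any iterable of (seq, raw_bytes) pairs; this shape is an
    assumption for the sketch, not the library's record type.
    """
    written = []
    for seq, raw in records:
        if len(written) >= limit:
            break  # stop early instead of creating thousands of files
        path = os.path.join(out_dir, f"example_data_blink_tag_{seq}.bin")
        with open(path, "wb") as out:
            out.write(raw)
        written.append(path)
    return written


# Usage with made-up data: 5000 fake records, but only 3 files are written.
fake_records = ((seq, bytes([seq % 256])) for seq in range(5000))
with tempfile.TemporaryDirectory() as tmp:
    paths = dump_bad_records(fake_records, tmp, limit=3)
    print(len(paths))
```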

etikatech commented 3 years ago

Hi,

Thanks for the update. The new "undefined" message definitely looks better in the output now.

I've had a look at the contents of the nonconforming records by printing out record.value. I can see that they definitely contain Skype messages, which are of interest to us.

I can send you a few sample dump files, as per your earlier suggestion, but due to their potentially sensitive nature, I will need to send these to you over a separate/private channel. (I don't think that GitHub supports such a mechanism). Would it be OK if I reach out to you directly via the CCL website contact page? Or do you have a better suggestion?

Thanks in advance.

cclgroupltd commented 3 years ago

Hi there, yes, please do; if you mark it fao Alex in R&D then it should get to me. I look forward to hearing from you.

etikatech commented 3 years ago

Thanks Alex. I just sent you a message via the CCL website.

cclgroupltd commented 3 years ago

Thank you for that. I'm having a rotten time trying to find a rationale for the data however...

I had written up my findings for the header where the problem resided, but I've just noticed something else peculiar about the data that I have no idea how I could have missed. In the files you sent, after the initial version varint (or possibly even including it, but that makes one of the versions look incorrect), every 2 bytes are swapped. That's really weird, isn't it? That isn't the case with the example message (with the Undefined values) you provided above.

Anyway - if you swap each pair of bytes starting at offset 3 in all of the examples you kindly provided to me, the data looks right... so now the mystery is...why the byte swapping? If it's a bug it is happening earlier in the process, but I can't see where or why. I'll have to think around this some more.

Alongside the mystery is a big API issue which is: the Exception shouldn't be happening without a way to handle it when you're iterating records in a for-loop. That's something I can address, probably with a switch to log errors rather than throw them, or with a callback function. Or both options maybe.
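
As a rough sketch of the "callback or log" idea, a generator can catch the failure per record and either hand the raw bytes to a caller-supplied function or log a warning, then carry on. Every name here (`iterate_decoded`, `on_bad_data`, `strict_decode`) is hypothetical, and this is not necessarily how the library implements it.

```python
import logging

logger = logging.getLogger("record_iteration")


def iterate_decoded(raw_records, decode, on_bad_data=None):
    """Yield (key, decoded_value) pairs, routing decode failures elsewhere.

    All names here are hypothetical. `raw_records` is an iterable of
    (key, raw_bytes) pairs and `decode` is any callable that may raise.
    """
    for key, raw in raw_records:
        try:
            value = decode(raw)
        except Exception as exc:
            if on_bad_data is not None:
                on_bad_data(key, raw)  # hand the bad bytes to the caller
            else:
                logger.warning("Skipping record %r: %s", key, exc)
            continue
        yield key, value


def strict_decode(raw):
    # Toy decoder standing in for the real deserializer.
    if raw[:1] != b"\xff":
        raise ValueError("Blink type tag not present")
    return raw[1:].decode("ascii")


bad = []
data = [(b"k1", b"\xffhello"), (b"k2", b"\x10oops"), (b"k3", b"\xffworld")]
good = list(
    iterate_decoded(data, strict_decode, on_bad_data=lambda k, r: bad.append((k, r)))
)
print(good)
print(bad)
```

With this shape the for-loop over records never sees the exception, and the caller still gets the raw bytes of anything that failed.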

cclgroupltd commented 3 years ago

Commit 24b97c5c029dd09ffcd587a1d056bd40108c662d will make it easier to skip the bad data. Readme has been updated to show you how.

etikatech commented 3 years ago

Thanks for the update. Yes, I did notice that the affected messages had every 2 bytes swapped!

Let me know if I can assist with this. Feel free to supply code with debug statements that I can run against my data if you wish.

etikatech commented 3 years ago

Any updates on the cause of the byte swapping in some of the messages?

As a temporary workaround, I'm thinking of trying to write some code to do the byte swapping for us, so that we can successfully extract these messages. I'm OK with Python, but this particular task could take me a while to work out on my own. Any chance you can suggest some suitable temporary code to do this?

I have tried running the script on a Skype IndexedDB from a different system and it worked just fine. The only difference between the two IndexedDBs that I can think of is that the failing datastore will most likely contain some messages with non-Latin characters (specifically Chinese characters). I'm not sure if this is related to the root cause of the issue, but I thought it might be worth mentioning.

Hope this helps. Thanks again.

cclgroupltd commented 3 years ago

Hi there, sorry I've been on leave so I haven't had a lot of time to dig. I can't see any obvious reason why it's happening at the moment regardless. If it was only in the text/in fixed length numerical values there may be some logic, but it's literally everything. I wonder if it's some kind of semi-secure delete attempt?

As for byte swapping: it looks like it's everything from offset 3 onwards, so something like this (untested):

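swappable_data = record.data[3:]  # raw bytes of the bad record, from offset 3
# Only whole pairs are swapped; i ^ 1 maps index 0<->1, 2<->3, and so on.
paired_length = len(swappable_data) - (len(swappable_data) % 2)
swapped_data = bytes(swappable_data[i ^ 1] for i in range(paired_length))
if len(swappable_data) % 2 == 1:
    # A trailing unpaired byte is kept as-is.
    swapped_data += swappable_data[-1:]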

(the condition at the end there makes assumptions about how the end of the swapped data is handled).
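
For anyone wanting to try this outside the library, the same idea can be wrapped in a standalone function, sketched below. The function name `swap_byte_pairs` and its `offset` parameter are made up for this example, and it keeps the same assumption that a trailing unpaired byte is left where it is.

```python
def swap_byte_pairs(data: bytes, offset: int = 3) -> bytes:
    """Swap each adjacent pair of bytes from `offset` onwards.

    Bytes before `offset` are kept as they are. If an odd number of bytes
    follows the offset, the final byte is left in place (an assumption).
    """
    head, tail = data[:offset], bytearray(data[offset:])
    paired_length = len(tail) - (len(tail) % 2)
    # Slice assignment: even positions take the odd bytes and vice versa.
    tail[0:paired_length:2], tail[1:paired_length:2] = (
        tail[1:paired_length:2],
        tail[0:paired_length:2],
    )
    return head + bytes(tail)


# Made-up sample: three untouched bytes followed by pair-swapped text.
sample = b"\x00\x01\x02" + b"badcfe"
print(swap_byte_pairs(sample))
```

Because swapping pairs is its own inverse, applying the function twice returns the original bytes, which makes a quick sanity check on real data.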

You'll need to trap the bad data and run it through the deserializer "manually". The relevant code can be found around line 450 of ccl_chromium_indexeddb and looks roughly like this (obj_raw is your record data):

import ccl_blink_value_deserializer  # module names as used in the calls below
import ccl_v8_value_deserializer

blink_deserializer = ccl_blink_value_deserializer.BlinkV8Deserializer()
deserializer = ccl_v8_value_deserializer.Deserializer(obj_raw, host_object_delegate=blink_deserializer.read)
try:
    value = deserializer.read()  # value is the processed object
except Exception:
    # Pass the bad record to the handler if one was supplied, otherwise re-raise.
    if bad_deserializer_data_handler is not None:
        bad_deserializer_data_handler(key, record.value)
    else:
        raise

cclgroupltd commented 5 months ago

Closing as inactive (also, the problem seems to have been specific to this data).