YuvrajRaghuvanshiS / WhatsApp-Key-Database-Extractor

The most advanced and complete solution for extracting WhatsApp key/DB from package directory (/data/data/com.whatsapp) without root access.
MIT License

"Unexpected end of data" when extracting tar #111

Closed Frankprog03 closed 1 year ago

Frankprog03 commented 1 year ago

Everything works fine until the final extraction of the tar archive. The archive is probably corrupted, and I have no idea whether that is caused by abe.jar or something else. I tried backing up at least 10 times, with and without the backup password, but I get the same result every time. Extracting the tar with third-party software such as 7-Zip fails the same way.

This is the log of the last stage of the script:

[Wednesday 03/05/2023, 16:34:37] >>> I am in view_extract.extract_ab(is_java_installed=True, is_tar_only=False)
[Wednesday 03/05/2023, 16:34:37] Found "whatsapp.ab" in "tmp" folder. Continuing... Size: 594847464 bytes.
[Wednesday 03/05/2023, 16:34:37] Enter a name for this user (default "user").: 
[Wednesday 03/05/2023, 16:34:37] Enter same password which you entered on device when prompted earlier.: ********
[Wednesday 03/05/2023, 16:34:37] Successfully unpacked "tmp/whatsapp.ab" to "tmp/whatsapp.tar". Size: 1358559744 bytes.
[Wednesday 03/05/2023, 16:34:37] >>> I am in view_extract.taking_out_main_files(username=user)
[Wednesday 03/05/2023, 16:34:37] Folder "extracted/" already exists.
[Wednesday 03/05/2023, 16:34:37] Folder "extracted/user" already exists.
[Wednesday 03/05/2023, 16:34:37] Taking out main files in "tmp/" folder temporarily.
[Wednesday 03/05/2023, 16:34:37] unexpected end of data
[Wednesday 03/05/2023, 16:34:37] >>> I am in view_extract.clean_tmp()
[Wednesday 03/05/2023, 16:34:37] Cleaning up "tmp/" folder...
[Wednesday 03/05/2023, 16:34:37] [WinError 32] The process cannot access the file because it is being used by another process: 'tmp/whatsapp.tar'
[Wednesday 03/05/2023, 16:34:37] >>> I am in view_extract.kill_me(reason=)

My device: Honor View 10 (BKL-L09) OS: Windows 10

YuvrajRaghuvanshiS commented 1 year ago

Try running with --tar-only and extract the tar manually.

aditeyabaral commented 1 year ago

@YuvrajRaghuvanshiS The error occurs regardless of the flag's usage. As @Frankprog03 posted, if you use the flag you end up with this error on Windows:

[Friday 05/05/2023, 11:43:27] >>> I am in view_extract.extract_ab(is_java_installed=True, is_tar_only=False)
[Friday 05/05/2023, 11:43:27] Found "whatsapp.ab" in "tmp" folder. Continuing... Size: 146647497 bytes.
[Friday 05/05/2023, 11:43:27] Enter a name for this user (default "user").: 
[Friday 05/05/2023, 11:43:27] Enter same password which you entered on device when prompted earlier.: ********
[Friday 05/05/2023, 11:43:27] Successfully unpacked "tmp/whatsapp.ab" to "tmp/whatsapp.tar". Size: 327728644 bytes.
[Friday 05/05/2023, 11:43:27] >>> I am in view_extract.taking_out_main_files(username=user)
[Friday 05/05/2023, 11:43:27] Folder "extracted/" already exists.
[Friday 05/05/2023, 11:43:27] Taking out main files in "tmp/" folder temporarily.
[Friday 05/05/2023, 11:43:27] unexpected end of data
[Friday 05/05/2023, 11:43:27] >>> I am in view_extract.clean_tmp()
[Friday 05/05/2023, 11:43:27] Cleaning up "tmp/" folder...
[Friday 05/05/2023, 11:43:27] [WinError 32] The process cannot access the file because it is being used by another process: 'tmp/whatsapp.tar'
[Friday 05/05/2023, 11:43:27] >>> I am in view_extract.kill_me(reason=)

However, if you do not use the flag, and then manually extract using tar -xvf user.tar, you get the following:

(...)
apps/com.whatsapp/db/media.db
apps/com.whatsapp/db/stickers.db-shm
apps/com.whatsapp/db/stickers.db-wal
apps/com.whatsapp/db/payments.db-shm
apps/com.whatsapp/db/payments.db-wal
apps/com.whatsapp/db/location.db-shm
apps/com.whatsapp/db/location.db-wal
apps/com.whatsapp/db/location.db
apps/com.whatsapp/db/msgstore.db-shm
apps/com.whatsapp/db/msgstore.db-wal
apps/com.whatsapp/db/msgstore.db
tar: Unexpected EOF in archive
tar: rmtlseek not stopped at a record boundary
tar: Error is not recoverable: exiting now

I tried running this on both Windows 11 and WSL2. The tar fails to extract on both.
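For anyone who wants to see how far the archive is readable, here is a minimal sketch using Python's standard tarfile module. The function name list_readable_members is made up for this example, and it assumes an uncompressed tar opened as a binary file object. The "unexpected end of data" text in the log appears to be the same message tarfile raises as a ReadError when a member's data is cut short.

```python
import tarfile


def list_readable_members(fileobj):
    """Walk an uncompressed tar and return (names, error).

    names holds every member whose header could be read; if the archive is
    truncated, error is the tarfile.ReadError that stopped the walk and the
    last name is the member whose data is probably incomplete.
    """
    names = []
    error = None
    try:
        with tarfile.open(fileobj=fileobj, mode="r:") as tar:
            for member in tar:
                names.append(member.name)
    except tarfile.ReadError as exc:
        error = exc
    return names, error


# Example usage (the path is only an illustration):
# with open("tmp/whatsapp.tar", "rb") as f:
#     names, error = list_readable_members(f)
#     print(len(names), "members listed; last:", names[-1], "; error:", error)
```

If error is not None, the tar was cut off somewhere inside or after the last listed member, which would point at the .ab to .tar conversion rather than at the extraction step.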

System information:

Java: OpenJDK 1.8.0_362 (Java 8)

{
  "Architecture": "x86_64",
  "Hostname": "AN515-45",
  "Platform": "Linux",
  "Platform Release": "5.15.90.1-microsoft-standard-WSL2",
  "Platform Version": "#1 SMP Fri Jan 27 02:56:13 UTC 2023",
  "Processor": "x86_64",
  "RAM": "7 GB",
  "Python": [
    "main",
    "Mar 10 2023 10:55:28"
  ]
}

Frankprog03 commented 1 year ago

@aditeyabaral The problem seems to appear when extracting msgstore.db, leaving it corrupted (unfortunately, this is the most important file here...). Also, the key doesn't exist in the archive. The extracted .ab file is approximately 5GB in my case, but the resulting tar is only 1.5GB. I'm not sure if this is normal, as I have no experience with Android backups. My guess is that for some reason the conversion from .ab to .tar is going wrong, so maybe the problem is with abe.jar... (?)

I also tried with WSL2, getting the same.
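To rule out a damaged backup header before blaming the converter, the first few lines of the .ab file can be inspected. This is a rough sketch: the function name read_ab_header is hypothetical, and it assumes the layout of four newline-terminated lines (magic, version, compression flag, encryption scheme) that the extractor scripts in this thread also rely on.

```python
def read_ab_header(fileobj):
    """Parse the plain-text header at the start of an Android backup (.ab) file."""
    magic = fileobj.readline().rstrip(b"\n")
    if magic != b"ANDROID BACKUP":
        raise ValueError("Magic is not 'ANDROID BACKUP'")
    version = int(fileobj.readline().strip())
    compressed = fileobj.readline().strip() == b"1"
    encryption = fileobj.readline().strip().decode("ascii")
    return {
        "version": version,
        "compressed": compressed,
        "encryption": encryption,
        # Only the start of the payload when encryption is "none";
        # encrypted backups carry extra header lines after this point.
        "payload_offset": fileobj.tell(),
    }


# Example usage (the path is only an illustration):
# with open("tmp/whatsapp.ab", "rb") as f:
#     print(read_ab_header(f))
```

A header that parses cleanly says nothing about whether the rest of the file is complete, but it does confirm whether the backup is compressed and whether a password was applied.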

aditeyabaral commented 1 year ago

@Frankprog03 The msgstore.db file is already decrypted, so there is no need for a key actually. The issue is, as you said, the file getting corrupted while converting from .ab to .tar. We will have to wait for @YuvrajRaghuvanshiS to give us a better idea of how to debug this issue.

Frankprog03 commented 1 year ago

@aditeyabaral Yeah, I know. The key file would have been useful if it had been extracted before the EOF, because then I could easily copy and decrypt msgstore.db.crypt14, which can be read normally.

aditeyabaral commented 1 year ago

@Frankprog03 can the script be modified to copy the key file first to the system? Then we can decrypt any of the files later as well. Unfortunately I do not have much idea about backups so I am not sure how accurately this would work.
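The order of files inside the backup is not something the script controls, as far as I can tell, but whatever was written completely before the cut-off can still be pulled out. A minimal sketch, assuming an uncompressed tar and a hypothetical helper name salvage_members; it keeps file contents in memory, so it is only sensible for small files such as a key, not for a multi-gigabyte database.

```python
import tarfile


def salvage_members(fileobj, wanted_suffixes):
    """Return {member name: bytes} for complete files matching wanted_suffixes.

    Reading stops quietly at the first truncated member, so everything
    returned was fully present in the archive.
    """
    recovered = {}
    try:
        with tarfile.open(fileobj=fileobj, mode="r:") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                if not member.name.endswith(tuple(wanted_suffixes)):
                    continue
                # read() raises tarfile.ReadError if the data is cut short
                recovered[member.name] = tar.extractfile(member).read()
    except tarfile.ReadError:
        pass  # truncated archive: keep what was recovered so far
    return recovered


# Example usage (path and suffix are only illustrations):
# with open("tmp/whatsapp.tar", "rb") as f:
#     files = salvage_members(f, ("/key",))
```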

YuvrajRaghuvanshiS commented 1 year ago

I am not entirely sure what is causing this; "unexpected end of data" (EOF) is too vague an error to go on.

Yes, it may be the case that abe.jar is responsible, but I am not certain. This project is barely maintained and can be considered dead; I don't find enough time to continue working on it (it started out of lockdown boredom).

I really hope that abe.jar is causing it, because while planning to update this repo I was trying to remove external dependencies, and as a result I put together a script (mostly copied :p) that would replace abe.jar (#93).

You can try this with the .ab file you have extracted.

Create the .ab without a password so that this script can be used (just tested, it works):

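import zlib

# Adjust these paths to wherever your backup lives.
AB_PATH = 'D:\\Yuvraj\\Work\\GitHub\\WA-KDBE\\extracted\\crashed\\whatsapp.ab'
TAR_PATH = 'D:\\Yuvraj\\Work\\GitHub\\WA-KDBE\\extracted\\crashed\\whatsapp.tar'

with open(AB_PATH, 'rb') as f:
    f.seek(24)  # skip the 24-byte header of an unencrypted backup
    data = f.read()  # read the rest (the whole zlib stream is held in memory)

tarstream = zlib.decompress(data)  # raises zlib.error if the stream is truncated
with open(TAR_PATH, 'wb') as f:
    f.write(tarstream)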

This is another one which is more extensive.

requirements:

black==23.1.0
certifi==2022.12.7
charset-normalizer==3.1.0
click==8.1.3
colorama==0.4.6
idna==3.4
mypy-extensions==1.0.0
packaging==23.0
pathspec==0.11.1
platformdirs==3.1.1
psutil==5.9.4
pycryptodome==3.17
requests==2.28.2
tqdm==4.65.0
urllib3==1.26.15

import codecs
import ctypes
import zlib
from binascii import hexlify, unhexlify
from struct import pack

from Crypto.Cipher import AES
from Crypto.Protocol.KDF import PBKDF2

class AndroidBackupExtractor:
    CHUNK_SIZE: int = 128 * 1024

    def __init__(self, ab_file_path: str, password: str = "") -> None:
        out_file_path = f"{ab_file_path.split('.ab')[0]}.tar"
        self.ab_file = open(ab_file_path, "rb")
        self.out_file = open(out_file_path, "wb")
        self.password = password.encode("utf-8")

    def read_header(self, ab_file) -> None:
        self.header = dict()
        self.header["version"] = ab_file.readline()[:-1]
        self.header["compression"] = ab_file.readline()[:-1]
        self.header["encryption"] = ab_file.readline()[:-1]

        if self.header["encryption"] == b"none":
            pass
        elif self.header["encryption"] == b"AES-256":
            # get PBKDF2 parameters to decrypt master key blob
            self.header["user_password_salt"] = unhexlify(ab_file.readline()[:-1])
            self.header["master_key_checksum_salt"] = unhexlify(ab_file.readline()[:-1])
            self.header["round"] = int(ab_file.readline()[:-1])
            self.header["user_key_iv"] = unhexlify(ab_file.readline()[:-1])
            self.header["master_key_blob"] = unhexlify(ab_file.readline()[:-1])

            print("user password salt:", hexlify(self.header["user_password_salt"]))
            print(
                "master key checksum salt:",
                hexlify(self.header["master_key_checksum_salt"]),
            )
            print("number of PBKDF2 rounds:", self.header["round"])
            print("user key IV:", hexlify(self.header["user_key_iv"]))
            print("master key blob:", hexlify(self.header["master_key_blob"]))
        else:
            raise RuntimeError(
                f"Unsupported encryption scheme: {self.header['encryption']}"
            )

    def decrypt(self, encrypted_iter, aes_obj):
        for encrypted in encrypted_iter:
            yield aes_obj.decrypt(encrypted)

    def chunk_reader(self, ab_file, chunk_size=CHUNK_SIZE):
        data = ab_file.read(chunk_size)
        while data:
            yield data
            data = ab_file.read(chunk_size)

    def master_key_java_conversion(self, master_key_bytes_array):
        """
        because of byte to Java char before using password data as PBKDF2 key, special handling is required

        from : https://android.googlesource.com/platform/frameworks/base/+/master/services/backup/java/com/android/server/backup/BackupManagerService.java
            private byte[] makeKeyChecksum(byte[] pwBytes, byte[] salt, int rounds) {
            char[] mkAsChar = new char[pwBytes.length];
            for (int i = 0; i < pwBytes.length; i++) {
                mkAsChar[i] = (char) pwBytes[i];               <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< HERE
            }
            Key checksum = buildCharArrayKey(mkAsChar, salt, rounds);
            return checksum.getEncoded();
        }

        Java byte to char conversion (as "Widening and Narrowing Primitive Conversion") is defined here:
        https://docs.oracle.com/javase/specs/jls/se8/html/jls-5.html#jls-5.1.4
        First, the byte is converted to an int via widening primitive conversion (chapter 5.1.2),
        and then the resulting int is converted to a char by narrowing primitive conversion (chapter 5.1.3)

        """
        # Widening Primitive Conversion : https://docs.oracle.com/javase/specs/jls/se8/html/jls-5.html#jls-5.1.2
        to_signed: list[int] = [
            ctypes.c_byte(x).value for x in master_key_bytes_array
        ]  # sign extension
        # Narrowing Primitive Conversion : https://docs.oracle.com/javase/specs/jls/se8/html/jls-5.html#jls-5.1.3
        to_unsigned_16_bits: list[int] = [
            ctypes.c_ushort(x).value & 0xFFFF for x in to_signed
        ]

        """ 
        The Java programming language represents text in sequences of 16-bit code UNITS, using the UTF-16 encoding. 
        https://docs.oracle.com/javase/specs/jls/se8/html/jls-3.html#jls-3.1
        """
        to_bytes: bytes = pack(
            f">{len(to_unsigned_16_bits)}H", *to_unsigned_16_bits
        )  # unsigned short to bytes

        to_utf_16_be: str = codecs.decode(to_bytes, "UTF-16BE")  # from bytes to Utf16
        """ 
        https://developer.android.com/reference/javax/crypto/spec/PBEKeySpec.html
        \"Different PBE mechanisms may consume different bits of each password character. 
        For example, the PBE mechanism defined in PKCS #5 looks at only the low order 8 bits of each character, 
        whereas PKCS #12 looks at all 16 bits of each character. \"  
        """
        to_utf_8: bytes = codecs.encode(
            to_utf_16_be, "UTF-8"
        )  # char must be encoded as UTF-8 first

        return to_utf_8

    def get_AES_decrypter(self, password):
        assert (
            self.header["encryption"] == b"AES-256"
        ), f"Not using AES decryption: {self.header['encryption']}"

        # generate AES key from password and salt
        key: bytes = PBKDF2(
            password, self.header["user_password_salt"], 32, self.header["round"]
        )  # default algo is sha1

        decrypted_master_key_blob: bytes = AES.new(
            key, AES.MODE_CBC, self.header["user_key_iv"]
        ).decrypt(self.header["master_key_blob"])

        # parse decrypted blob
        iv_len: int = decrypted_master_key_blob[0]
        iv: bytes = decrypted_master_key_blob[1 : 1 + iv_len]
        master_key_len: int = ord(
            decrypted_master_key_blob[1 + iv_len : 1 + iv_len + 1]
        )
        master_key: bytes = decrypted_master_key_blob[
            1 + iv_len + 1 : 1 + iv_len + 1 + master_key_len
        ]
        checksum_len: int = ord(
            decrypted_master_key_blob[
                1 + iv_len + 1 + master_key_len : 1 + iv_len + 1 + master_key_len + 1
            ]
        )
        checksum: bytes = decrypted_master_key_blob[
            1
            + iv_len
            + 1
            + master_key_len
            + 1 : 1
            + iv_len
            + 1
            + master_key_len
            + 1
            + checksum_len
        ]
        print("IV length:", iv_len)
        print("IV:", hexlify(iv))
        print("master key length:", master_key_len)
        print("master key:", hexlify(master_key))
        print("check value length:", checksum_len)
        print("check value:", hexlify(checksum))

        # verify password
        to_bytes_2: bytes = self.master_key_java_conversion(
            bytearray(master_key)
        )  # consider data as bytes, not str

        print("PBKDF2 secret value for password verification is: ", end="")
        print(hexlify(to_bytes_2))

        calculated_checksum: bytes = PBKDF2(
            to_bytes_2,
            self.header["master_key_checksum_salt"],
            checksum_len,
            self.header["round"],
        )
        if calculated_checksum != checksum:
            print(
                "computed checksum:",
                hexlify(calculated_checksum),
                "is different than embedded checksum:",
                hexlify(checksum),
            )
        else:
            print("password verification is OK")
        # decryption using master key and iv
        return AES.new(master_key, AES.MODE_CBC, iv)

    def decompress(self, compressed_data_iter, block_size=CHUNK_SIZE):
        decompress_obj = zlib.decompressobj()
        for compressed_data in compressed_data_iter:
            yield decompress_obj.decompress(compressed_data)
        yield decompress_obj.flush()
        if not decompress_obj.eof:
            raise RuntimeError("Incomplete or truncated zlib stream")

    def ab_to_tar(self) -> bool:
        if self.ab_file.readline()[:-1] != b"ANDROID BACKUP":
            raise ValueError('Magic is not "ANDROID BACKUP"')

        # parse header
        self.read_header(self.ab_file)

        if self.header["encryption"] == b"AES-256":
            if not self.password:
                self.password = input("Backup is encrypted, enter password: ").encode(
                    "utf-8"
                )
            compressed_iter = self.decrypt(
                self.chunk_reader(self.ab_file), self.get_AES_decrypter(self.password)
            )
        elif self.header["encryption"] == b"none":
            print("No encryption")
            compressed_iter = self.chunk_reader(self.ab_file)
        else:
            raise ValueError("Unknown encryption")

        # decompression (zlib stream)
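        print("Writing backup as .tar... ", end="", flush=True)
        for decompressed_data in self.decompress(compressed_iter):
            self.out_file.write(decompressed_data)
        bytes_written: int = self.out_file.tell()
        # close both files so the tar is flushed to disk and no handle is left open
        self.out_file.close()
        self.ab_file.close()
        print(
            f"Done. Filename is '{self.out_file.name}', {bytes_written} bytes written."
        )
        return True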

abe = AndroidBackupExtractor("tmp/whatsapp.ab", "password")
res: bool = abe.ab_to_tar()

YuvrajRaghuvanshiS commented 1 year ago

change

def clean_tmp():
    custom_print('>>> I am in view_extract.clean_tmp()', is_print=False)
    if(os.path.isdir(tmp)):
        custom_print(f'Cleaning up \"{tmp}\" folder...', 'yellow')
        shutil.rmtree(tmp)

to

def clean_tmp():
    pass

in view_extract so that it does not delete the .ab file.
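If disabling the cleanup entirely feels too blunt, a middle ground is to clean the folder but leave the backup in place. This is only a sketch: clean_tmp_keep and its keep parameter are hypothetical names, not functions in this repo, and the default file name is taken from the logs above.

```python
import os
import shutil


def clean_tmp_keep(tmp_dir, keep=("whatsapp.ab",)):
    """Delete everything inside tmp_dir except the entries named in keep.

    Returns the sorted names of the entries that were removed.
    """
    removed = []
    if not os.path.isdir(tmp_dir):
        return removed
    for entry in os.listdir(tmp_dir):
        if entry in keep:
            continue  # leave the backup file untouched
        path = os.path.join(tmp_dir, entry)
        if os.path.isdir(path):
            shutil.rmtree(path)
        else:
            os.remove(path)
        removed.append(entry)
    return sorted(removed)
```

This keeps the .ab around for a second conversion attempt while still clearing the partially extracted files.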

Frankprog03 commented 1 year ago

@YuvrajRaghuvanshiS wow! I will try it as soon as I can. Thanks.

Frankprog03 commented 1 year ago

@YuvrajRaghuvanshiS Ok, you solved my problem but it is weirder than expected. I changed the clean_tmp() function to

def clean_tmp():
    pass

and ran the script, planning to convert the .ab with the script you provided. But the main script itself ran with no problems at all. It seems clean_tmp() somehow conflicted with the conversion, but I don't know how. I don't care anymore, though :)

Now I successfully got all the expected files in the extracted folder. Thanks.

YuvrajRaghuvanshiS commented 1 year ago

It seems the script tried to clean the tmp folder before the unpacking (.ab -> .tar) had actually finished, and in doing so it snatched the .ab file away from abe.jar, hence the size difference. It works, though I also don't know exactly why. Anyway, I am happy it all worked out for you.
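If that guess is right, the general remedy would be to block until the unpacking process has exited before touching tmp, and to retry the delete if a file is still locked. A rough sketch with hypothetical helper names (run_and_wait, remove_with_retry); how the repo actually launches abe.jar is not shown in this thread, so the command here is just a placeholder. On Windows a "file in use" error like WinError 32 generally surfaces as PermissionError.

```python
import os
import subprocess
import time


def run_and_wait(cmd):
    """Run cmd and return its exit code only after the child process has exited."""
    return subprocess.run(cmd, check=True).returncode


def remove_with_retry(path, attempts=5, delay=0.2):
    """Try to delete path, retrying while another process still holds it open."""
    for _ in range(attempts):
        try:
            os.remove(path)
            return True
        except FileNotFoundError:
            return True  # already gone
        except PermissionError:
            time.sleep(delay)  # file still in use; wait and try again
    return False


# Example usage (the command line is a placeholder, not the repo's real call):
# run_and_wait(["java", "-jar", "abe.jar", "unpack", "tmp/whatsapp.ab", "tmp/whatsapp.tar"])
# remove_with_retry("tmp/whatsapp.ab")
```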

UsamaAshfaq commented 1 year ago

@Frankprog03 Can you please make a fork of this repo with the modified code, so that we non-programmers can also benefit from it?

Thank you.

Frankprog03 commented 1 year ago

@UsamaAshfaq Here it is: https://github.com/Frankprog03/WhatsApp-Key-Database-Extractor/tree/master. I think this is not, and shouldn't be, a permanent fix, just a "patch".