Ravencentric / archivefile

Unified interface for tar, zip, sevenzip, and rar files
https://archivefile.ravencentric.cc/
The Unlicense

Solid 7z file: archive.read_bytes(am.name)[12:21] == b'somebyte' raises py7zr.exceptions.CrcError #2

Closed (kokutoukiritsugu closed this 1 month ago)

kokutoukiritsugu commented 1 month ago
Traceback (most recent call last):
  File "D:\cdg\cdg\cdg_search.py", line 78, in check_file_is_enced1
    if archive.read_bytes(am.name)[12:21] == b'somebyte':
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\validate_call_decorator.py", line 60, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\_internal\_validate_call.py", line 96, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\archivefile\_core.py", line 685, in read_bytes
    data = self.extract(member, destination=tmpdir.name).read_bytes()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\validate_call_decorator.py", line 60, in wrapper_function
    return validate_call_wrapper(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\pydantic\_internal\_validate_call.py", line 96, in __call__
    res = self.__pydantic_validator__.validate_python(pydantic_core.ArgsKwargs(args, kwargs))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\cdg\cdg\Python312\Lib\site-packages\archivefile\_core.py", line 507, in extract
    self._handler.extract(path=destination, targets=[member])
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1012, in extract
    self._extract(path, targets, return_dict=False, recursive=recursive)
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 629, in _extract
    self.worker.extract(
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1253, in extract
    self.extract_single(
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1341, in extract_single
    raise e
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1338, in extract_single
    self._extract_single(fp, files, path, src_end, q, skip_notarget)
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1375, in _extract_single
    self._check(fp, just_check, src_end)
  File "D:\cdg\cdg\Python312\Lib\site-packages\py7zr\py7zr.py", line 1432, in _check
    raise CrcError(crc32, f.crc32, f.filename)
py7zr.exceptions.CrcError: (2426214743, 2390100170, 'asdf.pdf')

Solid 7z files raise this error. Non-solid 7z files work without problems.
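
For context, the CrcError at the bottom of the traceback carries two CRC32 values and a file name, which suggests the extracted bytes did not match a stored checksum. A minimal sketch of that kind of check, using only the standard library, might look like the following. The names verify_crc32 and CrcMismatch are hypothetical and chosen for illustration; this is not py7zr's actual implementation.

```python
import zlib


class CrcMismatch(Exception):
    """Hypothetical stand-in for an archive library's CRC error."""


def verify_crc32(data: bytes, expected_crc: int, filename: str) -> None:
    """Raise CrcMismatch if the CRC32 of `data` differs from `expected_crc`."""
    actual_crc = zlib.crc32(data) & 0xFFFFFFFF
    if actual_crc != expected_crc:
        raise CrcMismatch(actual_crc, expected_crc, filename)


payload = b"example payload"
stored_crc = zlib.crc32(payload)

# Intact data passes silently.
verify_crc32(payload, stored_crc, "asdf.pdf")

# Data with a single flipped bit is rejected.
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]
try:
    verify_crc32(corrupted, stored_crc, "asdf.pdf")
    mismatch_detected = False
except CrcMismatch:
    mismatch_detected = True
```

A check like this fails whenever the bytes handed to it are not exactly the bytes that were archived, for example if decoding starts at the wrong position in a shared compressed stream.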

Ravencentric commented 1 month ago

Is it an archivefile issue or a py7zr issue? Can you check whether you get the same error when using py7zr on its own?

kokutoukiritsugu commented 1 month ago

py7zr on its own has no problem:

            with py7zr.SevenZipFile(full_file_path) as archive1:
                for fname, bio in archive1.readall().items():
                    print(f'{fname}: {bio.read(21)}')

asdf_10.pdf: b'b\x14#eb\x00\x9e\x01\x00\x00\x00\x01somebyte'
asdf_3.pdf: b'b\x14#eW\x00\xa9\x01\x00\x00\x00\x01somebyte'
asdf_4.pdf: b'b\x14#el\x00\x94\x01\x00\x00\x00\x01somebyte'
asdf_5.pdf: b'b\x14#ek\x00\x95\x01\x00\x00\x00\x01somebyte'
asdf_6.pdf: b'b\x14#eh\x00\x98\x01\x00\x00\x00\x01somebyte'
asdf_7.pdf: b'b\x14#e_\x00\xa1\x01\x00\x00\x00\x01somebyte'
asdf_8.pdf: b'b\x14#eq\x00\x8f\x01\x00\x00\x00\x01somebyte'
asdf_9.pdf: b'b\x14#ep\x00\x90\x01\x00\x00\x00\x01somebyte'

Ravencentric commented 1 month ago

If you can give me steps to reproduce this, I can probably look into fixing it.

kokutoukiritsugu commented 1 month ago

Just use 7-Zip to compress some files with the solid option checked.

Then call read_bytes(am.name)[12:21] on a member.

Ravencentric commented 1 month ago

I've added solid 7z files to test_data (https://github.com/Ravencentric/archivefile/commit/8cccb956e9b1a10250e5ef74be26cf6523bf7c63) and added read tests (https://github.com/Ravencentric/archivefile/commit/2e56f58cb932628c2335ce797dad7370a2d2445a). As you can see, the tests pass without issues and I cannot reproduce this on my end. Unless you give me concrete reproduction steps, I cannot help any further.

kokutoukiritsugu commented 1 month ago

The problematic 7z file is inside this zip file: 3月1_2.zip

Ravencentric commented 1 month ago

I'll take another look

Ravencentric commented 1 month ago

@kokutoukiritsugu I fixed it in https://github.com/Ravencentric/archivefile/pull/3. It would be nice if you could test it and let me know before I merge and release.

kokutoukiritsugu commented 1 month ago

It works now, but it is slow when the 7z contains a lot of files:

            with archivefile.ArchiveFile(apb, 'r') as archive:
                for name in archive.get_names():
                    if archive.get_member(name).is_file:
                        check_archive_enc1(apb, name, archive.read_bytes(name)[12:21])

vs

            with py7zr.SevenZipFile(apb) as archive:
                for name, bio in archive.read().items():
                    if not name.endswith("/"):
                        check_archive_enc1(apb, name, bio.read(21)[12:21])

Ravencentric commented 1 month ago

You can use for member in archive.get_members() there. Being slower than the dedicated library is expected, because archivefile is a wrapper after all, but if you can time it, that would give me an idea of how slow it actually is.
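
A minimal way to time the two approaches side by side is sketched below. time_call, read_via_wrapper, and read_via_py7zr are hypothetical names introduced for illustration, and the two reader functions are placeholders that do trivial work so the snippet runs without any archive on disk. The real archivefile and py7zr loops from this thread would go inside them.

```python
from time import perf_counter
from typing import Callable


def time_call(func: Callable[[], object]) -> float:
    """Return the wall-clock seconds taken by one call to `func`."""
    start = perf_counter()
    func()
    return perf_counter() - start


def read_via_wrapper() -> None:
    # Placeholder: put the archivefile loop here.
    sum(range(1000))


def read_via_py7zr() -> None:
    # Placeholder: put the py7zr loop here.
    sum(range(1000))


wrapper_seconds = time_call(read_via_wrapper)
direct_seconds = time_call(read_via_py7zr)
print(f"wrapper: {wrapper_seconds:.4f}s, direct: {direct_seconds:.4f}s")
```

Timing both loops against the same archive in the same process keeps the comparison fair, although a single run can still be noisy.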

kokutoukiritsugu commented 1 month ago

I tried it using for member in archive.get_members().

6.49850606918335 seconds (archivefile) vs 0.5358200073242188 seconds (py7zr)

Yes, read_bytes is not well suited to a wrapper.

Ravencentric commented 1 month ago

That's slower than I expected. Anyway, that's something I'll look into, but it's not really an immediate goal. I'll close this issue when I merge #3.
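
The thread does not say why per-member reads were this slow. One general property of solid archives that may contribute is that members share a single compressed stream, so reaching one member means decompressing everything stored before it. The sketch below simulates that with zlib from the standard library. It is not 7z's LZMA, and it is not archivefile's or py7zr's code; all names in it are hypothetical.

```python
import zlib

# Eight small "members" stored back to back in one compressed stream,
# which is roughly what a solid archive does.
members = {f"asdf_{i}.pdf": bytes([i]) * 1000 for i in range(8)}
solid_stream = zlib.compress(b"".join(members.values()))

# Byte range of each member inside the uncompressed stream.
offsets = {}
position = 0
for name, data in members.items():
    offsets[name] = (position, position + len(data))
    position += len(data)


def read_one(name: str) -> tuple[bytes, int]:
    """Read one member; return its data and the bytes decompressed to reach it."""
    start, end = offsets[name]
    decompressor = zlib.decompressobj()
    # Everything up to the end of the member has to be decompressed first.
    prefix = decompressor.decompress(solid_stream, end)
    return prefix[start:end], len(prefix)


# Per-member access: each call restarts from the beginning of the stream.
per_member_cost = sum(read_one(name)[1] for name in members)

# Single pass: decompress once and slice every member out of the result.
everything = zlib.decompress(solid_stream)
single_pass_cost = len(everything)

print(per_member_cost, single_pass_cost)
```

In this simulation the per-member path decompresses 36000 bytes in total while the single pass decompresses 8000, and the gap grows quadratically with the number of members. Whether this accounts for the timings reported above is an assumption.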

kokutoukiritsugu commented 1 month ago

OK, thanks very much!

Ravencentric commented 1 month ago

#4 is pretty much a complete rewrite, which does end up speeding things up a bit:

from time import perf_counter

import archivefile
import py7zr

file = "3月1.7z"

# Time reading every file member through archivefile.
start = perf_counter()
with archivefile.ArchiveFile(file) as archive:
    for member in archive.get_members():
        if member.is_file:
            archive.read_bytes(member)
print(f"ArchiveFile: {perf_counter() - start}")

# Time the same work using py7zr directly.
start = perf_counter()
with py7zr.SevenZipFile(file) as archive:
    for name, bio in archive.read().items():
        if not name.endswith("/"):
            bio.read()
print(f"SevenZipFile: {perf_counter() - start}")

ArchiveFile: 0.013419300004898105
SevenZipFile: 0.007313699999940582

Although it will never beat the underlying library, for obvious reasons, I think I'm happy with the minor improvements.