`dbscan()` in lib389 can return bytes

vashirov commented 1 year ago

Issue Description

dbscan() in lib389 extracts information from the database file. Most of the time, the output of the dbscan executable is valid text. But when attribute encryption or changelog encryption is enabled, the database can contain values that can't be decoded as a string in Python.

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dirsrvtests/tests/suites/replication/encryption_cl5_test.py:65: in _check_unhashed_userpw_encrypted
    dbscanOut = inst.dbscan(DEFAULT_BENAME, 'replication_changelog')
/usr/local/lib/python3.9/site-packages/lib389/__init__.py:3072: in dbscan
    result = subprocess.run(cmd, text=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
/usr/lib64/python3.9/subprocess.py:507: in run
    stdout, stderr = process.communicate(input, timeout=timeout)
/usr/lib64/python3.9/subprocess.py:1121: in communicate
    stdout = self.stdout.read()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <encodings.utf_8.IncrementalDecoder object at 0x7f510ffabb80>
input = b"\ndbid: 0000006f000000000000\n\tentry count: 11\n\ndbid: 000000de000000000000\n\tpurge ruv:\n\t\t{replicageneration}...94\xf7\x9f\xa5\xf4\xfb\xd5\xb49\x87W\n\t\tunhashed#user#password: \xa1\x98\xcf\xea\xb8F\xa8\xc9FHe\x8f\x0b\\\xfa\xd7\n"
final = True

    def decode(self, input, final=False):
        # decode input (taking the buffer into account)
        data = self.buffer + input
>       (result, consumed) = self._buffer_decode(data, self.errors, final)
E       UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6019: invalid start byte

/usr/lib64/python3.9/codecs.py:322: UnicodeDecodeError

By default subprocess output is considered bytes: https://docs.python.org/3/library/subprocess.html#subprocess.CompletedProcess.stdout

> stdout: Captured stdout from the child process. A bytes sequence, or a string if run() was called with an encoding, errors, or text=True. None if stdout was not captured.

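But we explicitly use text=True, which tells subprocess to decode the output into a string, so any byte sequence that is not valid in the locale encoding (UTF-8 here) raises the UnicodeDecodeError shown above: https://github.com/389ds/389-ds-base/blob/96959cf7b67be8b544efa25b6ad813c0034841b7/src/lib389/lib389/__init__.py#L3072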
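A minimal sketch of the difference, not the lib389 code itself. It assumes a stand-in child process (the Python interpreter writing two non-UTF-8 bytes) in place of the real dbscan binary, and it passes `encoding="utf-8"` rather than `text=True` so the behaviour does not depend on the locale of the machine running it:

```python
import subprocess
import sys

# Stand-in for dbscan (assumption): a child that writes non-UTF-8 bytes to stdout.
child = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(b'unhashed#user#password: \\xa1\\x98\\n')",
]

# Without text/encoding, stdout is captured as raw bytes and nothing is decoded.
raw = subprocess.run(child, stdout=subprocess.PIPE, stderr=subprocess.STDOUT).stdout

# With an encoding (text mode), subprocess decodes while reading the pipe,
# and the invalid bytes make communicate() raise.
try:
    subprocess.run(child, encoding="utf-8",
                   stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
```

The first call succeeds and leaves the encrypted bytes untouched; the second fails in the same place as the traceback above.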

I think we should change dbscan() to always return bytes.
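If dbscan() returned bytes, callers would need to compare against bytes and decode leniently only where text is needed (for logging, say). A rough sketch of what that could look like on the caller side; `contains_cleartext` and `to_printable` are hypothetical helper names, not existing lib389 functions, and the sample is a shortened form of the output in the traceback:

```python
def contains_cleartext(dbscan_out: bytes, secret: str) -> bool:
    """Hypothetical helper: check whether a clear-text value appears in raw output."""
    # Encode the needle instead of decoding the haystack, so encrypted
    # values in the output can never trigger a UnicodeDecodeError.
    return secret.encode("utf-8") in dbscan_out


def to_printable(dbscan_out: bytes) -> str:
    """Hypothetical helper: lossless, log-friendly text form of raw output."""
    # Undecodable bytes are rendered as \xNN escapes instead of raising.
    return dbscan_out.decode("utf-8", errors="backslashreplace")


# Shortened sample based on the traceback in this issue.
sample = b"\t\tunhashed#user#password: \xa1\x98\xcf\xea\n"

leaked = contains_cleartext(sample, "Secret123")
printable = to_printable(sample)
```

Encoding the expected value keeps the comparison exact, while `errors="backslashreplace"` is only used for display.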

vashirov commented 1 year ago

b1bdf5021..4ba619075 389-ds-base-2.1 -> 389-ds-base-2.1
c4a0abf6c..d2af71cf1 389-ds-base-2.2 -> 389-ds-base-2.2
7c7afb78f..f01a61332 389-ds-base-2.3 -> 389-ds-base-2.3

progier389 commented 1 year ago

Seeing a regression in the nightly CI tests: a str() conversion is missing in is_dbi in the import test. I will create a new PR to fix it.
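For context, this is the kind of failure a caller hits once the output is bytes but is still compared against str. The sketch below is only an illustration: `has_dbi_name` is a hypothetical helper, not the actual is_dbi code, the sample output is made up, and it decodes explicitly instead of calling str():

```python
def has_dbi_name(dbscan_out: bytes, name: str) -> bool:
    """Hypothetical helper: look for a database name in bytes output."""
    # Convert to text first; errors="replace" keeps undecodable bytes
    # from raising and does not affect ASCII names.
    return name in dbscan_out.decode("utf-8", errors="replace")


# Made-up output for illustration only.
out = b"userroot/id2entry.db\nuserroot/cn.db\n\xa1\x98\n"

# Mixing str and bytes directly is what breaks after the change.
try:
    "id2entry" in out
    mixed_ok = True
except TypeError:
    mixed_ok = False

found = has_dbi_name(out, "id2entry")
```

The direct `str in bytes` test raises TypeError, while the converted form works.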

progier389 commented 1 year ago

b62bd43e8..2dab9224d main -> main
434d63e84..ed0093d02 389-ds-base-2.3 -> 389-ds-base-2.3
94144bb5c..5cc25029e 389-ds-base-2.2 -> 389-ds-base-2.2
fea4a6f61..c714ed8a7 389-ds-base-2.1 -> 389-ds-base-2.1

`dbscan()` in lib389 can return bytes #5872