FirebirdSQL / fdb

Firebird Driver for Python
https://www.firebirdsql.org/en/devel-python-driver/
Other
61 stars 26 forks source link

OCTETS support for database connection #12

Open savingorup opened 4 years ago

savingorup commented 4 years ago

When connecting with fdb + python3 to legacy databases with mixed/non-proper encodings, you receive a lot of encoding/decoding errors. If there is no option to actually correct this databases, Firebird offers OCTETS encoding to bypass encodings entierly. However, calling fdb.connect(..., charset='OCTETS') produces multiple errors. In fact, this never worked as intended.

The first fix is to change in ibase.py None->'latin-1' charset_map = { 'OCTETS' : 'latin-1', # Allow to pass through unchanged. 'latin-1' is identity encoding in python (as 'octets' is in Firebird).

The second fix is in fbcore.py in build_dpb(): ... if charset and charset!='OCTETS': # added ... and charset!='OCTETS' dpb.add_string_parameter(isc_dpb_lc_ctype, charset.upper()) ... Firebird does not allow to pass OCTETS ad LC_TYPE parameter, so we skip it.

With this two changes, I can pass charset='OCTETS' and do selects/inserts, including BLOBS. If you are interested to merge, I can provide patch against latest version. I am not used to GitHub, so I will probably need a little help with it.

dyemanov commented 4 years ago

To bypass encoding entirely, Firebird supports NONE as a connection charset (rather than OCTETS). Specifying OCTETS at the connection level and then skipping it when passing to the engine (which actually means NONE charset is used) seems weird to me.

savingorup commented 4 years ago

When passing 'NONE' as a connection charset, fdb uses as a python codec the one returned by getpreferredencoding() function (see charset_map definition). This depends on current system configuration and is never 'latin-1' (which maps all bytes 0-255 to the same characters, identity encoding). This in turn produces decoding errors when reading fields, which have string encoded in forms, which getpreferredencoding() can't decode. I assumed that 'NONE' could be actually used by someone, so I opted to change 'OCTETS' which never worked before to keep things backward compatible.

mrotteveel commented 4 years ago

If this is possible to do (and I don't know if it is), I would strongly recommend against using the name OCTETS in a non-standard manner, that would overload meaning of this.

I would rather recommend using a separate property to enable this behaviour. For example, in Jaybird, there are two properties controlling character sets, one using Firebird names and one using Java character set names. When only one is specified, either will configure the connection character set, but specifying both properties then uses one to specify the Firebird connection encoding, and the other will define the Java character set to read/write bytes from/to Firebird.

savingorup commented 4 years ago

This issue is specific to fdb python interface. There is no standard manner of OCTETS use in fdb currently as using it in db = fdb.connect(user='SYSDBA', password='secret', dsn='127.0.0.1:/tmp/test.fdb', charset='OCTETS') simply does not work (throws exception), and AFAIK never did work. Python3 enforces bytes-to-string conversion using known encoding, and python encoding must follow the encoding given in charset=... above. Currently, fdb has no option to set python and FB encodings separatelly (looking at the code, it may be possible internally, but no such interface is exposed). It may be helpfull to see fdb contributors stance on the issue. Anyway, it is not a big deal if such patch does not make it upstream. For us it works and it is surely not hard to maintain. I think it may be helpful for somebody else too.

pcisar commented 4 years ago

The correct method (patch) should use the None/"NONE" connection charset in FDB with its redefinition in ibase.py/charset_map from 'getpreferredencoding()' to 'latin-1', instead hijacking OCTETS.

savingorup commented 4 years ago

I totally agree. However, such a change would be backwards incompatible and could break existing programs, so I searched for alternatives. I can create a patch or pull request with change proposed by pcisar (and test if it works) if there is interest to incorporate it in HEAD.

pcisar commented 4 years ago

Well, I was not aware of 'latin-1' meaning at the time I was creating FDB, so using getpreferredencoding() was the best from bad solutions I was able to came with. So you don't need to do a PR, I'll update the driver myself.

mrotteveel commented 4 years ago

@savingorup This problem is wider than FDB, because NONE has a special meaning in Firebird, and OCTETS cannot be specified as a connection character set in general. Allowing FDB to overload this, makes the behaviour more surprising to people familiar with Firebird on other platforms.

@pcisar Redefining NONE to always map to latin-1 would break expectations of Firebird users that using connection character set none will use the default encoding of a system. Doesn't Python have an option to convert invalid byte combinations to either a ? or the unicode replacement character (, U+FFFD)?

Making this change will also break things for users unknowingly relying on the current encoding behaviour, receiving garbage output (or sending garbage as input to the database). Consider carefully if you really want to make this change. I'd sooner recommend defining a special character set name that maps to NONE + latin-1.

savingorup commented 4 years ago

I agree with Mark here. Changing existing behaviour is bad for users. I proposed 'OCTETS' as it was already there, but maybe defining a new character set is better. However, this new character set will not be supported anywhere except in fdb and will again be a bit surprising to people familiar with Firebird on other platforms

Maybe even better, I propose new parameter python_encoding for fdb.connect() function, such as:

db = fdb.connect(user='SYSDBA', password='secret', dsn='127.0.0.1:/tmp/test.fdb', charset='NONE', python_encoding='latin-1')

Another complication here is that actual encoding/decoding functions are called only in case of python3, not when using python2. This should be made clear to users, maybe using a bit different parameter name:

db = fdb.connect(user='SYSDBA', password='secret', dsn='127.0.0.1:/tmp/test.fdb', charset='NONE', python3_encoding='latin-1')

As I understand, something like this is also implemented in Jaybird. Such a change is not trivial anymore, but I can work on it. What do you think?

On Fri, Jan 24, 2020 at 1:40 PM Mark Rotteveel notifications@github.com wrote:

@savingorup https://github.com/savingorup This problem is wider than FDB, because NONE has a special meaning in Firebird, and OCTETS cannot be specified as a connection character set in general. Allowing FDB to overload this, makes the behaviour more surprising to people familiar with Firebird on other platforms.

@pcisar https://github.com/pcisar Redefining NONE to always map to latin-1 would break expectations of Firebird users that using connection character set none will use the default encoding of a system. Doesn't Python have an option to convert invalid byte combinations to either a ? or the unicode replacement character (�, U+FFFD)?

Making this change will also break things for users unknowingly relying on the current encoding behaviour, receiving garbage output (or sending garbage as input to the database). Consider carefully if you really want to make this change. I'd sooner recommend defining a special character set name that maps to NONE + latin-1.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FirebirdSQL/fdb/issues/12?email_source=notifications&email_token=AGCMHSRJMJQLRPSTW2VRZ2DQ7LOTNA5CNFSM4KKMY3M2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEJ2VKXI#issuecomment-578114909, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGCMHSUX3HDTYXFJHES3DQLQ7LOTNANCNFSM4KKMY3MQ .

pcisar commented 4 years ago

@mrotteveel I will change the mapping for 'NONE' charset, not for None value (when connection charset is not specified at all). This should not break any application that does not use connection charset, and will fix the expected behavior of NONE for applications that explicitly define 'NONE'.

@savingorup Automatic conversion is not dependent on Python version, but actual type passed. It's applied for P3.str or P2.unicode, and do not for P3.bytes or P2.str.