FirebirdSQL / python3-driver

Firebird driver for Python that uses new Firebird API
https://www.firebirdsql.org/en/devel-python-driver/
MIT License

Problem with firebird stream blobs #31

Open bryancole opened 8 months ago

bryancole commented 8 months ago

I'm trying to use the new v1.10 driver with SqlAlchemy v1.4 to load a BLOB column from my database. The problem is that SqlAlchemy iterates over the connection Cursor to obtain all rows, and then passes these rows to a "processor" object to do type conversions. The DB cursor returns BlobReader objects for each BLOB column. The BlobReader objects are file-like objects that are read later in the processor object, after the cursor iteration has finished. This used to work OK with the old FDB driver. The problem with the new driver is that the cursor object closes all BlobReader objects after the initial iteration over the cursor, so any later attempt to read the blob data fails.
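For illustration, here is a minimal sketch of the failing pattern. The DSN, credentials, and the `MEMO`/`NOTES` names are placeholders, not my actual schema:

```python
from firebird.driver import connect

# Placeholder connection details.
with connect('employee', user='SYSDBA', password='masterkey') as con:
    cur = con.cursor()
    cur.stream_blobs.append('MEMO')    # ask for this column as a stream BLOB (BlobReader)
    cur.execute('select memo from notes')
    rows = list(cur)                   # SQLAlchemy-style: iterate the cursor first...
    cur.close()                        # ...then the cursor gets closed...
    reader = rows[0][0]                # ...and only then is the BlobReader touched
    data = reader.read()               # fails: the reader was closed together with the cursor
```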

I can get things to work if I remove the lines in the Cursor._clear() method (see https://github.com/FirebirdSQL/python3-driver/blob/1aaa016468e7ef3eb72616340c3682a706c84b36/src/firebird/driver/core.py#L3642-L3643 ).

I can't figure out what the "correct" DB-API-compliant behaviour should be, as the DB-API doesn't describe any special handling for BLOBs or BlobReader objects. I don't really see why it should be necessary to close the BlobReaders after iteration over the cursor; it's all still within the context of the transaction.

A further related enhancement would be to add a __bytes__(self) method to the BlobReader object. Since the basic DB-API assumes the database will return simple types, a typical way to handle a BLOB would be to pass the value to the Python bytes constructor. This will try to iterate over the BlobReader using repeated calls to BlobReader.readline(), which also fails. Even if it succeeded, this would be an inefficient way to convert a BLOB to bytes (unless it really is text split into lines). Better to just call BlobReader.read() and get the entire thing as a bytes object. This is easily implemented in a __bytes__() method.
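Roughly what I have in mind (a hypothetical sketch, not anything that exists in the driver today):

```python
class BlobReader:   # sketch only: the real class lives in firebird.driver.core
    ...

    def __bytes__(self) -> bytes:
        # Let bytes(reader) return the whole BLOB with a single read()
        # instead of line-by-line iteration via readline().
        return self.read()
```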

I could submit patches for these but I'm holding off to see what you think.

pcisar commented 8 months ago
  1. BlobReaders are not closed when cursor iteration is finished, but when the cursor is closed. That's because the underlying Firebird structures are AFAIK related. So, until you close the cursor, all returned BlobReader instances are valid and can be used to read BLOB values, even repeatedly. Cursors are closed either explicitly via Cursor.close(), or implicitly when: a) a new statement is executed by the cursor, b) the cursor is used in a context manager and its code block ends, c) the transaction ends (commit/rollback), or d) the connection is closed.
  2. The best method to read a BLOB value via BlobReader (read/readline/readlines/iteration) depends on the BLOB sub-type (text or binary). You can use the BlobReader.is_text() method to determine this. However, this method is valid only for the TEXT/BINARY BLOB sub-types. There are other sub-types (including user-defined ones) that could be either text or binary. In such cases you need to know these sub-types in your application and use the proper read method according to the BlobReader.sub_type value (see the sketch after this list).
  3. About introducing the __bytes__ method to BlobReader: I will not do that. BLOB values can be quite big (even gigabytes), and reading the whole value into memory is a perfect way to shoot yourself in the foot. It also defeats the whole purpose of BlobReader, which is there to avoid such disasters (there is also a safety threshold, see Cursor.stream_blob_threshold in the documentation). Anyway, bytes() would work only for binary BLOBs, not for text ones. Those are returned as Python str, not bytes (see the driver documentation for automatic string conversions and character set handling).
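To illustrate points 1 and 2, a minimal sketch of reading streamed BLOBs while the cursor is still open. It assumes `con` is an open firebird.driver connection and uses placeholder `MEMO`/`NOTES` names:

```python
with con.cursor() as cur:
    cur.stream_blobs.append('MEMO')        # request this column as BlobReader
    cur.execute('select memo from notes')
    for (blob,) in cur:
        if hasattr(blob, 'read'):          # got a BlobReader, not a materialized value
            if blob.is_text():
                content = ''.join(blob)    # text BLOB: line-based iteration is fine
            else:
                content = blob.read()      # binary BLOB: read it directly
        else:
            content = blob                 # small BLOB, already materialized (str or bytes)
# once the cursor's block ends, any BlobReader obtained from it is closed as well
```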

Note that iteration over BlobReader uses readline, so it works only with text BLOBs. I considered using read() for binary BLOBs, but rejected the idea, as there is no easy way to determine the amount of data to be returned. Of course, I could define a BlobReader attribute (for example fetch_size) with a default of 1 that could be used for that, but it would only clutter the BlobReader instances and does not provide much convenience or code readability over direct use of read() in the application. But I'm open to discuss that.

Also note that a BLOB value does not need to be returned as a BlobReader. Normally BLOBs are returned as materialized values (either str or bytes according to the BLOB sub-type). A BlobReader is returned only when explicitly requested via the Cursor.stream_blobs list, or when the size of the value to be returned exceeds Cursor.stream_blob_threshold (default is 65536 bytes, i.e. 64K). The application needs to be prepared for that. Also note that the stream_blob_threshold value can be configured at the driver level, see the DriverConfig.stream_blob_threshold configuration option.
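In code, explicitly requesting stream BLOBs looks roughly like this (a sketch assuming an existing cursor `cur` and a placeholder column name):

```python
# Force the MEMO column to always come back as a BlobReader,
# regardless of the actual value size.
cur.stream_blobs.append('MEMO')
cur.execute('select memo from notes')
# The per-cursor size threshold is available as cur.stream_blob_threshold (64K by default);
# the same limit can also be configured driver-wide via DriverConfig.stream_blob_threshold.
```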

bryancole commented 8 months ago

Thanks for the feedback.

Of course you are right, sqlalchemy is closing the DB-API cursor immediately after iterating over it, and before accessing the data in the row-object. Sadly, there does not seem to be any way to avoid this behaviour in the sqlalchemy core except by modifying the sqlalchemy code directly. I will take this question to the sqlalchemy issue-list.

Ultimately, my problem of reading large-ish BLOBs using sqlalchemy is solved if I set stream_blob_threshold to a very large value. For BLOBs smaller than stream_blob_threshold, the Cursor.stream_blobs list overrides the default materialised blob behaviour (as you point out). I think it would be similarly useful to have an override to force large blobs (> stream_blob_threshold) to be materialised. There is a clear use case for making large blobs materialised, in cases where the calling library expects the returned data to be a simple type. E.g. sqlalchemy doesn't know anything about the existence of BlobReader objects, so how can it know that it must not close the cursor before accessing the returned object?
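For the record, the workaround I have in mind is just raising the threshold so that even large BLOBs come back materialised; a sketch, with an arbitrary limit chosen for illustration:

```python
# Assumes cur is a firebird.driver cursor; 256 MiB is an arbitrary illustrative limit.
cur.stream_blob_threshold = 256 * 1024 * 1024   # materialize anything up to this size
cur.execute('select memo from notes')           # BLOB columns now arrive as str/bytes
```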

I am surprised to discover that it is not possible to pass a Python file-object to the bytes constructor to read a binary file into a bytes object. I really expected this to work. I was wrong. Since the BlobReader is expected to behave the same as a file-object, I think I agree that adding a __bytes__ method would be adding non-file-like behaviour. However, it is possible to iterate over a binary file-object: it yields b'\n'-delimited chunks (the whole file at once if it contains no newline byte) and then iteration ends. By contrast, the BlobReader raises an exception: firebird.driver.types.InterfaceError: Can't read line from binary BLOB. My opinion is that iteration over a binary stream should return chunks. The chunk size could default to the system PAGESIZE, for example. The chunk size is rarely important except to be not-too-small and not-too-big.
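For what it's worth, much the same behaviour can be had today with a small helper generator rather than a change to BlobReader itself; a sketch, assuming BlobReader.read() accepts a size argument like an ordinary file object, and with the chunk size an arbitrary choice:

```python
import io

def iter_chunks(reader, chunk_size=io.DEFAULT_BUFFER_SIZE):
    """Yield successive fixed-size chunks from a BlobReader (or any file-like object)."""
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        yield chunk

# e.g. collect a binary BLOB as bytes without relying on line-based iteration:
# data = b''.join(iter_chunks(blob_reader))
```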