chriso / gauged

A storage layer for numeric data that changes over time
MIT License
337 stars 11 forks source link

Python 3 support #6

Open wolfhechel opened 9 years ago

wolfhechel commented 9 years ago

I'm working implementing Python 3 support, in order to do this I have had to do some major changes to how tests are run and would like to discuss if you would even accept a PR for this? Otherwise I'll maintain a fork.

In order to run the tests properly I'm using tox as a multi-environment test runner, this will require some changes to the travis configuration in order to run.

I've also moved over to setuptools rather than distutils for packaging and testrunning.

Let me know if you're interested in adding this to the master branch or not.

chriso commented 9 years ago

Thanks for working on this, and yes I'd absolutely accept a PR. I've been meaning to move to tox so there's no problem there. I'm happy to accept any necessary changes to get this working under py3.

wolfhechel commented 9 years ago

I would like your review on a few issues before I make a PR, the first significant issue I encountered during the port was the issue between unicode/bytes in Python 3 vs. Python 2.

I've resolved most of these issues by trying to normalize all keys going in and out from the driver interfaces with to_bytes() which solved most issues with any inconsistencies, but there might need to be some normalization on the output from the drivers as well.

I've got the SQLite Driver to worker properly but I've hit some bumps on the road with MySQL and Postgres.

In MySQL the biggest issue is inconsistent keys; a key in bytes form becomes 'b'key'' with the %s format specifier. I have not yet fully traced the origin of the issue though.

wolfhechel commented 9 years ago

I have pushed my progress in a fork under the branch py3, the biggest changes (apart from Python 3 compatibility changes) are tox usage and moving to setuptools.

For the moment I have two different codebases for Python 2 and 3 tests, mostly to see that the changes I make to the actual codebase doesn't affect the Python 2 compatibility.

chriso commented 9 years ago

Looks good so far.

A couple of questions:

wolfhechel commented 9 years ago

I mainly duplicated the tests because of changes required to be done in the test cases in order to comply with Python 3, and because of that I wanted an unchanged test case to continue asserting that no changes made to the codebase created any regressions under Python 2.

ez_setup is a bootstrap python script to install setuptools if it's missing, bare in mind that users will not retrieve ez_setup nor setup.py since they aren't packaged into the build, ez_setup is just conventient to have for anyone who wants to start quickly to hack on the codebase.

Yes, that looks exactly like what we need. But I think it would be easier to just encode them into ascii or utf8 as a normalization set before formatting the query with the params.

wolfhechel commented 9 years ago

It might be so that the better approach is to normalize the key into an ascii string rather than a bytes-object though, since I only had SQLite to test with initially I made most changes to work properly with that driver, but since both MySQL and Postgres seems to have issues with it it might be better to try and normalize keys in that driver instead.

Also, the tests for the internal functions that asserts the drivers had to be converted to use bytes which I assume is correct since they data in blocks are stored as binaryblobs and I couldn't find any cases where the public functions would generate string-type data.

Shouldn't the test functions always pass in bytes then? Including the tests for Python 2?

chriso commented 9 years ago

Fair enough regarding tests, but once we can assert that no regressions have been introduced with py2 then I'd like to merge them; it's too hard maintaining two sets.

Where did you get the ez_setup.py file from? At the very least I'll need to bundle the corresponding license.

I'm happy to internally encode keys with utf-8 as necessary, however the MySQL column type for gauged_keys.key is VARCHAR(255) BINARY which is possibly why you're having trouble getting the keys to work. The block data is binary and should always be represented by an unencoded sequence of bytes, which is exactly what the python2 str type is.

wolfhechel commented 9 years ago

I've been swamped with work so I haven't had time to continue on this, I hope to be able to look at it this week.

Just to let you know I haven't forgotten about this or just abandoned the ship.

chriso commented 9 years ago

No problem, and thanks.

wolfhechel commented 9 years ago

Right so a short update, I had another look at it last night and got some new thinks to tackle.

Basically the problem revolves around that all database drivers deals with BLOBs differently in Python 3. SQLite for example wants a buffer() in Py2, while in Py3 it wants a bytes(). PostgreSQL deals with BLOBs somewhat concise between the both, however buffer() is deprecated in Python 3 and has been replaced with memoryview(). The API for both of them is somewhat different so it needs to be dealt with another layer of compatibility.

Then there's the key handling which in the schemes is a varchar. The Writer passes all keys through to_bytes() which in Python 3 returns a bytes() object, which is a bytea type in PostgreSQL.

Python-MySQLdb is not at all compatible with Python 3 so MySQL support has to be dropped for that, however there is a API compliant fork of Python-MySQLdb named mysqlclient which is pure-Python, which is also Python 3 compatible.

So, to enable MySQL support in Python 3 we should accept both mysqlclient connections as well as Python-MySQLdb.

I will redo a lot of work and start from the bytes incompatibility instead.

chriso commented 9 years ago

Thanks for the update. It's a bit of a nightmare migrating projects that rely on bytes :expressionless:

wolfhechel commented 9 years ago

Well, things would've been alot easier if SQLite had supported all buffer-types from the beginning. psycopg2 supports buffer and memoryview in all versions, SQLite has buffer in Py2 and bytes in Py3, which is a pita.

I was considering however to simply port over to using memoryview by default since it's the most compliant, and then marshal the bytes and keys in the drivers to appropriate formats instead. Since postgres has some issues with keys being cast to bytes() before trying to insert them into a VARCHAR column (giving variant characters != bytea issues).

Any thoughts on this approach? I do believe it holds some value for custom drivers implementation as well where the driver can itself decide how to deal with keys.

chriso commented 9 years ago

It sounds like the drivers are different enough that you'll need to push down logic from context.py into the drivers. I'm totally happy with that, and if memoryview is the way to go then I'm all for it.

wolfhechel commented 9 years ago

Alright, I'll see what I can do with that. The hardest part seems to be to cast a memoryview to a buffer for SQLite in Py2, but I think I could reuse parts of the code you use in the SparseMap object to obtain a memory address and recreate the buffer from the memory address.

For keys however, would it be of value to be able to have unicode characters in a key? One thing I cannot get rid of however is the fact that keys has to be compared as byte literals which is kind of correct since regular strings has been used in Py2.

chriso commented 9 years ago

Anything unicode should be explicitly encoded with utf-8 as soon as possible, well before the key reaches the driver, and then string comparisons should be done with bytes vs. bytes.

wolfhechel commented 9 years ago

And the returning keys from the context should be unicode encoded?

wolfhechel commented 9 years ago

Sorry, both buttons are far to close.

chriso commented 9 years ago

:smile:

chriso commented 9 years ago

Yeah we should probably return unicode keys then.

wolfhechel commented 9 years ago

Alright, so I had a go at it last night and came up with a solution, which unfortunately became really bulky and a complete hack to get it working. I had a discussion with a few friends on IRC as to why Python doesn't have a builtin way to access the memory pointer from a Old/New type buffer object and someone mentioned the fact that memoryviews as well as buffers are really just functions used to abstract away buffer-compatible objects into a byte-accessible array with zero copy overhead.

So what I would want to give a try is to rewrite SparseMap as well as FloatArray into the C extension and implement both the old and new buffer protocol, rather than casting the addresses in Python.

Do you have any objections to this? And what was the reason you chose to not deal with the C stuff in a C extension from the beginning?

chriso commented 9 years ago

I'd prefer not to go down that route since it's a major rework of the library. I stuck with ctypes over an extension so that it'd work with pypy.

sander76 commented 9 years ago

I can't make it up anymore from this discussing: is Py3 supported or not ?