Is the switch to CSV meant for distinguishing NULLs in arrays? I remember Python's csv reader is fundamentally flawed when it comes to distinguishing empty strings from NULLs (`,"",` vs `,,`) in arrays.
No, I'm not concerned about NULLs in arrays here. Switching to (a) Postgres' CSV format and (b) Python 3 means we can handle unicode with way less custom escaping/unescaping. @alldefector pointed me in the right direction here.
All the Postgres `TEXT` format encoding caveats plus all the Python 2-specific encoding caveats added up to a pile of brittle exception code (see: most of my commits over the past few days).
We won't need to distinguish between `,"",` and `,,` as long as we use an explicit null such as `'\\N'` or similar.
Just confirming it's a hard problem and pointing out that NULLs inside arrays are encoded differently from values outside, at least for `TEXT`; not sure about CSV. CSV is generally not friendly to standard UNIX tools and many scripts, including mkmimo, that rely on one line per row.
I now remember why we couldn't use CSV, or at least Python's `csv.reader`.
First, let's look at TSV for the values that give us headaches:
postgres=# COPY (SELECT 'NULL', NULL, E'\\N', '', ARRAY['NULL', NULL, E'\\N', '']::TEXT[]) TO STDOUT;
NULL \N \\N {"NULL",NULL,"\\\\N",""}
Although the TABs are invisible, you can tell each value has a distinct encoding in TSV that can be relatively easily parsed by repeatedly splitting and unescaping.
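To make "splitting and unescaping" concrete, here's a minimal sketch (not DeepDive's actual code) of how one line of Postgres TEXT-format COPY output could be parsed in Python; octal/hex escapes are omitted:
import re
# TABs and newlines inside values are escaped as \t and \n, so splitting on
# the raw TAB character is safe, and a bare \N marks SQL NULL.
PG_TEXT_ESCAPES = {
    '\\b': '\b', '\\f': '\f', '\\n': '\n', '\\r': '\r',
    '\\t': '\t', '\\v': '\v', '\\\\': '\\',
}
def parse_pg_text_line(line):
    fields = []
    for field in line.rstrip('\n').split('\t'):
        if field == '\\N':  # bare \N is NULL
            fields.append(None)
        else:
            fields.append(re.sub(r'\\[bfnrtv\\]',
                                 lambda m: PG_TEXT_ESCAPES[m.group(0)],
                                 field))
    return fields
# For the row above: the second field (bare \N) becomes None, the third (\\N)
# unescapes to the two-character string \N, and the array stays a string that
# still needs its own parsing.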
Now, let's add `CSV` to it and see how it looks:
postgres=# COPY (SELECT 'NULL', NULL, E'\\N', '', ARRAY['NULL', NULL, E'\\N', '']::TEXT[]) TO STDOUT CSV;
NULL,,\N,"","{""NULL"",NULL,""\\N"",""""}"
Seems not so bad, so we may try to use Python's `csv.reader`, but we immediately face the challenge of distinguishing the NULL in the second field (`,,`) from the empty string in the fourth (`,"",`). (This is a stupid limitation of Python's csv package. I've looked hard, but there's no alternative other than paying a huge performance overhead by parsing the CSV ourselves in Python.)
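A quick check (a throwaway sketch, not project code) shows the collapse: `csv.reader` returns the same value for the unquoted empty field and the quoted empty string.
import csv
# The CSV line from the COPY output above: field 2 is an unquoted empty
# field (NULL), field 4 is a quoted empty string.
line = r'NULL,,\N,"","{""NULL"",NULL,""\\N"",""""}"'
row = next(csv.reader([line]))
print(row[:4])            # ['NULL', '', '\\N', '']
print(row[1] == row[3])   # True: NULL and empty string are indistinguishable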
You may think it's possible to use `csv.reader` if we change NULL to use a different encoding, but without the ability to distinguish a quoted value from an unquoted one, it's impossible to avoid ambiguity no matter what you do. For example, you'll treat the user's literal `\N` as NULL if you mess with the `NULL AS ...` clause.
postgres=# COPY (SELECT 'NULL', NULL, E'\\N', '', ARRAY['NULL', NULL, E'\\N', '']::TEXT[]) TO STDOUT CSV NULL AS E'\\N';
NULL,\N,"\N",,"{""NULL"",NULL,""\\N"",""""}"
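The same kind of check (again just a sketch) on the `NULL AS E'\\N'` output shows the real NULL and the user's literal string colliding once quoting information is discarded:
import csv
line = r'NULL,\N,"\N",,"{""NULL"",NULL,""\\N"",""""}"'
row = next(csv.reader([line]))
print(row[1] == row[2])   # True: the real NULL and the literal string \N collide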
So we shall give up on `csv.reader` and either end up with a very slow pg2json piece in Python or switch to another language. (The good news is that at least the NULLs in arrays are consistently `NULL`.)
Moreover, I don't understand how unicode and its escape sequences got magically handled by switching to Python 3 and maybe also CSV, but if we're going to get this part absolutely right, I'd argue we have to explicitly handle all of Postgres' escape sequences, similar to this: https://github.com/HazyResearch/deepdive/blob/fa1a48fa52044ecac2bfdb538a1e18b80e239f2c/database/tsv2tsj#L32-L41
Personally, I'm inclined to completely hide Postgres' TSV format in the db-drivers and move all UDFs and other pieces completely away from it. JSON per line may be okay, but it still has a lot of overhead (repeated column names), so I was prototyping the TSJ format. PG TSV is a complete mess and the root of all this unnecessary complexity, and I believe we should solve this problem once and for all in the db-driver. I think the recent PRs fixing the JSON support are in the right direction, and we probably need to be more systematic.
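For a sense of the overhead difference, here is a toy comparison with a made-up row, assuming TSJ here simply means one JSON-encoded value per column, tab-separated (the actual prototype may differ):
import json
row = {"doc_id": 1, "sentence": "Hello", "tags": ["a", None, "b"]}
# JSON per line: every row repeats the column names.
print(json.dumps(row))
# TSJ-ish: values only; column names live in the schema. JSON string escaping
# keeps TABs/newlines out of the payload, so one line per row still holds.
print("\t".join(json.dumps(v) for v in row.values()))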
@shahin Can you add some tests that check proper unicode handling? It seems tweaking `tsv2tsj` to handle arrays can be a simple way of fixing `pg?sv_to_json` here.
@netj Yeah, I would also like to switch back to Travis's container-based infra. It is possible that bash 4.3 is all that's needed. Two questions:
- Why do we bundle bash with DeepDive? This seems like a very low-level dependency to include in our distribution. As a DeepDive user or new developer, I imagine I'd be very surprised to find out (eventually) that it's not using the same bash that I'm using on my host system. Seems like an explicit dependency on a given bash version would be clearer.
- As long as we're bundling bash, should we be using it for all tests? An exception for a single .bats seems easily overlooked, and now our tests are targeting different platforms. You know more about bats than I do -- what's the best way to run them all on bash 4.3?
@shahin
- Why do we bundle bash with DeepDive? This seems like a very low-level dependency to include in our distribution. As a DeepDive user or new developer, I imagine I'd be very surprised to find out (eventually) that it's not using the same bash that I'm using on my host system. Seems like an explicit dependency on a given bash version would be clearer.
Because it's almost impossible to ask every user to install the exact bash version on their system, we decided to bundle the exact version our scripts need: 4.3.x. Bundling GNU coreutils is along the same lines. Mac users wasted many hours/days if not weeks because of these since it's bash 3 and BSD. Docker could achieve something similar, but it avoids the portability problem rather than solving it, and it's too heavy just to get these few megs of portable dependencies (on my Mac: bash is 7MB, coreutils is 9MB).
- As long as we're bundling bash, should we be using it for all tests? An exception for a single .bats seems easily overlooked, and now our tests are targeting different platforms. You know more about bats than I do -- what's the best way to run them all on bash 4.3?
Sure. The example I commented was just for illustration. I agree it's better to run all tests with our own stuff, which is already the case for the rest, but the bash used by bats was an unfortunate exception that wasted your time. I think you can prefix the `make test` in .travis.yml with `dist/stage/bin/deepdive env`:
dist/stage/bin/deepdive env make test
Maybe our reliance on shell versions and scripts is too much... why not use something like Python, which seems to be more portable? ...
Mac users wasted many hours/days if not weeks because of these since it's bash 3 and BSD.
Was this before homebrew/macports/fink? All the Mac users I know install tons of packages with GNU/newer dependencies, but I don't think many of those actually bundle bash or coreutils. I wonder if we could revisit this later and simplify our build.
@chrismre Python has even more version issues (py2, py3, pypy, ...) and we already solved the shell version issue long ago. I don't think it's a good idea to revisit an already solved problem.
@shahin
Mac users wasted many hours/days if not weeks because of these since it's bash 3 and BSD.
Was this before homebrew/macports/fink? All the Mac users I know install tons of packages with GNU/newer dependencies, but I don't think many of those actually bundle bash or coreutils. I wonder if we could revisit this later and simplify our build.
The audience of the tool has been less technically skilled (scientists, doctors) than typical software engineers around you, and they often don't have control over the compute resources they want to run DeepDive on. I believe bundling even more, e.g., postgresql clients, would even lower the bar. Many users are in fact struggling and mentioning they waste most time at installing dependencies we don't bundle, e.g., postgresql clients, which can be easily added with reasonable defaults.
If we start distributing releases via Homebrew, APT, yum, then yes we should install the runtime dependencies natively rather than bundling. However, that requires more maintenance effort (otherwise, your package gets tossed out) and lower coverage (just Mac and Ubuntu/Debian?) than just bundling whatever we depend on. The bundle build is completely separate from the rest, so you can just comment out lines from extern/bundle.conf to omit them. I agree the initial build for developers takes too long, so it's a good idea to have developers install all runtime deps themselves and have the bundled dependencies built only at the time we're creating binary releases.
@netj bundling GP and PG clients are terrific ideas. They are causing lots of pain right now.
@netj @alldefector this is all tangential, sorry for hijacking this thread:
they often don't have control over the compute resources they want to run DeepDive on
Interesting! Sounds like I need to learn more about our users (and what you mean by "control"). The deepdive-users list is an obvious place to start, but it's sparse and biased. Do we have other data about our user base that I could check out?
Many users are in fact struggling and mentioning they waste most time at installing dependencies we don't bundle
Isn't this exactly what dependency management tools are for?
If we start distributing releases via Homebrew, APT, yum, then yes we should install the runtime dependencies natively rather than bundling. However, that requires more maintenance effort
In many ways a package manager reduces maintenance effort. Package managers take things like CPU architecture, security patches, bug fixes, API-stable updates, custom shared libraries, etc. into account when resolving dependencies. If we're saving time on maintenance today then it's because we're ignoring all these things, not because we've found a way to do them better than a package manager!
and lower coverage
The abstraction we get from ordinary dependency management seems likely to increase coverage, not lower it. If we just specify dependency names then DeepDive can be ported to any platform that implements those names, including the BSDs, Cygwin, or whatever. If we bundle dependency code or binaries, then it'll run only on platforms that we take the time to bundle dependencies for. What am I missing here?
sorry for hijacking this thread:
@shahin No worries. This is a perfect place/timing to discuss the matter.
Do we have other data about our user base that I could check out?
Unfortunately, we don't have a systematically collected profile of our user base (yet). But it's clear that there are many users who aren't proficient at operating/programming software and just want to extract databases from their dark data with minimal effort.
Many users are in fact struggling and mentioning they waste most time at installing dependencies we don't bundle
Isn't this exactly what dependency management tools are for?
Our typical users can't install any software via APT, yum, ports, etc. on their compute resources. They don't have root access, their IT management refuses to install anything, and they can't provide their own machines or cloud instances. That's what I meant by lack of "control." No matter how much effort we put into leveraging all that nice infrastructure laid down by package managers, a large number of users will need an alternative, more straightforward way.
In many ways a package manager reduces maintenance effort. Package managers take things like CPU architecture, security patches, bug fixes, API-stable updates, custom shared libraries, etc. into account when resolving dependencies. If we're saving time on maintenance today then it's because we're ignoring all these things, not because we've found a way to do them better than a package manager!
I agree that's the right way, but behind all those rigorous things, someone had to put time and effort into identifying and writing correct debian/control files, brew formulae, Portfiles, etc. Maintaining them requires even more effort as the dependent packages are also constantly evolving and can break your own package in nontrivial ways. We may consider publishing these packages once DeepDive gets stable enough, but I wouldn't consider package managers the exclusive way of distributing releases. I'd still argue for keeping standalone releases that simplify many users' lives.
and lower coverage
The abstraction we get from ordinary dependency management seems likely to increase coverage, not lower it. If we just specify dependency names then DeepDive can be ported to any platform that implements those names, including the BSDs, Cygwin, or whatever. If we bundle dependency code or binaries, then it'll run only on platforms that we take the time to bundle dependencies for.
DeepDive can already be ported to any POSIX-compliant platform by just running `make` on its source tree. As long as the bundled dependencies themselves are portable (most of them are also built from source), the DD source tree can produce a standalone tarball for exactly that platform. For the moment, we just publish two x86_64 binary releases for Mac and Linux (Ubuntu and RedHat). But we can always target more platforms with a `make`, e.g., if Bash on Windows turns out to work well. I was saying this approach has greater coverage than optimistically targeting a few package managers. In fact, having a portable source is a prerequisite to publishing packages in many venues anyway.
What am I missing here?
I guess you're missing that the bundled dependencies are also built from source. The DeepDive installer also takes care of installing, via APT, Homebrew, or yum, some runtime dependencies that aren't bundled, namely Java, Python, and Perl, which are pretty universally available. In the long run, I totally agree we should turn all dependencies into proper package manager specs instead of bundling any in the published package. (Keeping standalone releases is a separate issue.) Anyway, in general, hooking all the dependencies up to the rest of the OS packages may seem trivial, but the devil is always in the details, and it's often a moving target out of our control.
@netj: @shahin has a prototype `export_to_json` routine using psycopg2 that's only 1.5-3x slower than COPY to TSV. Maybe let's just bundle psycopg2 and call it a day here?
@alldefector Sounds like a good idea. Can the code be shared? Or I'd like to just revert all the unicode commits to restore Travis status to green (master has been broken for a while because of this, making everything else harder). The performance isn't a huge deal as this part should only activate for older PG and GP. It won't be hard to find a neat way to plug the psycopg2 code into the existing db-driver (`db-query`). Setting up a DeepDive-wide PYTHONPATH for bundling may require a bit of thought and work, which I can take care of.
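For reference, a rough sketch of the kind of psycopg2-based export being discussed (hypothetical code, not @shahin's actual prototype; function and argument names are made up):
import json
import psycopg2
def export_to_json(dsn, query, out):
    # A named (server-side) cursor streams rows without loading the whole
    # result into memory; psycopg2 already gives us proper Python values
    # for NULLs, arrays, and unicode text.
    with psycopg2.connect(dsn) as conn:
        with conn.cursor('export_to_json') as cur:
            cur.execute(query)
            colnames = None
            for row in cur:
                if colnames is None:
                    # for named cursors, description is available once the
                    # first row has been fetched
                    colnames = [d[0] for d in cur.description]
                out.write(json.dumps(dict(zip(colnames, row)),
                                     ensure_ascii=False, default=str) + '\n')
# e.g. export_to_json('dbname=deepdive', 'SELECT * FROM sentences', sys.stdout)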
@netj Great! @shahin can push the script in the next K days (perhaps in a separate PR).
If we could remove the dependency on "psql" (and others like "createdb"), that'd be fabulous! That would also make TSJ implementation much easier, I suppose.
@alldefector OK, then I can just restore things in this PR.
I think we'll soon remove the dependency on `psql` and other standard PG interfaces by bundling them. We don't need to put slow code in the critical data flow for aesthetic reasons (we've been bitten by that many times). Btw, TSJ is pretty much done on top of the standard PG interface in the other branch. In the future we could rewrite TSJ in a way that bypasses PG TSV, but I think it's going to look like PG UDFs in C/C++ rather than something in Python.
The old combination of (a) text-formatted psql output and (b) Python 2 unicode handling was causing a mess. This PR:
- switches to Postgres' CSV format to provide CSV input to `pgtsv_to_json`
- renames `pgtsv_to_json` to `pgcsv_to_json` to avoid confusion