briot / geneapro

Genealogical software based on the GenTech data model
http://briot.github.com/geneapro/
GNU General Public License v2.0

Fix postgresql functionality in `queries.py`. #56

Closed. changeling closed this 5 years ago.

changeling commented 5 years ago

- Add a check for the database engine.
- Add a conditional CHUNK_SIZE for _sql_split() depending on the database backend.
- Add c.name to the GROUP BY clause in _query_get_sex(), as required by postgres.
- Add conditional initial and group_concat declarations (string_agg for postgres) for postgres functionality in get_ancestorss() and get_descendants() (see the sketch below).
- Minor change to query execution in get_descendants() to match the get_ancestorss() structure.
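
For reference, a minimal sketch of what the engine check and the group_concat/string_agg switch could look like, using Django's connection.vendor; the CHUNK_SIZE values and the helper name here are illustrative, not the actual code in queries.py:

```python
from django.db import connection

# Backend check: connection.vendor is 'sqlite' or 'postgresql' under Django.
IS_POSTGRES = connection.vendor == 'postgresql'

# Illustrative values: sqlite limits the number of bound variables per
# statement, so _sql_split() would use smaller chunks there.
CHUNK_SIZE = 5000 if IS_POSTGRES else 900

def group_concat(column):
    """Return the SQL fragment that concatenates `column` within a group.

    Postgres has no group_concat; string_agg(expr, delimiter) is the
    equivalent, and it needs a text argument and an explicit delimiter.
    """
    if IS_POSTGRES:
        return "string_agg({col}::text, ',')".format(col=column)
    return "group_concat({col})".format(col=column)
```

The explicit ',' delimiter for string_agg keeps the output identical to sqlite's group_concat default.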

briot commented 5 years ago

Excellent and timely PR! I was wondering whether you would have time to look at Postgresql, since sqlite is no match given the size of your database. I wonder how much of an improvement you saw by moving to Postgresql?

I am doing some work to optimize the Source View page by doing similar recursive queries and also using group_concat, so I'll take your patch into account for that. Will review soon

changeling commented 5 years ago

I'm not able to import yet, though I suspect that has more to do with the mess these big files are in and postgres being more rigid than sqlite. But it does make a huge difference! To get the data in, I used the amazing pgloader, which can take a sqlite db and transfer its schema and data into a postgres database. I haven't spent any time on benchmarking, but you might appreciate that all operations generally take between 80 and 190 seconds with the following dataset:

[Screenshot: dataset statistics]

Next, I'll load in the ManyMany generation-deep data and take a look.

The debounce and paged list views are really great! I have some thoughts regarding the interface that I'll add to an issue after I sleep for a bit. Cheers!

briot commented 5 years ago

> I'm not able to import yet, though I suspect that has more to do with the mess these big files are in and postgres being more rigid than sqlite. But it does make a huge difference! To get the data in, I used the amazing pgloader, which can take a sqlite db and transfer its schema and data into a postgres database. I haven't spent any time on benchmarking, but you might appreciate that all operations generally take between 80 and 190 seconds with the following dataset:

80 to 190 seconds seems like a huge amount of time to display a page. We still need to do quite a lot of tweaking if those are the numbers you have. I was hoping for much, much faster times.

changeling commented 5 years ago

I suspect that had to do with a slightly out of date sqlite database. I imported the deep file into a fresh one, then created a postgres database from that, and now most operations, at 20 generations, are taking around 30 seconds.

briot commented 5 years ago

> I suspect that had to do with a slightly out of date sqlite database. I imported the deep file into a fresh one, then created a postgres database from that, and now most operations, at 20 generations, are taking around 30 seconds.

Still a lot. Could you identify which query is the slowest, and perhaps run “explain analyze” on it in psql?

Thanks
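
For reference, the plan can also be captured from within the app rather than psql; here is a minimal sketch using Django's cursor, where the query text is only a placeholder for whichever query turns out to be the slowest:

```python
from django.db import connection

# Placeholder: substitute the actual slow query and its parameters.
slow_query = "SELECT id FROM some_table WHERE id = %s"
params = [1]

with connection.cursor() as cur:
    # EXPLAIN ANALYZE actually runs the statement and reports real timings.
    cur.execute("EXPLAIN ANALYZE " + slow_query, params)
    for (line,) in cur.fetchall():
        print(line)
```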

changeling commented 5 years ago

> Still a lot. Could you identify which query is the slowest, and perhaps run “explain analyze” on it in psql? Thanks

I'll look at those, yep! Another thing: with the file that goes back many, many generations, both quilts and stats chew up gigabyte upon gigabyte of disk space (>15GB) and consume all of my available RAM on a 16GB MacBook Pro until the machine dies. I'll put that in an issue.

changeling commented 5 years ago

(To be clear, this is with postgres, which, I believe, is the RAM culprit. I'll look into where my disk space is going, but first I'm testing stats and quilts with sqlite at the moment.)

briot commented 5 years ago

> (To be clear, this is with postgres, which, I believe, is the RAM culprit. I'll look into where my disk space is going, but first I'm testing stats and quilts with sqlite at the moment.)

BTW, can you also try running

VACUUM ANALYZE

in postgresql? In my experience, this results in significant improvements when you have just loaded a large table.
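
A note in case you want to script it: VACUUM cannot run inside a transaction block, so it needs an autocommit connection. A minimal sketch using psycopg2, with placeholder connection settings:

```python
import psycopg2

# Placeholder connection settings; use the real database name/credentials.
conn = psycopg2.connect(dbname="geneapro")
conn.autocommit = True  # VACUUM must run outside a transaction block
with conn.cursor() as cur:
    cur.execute("VACUUM ANALYZE")
conn.close()
```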

changeling commented 5 years ago

I'll look at doing VACUUM ANALYZE when I get back to postgres, though the automatic VACUUM helpers are already very active processes as the RAM dwindles. The sqlite stats test has been running since my last comment and has used about 20 GB of disk space so far. The memory management seems drastically better, without much impact on memory: I started with around 80% free, and am hovering around 70% throughout the run.

changeling commented 5 years ago

I had to stop the sqlite stats test after 3 hours (and 34GB disk space used) as my available disk space was down to 162MB remaining. This is with the ManyMany...ged file I believe I've shared with you.

briot commented 5 years ago

> I had to stop the sqlite stats test after 3 hours (and 34GB disk space used) as my available disk space was down to 162MB remaining. This is with the ManyMany...ged file I believe I've shared with you.

For the quilts view, I need to improve the code a bit: even though you might have requested only 10 generations in the GUI, the backend is still going to compute the full depth, so that’s no good.
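
A possible shape for that fix, assuming the ancestors are fetched with a recursive query: carry a generation counter and stop at the requested depth. The table and column names below are illustrative, not the actual geneapro schema:

```python
from django.db import connection

def ancestor_ids(person_id, max_gen):
    """Return ancestor ids up to max_gen generations (illustrative schema)."""
    sql = """
        WITH RECURSIVE ancestors(id, generation) AS (
            SELECT %s, 0
            UNION ALL
            SELECT link.parent_id, a.generation + 1
            FROM parent_link AS link
            JOIN ancestors AS a ON link.child_id = a.id
            WHERE a.generation < %s  -- stop at the depth requested in the GUI
        )
        SELECT DISTINCT id FROM ancestors
    """
    with connection.cursor() as cur:
        cur.execute(sql, [person_id, max_gen])
        return [row[0] for row in cur.fetchall()]
```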

You likely also have a similar issue with the pedigree if you use the “Custom Theme Demo” colors. They need to compute which persons are in your tree, and that has to be the slow query here (it is also needed by the Stats view). I am not sure how much it could be improved, but 1 million persons should be no problem for Postgres.

I do have the ManyMany files, though I don’t think I had 1 million persons there. I’ll do some testing in a few days

changeling commented 5 years ago

The ManyMany file is around 220,000 people, but many generations, with no descendants, only ancestors. I'm actually no longer sure how many generations it spans (that would be a neat feature for the dashboard: deepest generation in the tree). It might also be a good idea to set the zoom bounds in the different graphics to the selected generation level; with extreme charts, the current zoom level is insufficient to get the entire image onscreen.

The GrandSchemeTree, with ~1 million people, doesn't seem to have any problems, though it's only about 8 generations deep, with 8 generations of descendants.

Pedigree is no problem with the preset themes; it's one of the ~30-second queries, as are all the others when set to their max. I've even tweaked the various Side.tsx files to go way, way out, and they run fine. I'll try them with custom themes (very cool feature, by the way).

Radial can get weird at the later generations, as can the Fan charts, likely due to problems in the data or overlapping descendants.

Here are examples of deep Radial and Fan charts:

Radial:

[Screenshot: deep Radial chart]

Fan:

[Screenshot: deep Fan chart]

briot commented 5 years ago

> I'm generally averse to migrations unless models etc. change in code, as they can eventually grow into a management nightmare. That said, perhaps that's the way to go. There's some discussion of the problem here: https://stackoverflow.com/questions/47637652/can-we-define-a-group-concat-function-in-postgresql
>
> Ideally there would be a portable equivalent approach that is database agnostic, but for now this might be the way to go.

I was thinking about it, and perhaps that group_by can be done in python instead. I’ll have a look.
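
A minimal sketch of what that could look like, assuming the rows come back as (person_id, value) pairs; the row shape and helper name are hypothetical:

```python
from itertools import groupby
from operator import itemgetter

def concat_by_person(rows):
    """Collapse (person_id, value) rows into {person_id: 'v1,v2,...'}."""
    # groupby only merges consecutive items, so sort by person_id first.
    rows = sorted(rows, key=itemgetter(0))
    return {
        person_id: ",".join(str(value) for _, value in group)
        for person_id, group in groupby(rows, key=itemgetter(0))
    }

# Example: concat_by_person([(1, 'a'), (2, 'b'), (1, 'c')]) -> {1: 'a,c', 2: 'b'}
```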

changeling commented 5 years ago

Sorry I had that ongoing thread in here. I moved most of it into https://github.com/briot/geneapro/issues/59.

briot commented 5 years ago

I just committed significant patches that create conflicts here. Mostly this is because queries.py was moved into a new subdirectory, as sql/sqlsets.py. I had been waiting to see whether you could submit a revised PR first, but after a few days I feel better pushing my changes... Sorry for the extra work; let me know if you need help.

briot commented 5 years ago

You provided a different patch in another PR.