TonicAI / condenser

Condenser is a database subsetting tool
https://www.tonic.ai
MIT License

Condenser hanging -- debugging options? #22

Open SimonGoring opened 3 years ago

SimonGoring commented 3 years ago

I've created a public gist with my config file and a link to a dump of the database I'm running condenser against. The issue I'm running into is that condenser appears to hang (over 24 hours with no new output on screen in verbose mode), and I'm not sure how to debug it or how to tell whether anything is actually happening.

I'm running condenser as part of a broader workflow through a bash script:

#!/bin/bash
#
# A bash script that uses `condenser` to export a database subset to a
# `localhost` database, then dumps that database and packs the dump into a tar
# file.
#
# Simon Goring - May 12, 2021
#

# First we check to see if the condenser files actually exist.
if [[ ! -f db_connect.py ]]
then
    echo "Condenser does not exist in the current directory."
    pip install toposort
    pip install psycopg2-binary
    pip install mysql-connector-python
    git clone --depth=1 git@github.com:TonicAI/condenser.git .
    # Remove the .git directory ("!$" history expansion is not available in scripts).
    rm -rf ./.git
fi

export PGPASSWORD='DATABASE PASSWORD'
psql -h localhost -U postgres -c "CREATE DATABASE export;"
echo "SELECT 'DROP SCHEMA '||nspname||' CASCADE; CREATE SCHEMA '||nspname||';' FROM pg_catalog.pg_namespace WHERE NOT nspname ~ '.*_.*'" | \
    psql -h localhost -d export -U postgres -t | \
    psql -h localhost -d export -U postgres
python3 direct_subset.py -v
echo "SELECT 'DROP SCHEMA '||nspname||' CASCADE;' FROM pg_catalog.pg_namespace WHERE nspname = ANY('{ap,da,doi,ecg,emb,gen,ti,ts,tmp}')" | \
    psql -h localhost -d export -U postgres -t | \
    psql -h localhost -d export -U postgres
now=$(date +"%Y-%m-%d")
mkdir -p dumps
mkdir -p archives
# Dump the subset in pg_dump's custom format (-Fc), without ownership (-O).
pg_dump -Fc -O -h localhost -U postgres -v -d export > "./dumps/${1}_dump_${now}.sql"
tar -cvf "./archives/${1}_dump_${now}.tar" -C ./dumps "${1}_dump_${now}.sql"
# -----------------------------------
# |  Clean up files and databases   |
# -----------------------------------
psql -h localhost -U postgres -c "DROP DATABASE export;"
rm ./dumps/$1_dump_${now}.sql
rmdir ./dumps

That's more an FYI about how we're trying to use it though. The key element is that we're just calling condenser with python3 direct_subset.py -v and the config file is linked above in the gist.

The goal of this issue is to note that there seems to be a point at which condenser is hanging, and to figure out a way to debug it so I can fix it.

theaeolianmachine commented 3 years ago

Hi @SimonGoring, do you have any details on the last log statements? I think adding additional logging would certainly be useful for debugging and determining where it's actually hanging.

Notably, to my knowledge condenser doesn't do anything fancy with, say, threads or anything else that could deadlock, so my guess is that it's a query timeout or a connection timeout to the database. It probably wouldn't be too hard to hook into the place where queries are issued and print the last issued query in a debug mode; I'd be happy to take a look at a PR for that.
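
As a rough illustration of the "print the last issued query" idea, here is a minimal sketch of a wrapper around a DB-API cursor that logs each statement before it runs, so that a hang leaves the offending query as the last line of output. This is not condenser's actual code: the `LoggingCursor` class, its `last_query` attribute, and the `query_debug` logger name are hypothetical, and the standard-library `sqlite3` module stands in for the real Postgres connection so the example runs anywhere.

```python
import logging
import sqlite3
import time

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("query_debug")


class LoggingCursor:
    """Wrap any DB-API 2.0 cursor and log each statement before it runs."""

    def __init__(self, cursor, log=logger):
        self._cursor = cursor
        self._log = log
        self.last_query = None  # most recent statement passed to execute()

    def execute(self, query, params=None):
        self.last_query = query
        # Log *before* executing: if the call never returns, this is the
        # last line in the output and identifies the stuck query.
        self._log.debug("issuing query: %s | params: %r", query, params)
        start = time.monotonic()
        if params is None:
            result = self._cursor.execute(query)
        else:
            result = self._cursor.execute(query, params)
        self._log.debug("finished in %.3fs", time.monotonic() - start)
        return result

    def __getattr__(self, name):
        # Delegate fetchone(), fetchall(), close(), etc. to the real cursor.
        return getattr(self._cursor, name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(message)s")
    conn = sqlite3.connect(":memory:")  # stand-in for the real database
    cur = LoggingCursor(conn.cursor())
    cur.execute("CREATE TABLE t (id INTEGER)")
    cur.execute("INSERT INTO t VALUES (?)", (1,))
    cur.execute("SELECT id FROM t")
    print(cur.fetchall())
```

Because the wrapper only relies on `execute()` and delegates everything else, the same approach should carry over to whichever cursor objects the tool creates, assuming queries are issued through a small number of call sites. The timestamps in the log format also show how long the process has been waiting on the last statement.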