OHDSI / WhiteRabbit

WhiteRabbit is a small application that can be used to analyse the structure and contents of a database as preparation for designing an ETL. It comes with RabbitInAHat, an application for interactive design of an ETL to the OMOP Common Data Model with the help of the the scan report generated by White Rabbit.
http://ohdsi.github.io/WhiteRabbit
Apache License 2.0
173 stars 85 forks source link

Faster sampling for Postgres? #387

Open schuemie opened 9 months ago

schuemie commented 9 months ago

Currently sampling on Postgres uses ORDER BY RANDOM() LIMIT (see here), which means the server first has to order the entire table randomly, and then take the top n rows.

Perhaps TABLESAMPLE could be used for better performance? (Similar to MSSQL)

blootsvoets commented 9 months ago

Makes sense. Bigquery supports it as well, but in beta only. Limitations:

In order to limit the impact of these limitations, we could take the following approach.

// overestimate the number of samples by a factor 2
int percentage = 2 * 100 * sampleSize / tableCount;
// table sample has an effect: less than 50% of the number of records will be sampled.
if (percentage < 50) {
    // avoid too small samples
    if (percentage < 2) {
       percentage = 2;
    }
    query = "SELECT * FROM " + table + " TABLESAMPLE SYSTEM (" + percentage + ") ORDER BY RANDOM() LIMIT " + sampleSize;
    // optional: resample without table sample if the number of records is still lower than the sampleSize.
} else {
    query = "SELECT * FROM " + table + " ORDER BY RANDOM() LIMIT " + sampleSize;
}
schuemie commented 9 months ago

Yes, that sounds like a good idea. The fact that the sample isn't truly random and is not guaranteed to return the exact number of rows should be ok for this purpose (IMHO).

janblom commented 4 months ago

@schuemie do you think this is worth taking along with the upcoming 1.0 release of whiterabbit?

I have a bit of doubt about the combination of taking a TABLESAMPLE and then order the result by ramdom. I think I get why it is done but wouldn't that cancel part of the performance improvement?