Generated cohort SQL sometimes has incorrect ranges

melissachang commented 5 years ago

I noticed this during manual testing of #341. This bug is unrelated to #341.

Louis, do you mind fixing this? Let's do it after #341 is merged. It'll be easier for me to suggest a comment if I know exactly where the fix is.

Consider this facet from 1000 Genomes Data Explorer:

For simplicity, we label the bottom facet value 0B-9B. In reality this is 0B-10B where 0B is inclusive and 10B is exclusive.

Alternatives considered:

Mathematically this is [0, 10B) but most people don't know that. Even if they did, it's not as easy to read as 0B-9B.
0B-10B, 10B-20B. The nice thing about this is it's clear where 9.5B goes. However, it's not clear where 10B goes.

The above cohort selection is 1554 participants in Data Explorer:

The generated cohort SQL is:

SELECT DISTINCT t1.participant_id FROM (SELECT participant_id FROM `verily-public-data.human_genome_variants.1000_genomes_sample_info` WHERE ((Total_Exome_Sequence >= 0 AND Total_Exome_Sequence < 9000000000))) t1

Because the generated SQL uses an incorrect range, it returns 1194 participants.

If 9000000000 is changed to 10000000000, then the SQL correctly returns 1554 participants.

lgolowich commented 5 years ago

Are you experiencing this issue with the most recent commits in #341 ? Running the latest data explorer locally for me gives the query

SELECT DISTINCT t1.participant_id FROM ((SELECT participant_id FROM `verily-public-data.human_genome_variants.1000_genomes_sample_info` WHERE (Total_Exome_Sequence >= 0 AND Total_Exome_Sequence < 10000000000))) t1

(I'd noticed this behavior when adding time series support, and edited export_url_controller.py accordingly.)

melissachang commented 5 years ago

You're right, I see this is fixed on your branch. Thanks for fixing. If there isn't one already, do you mind adding a comment to the relevant code? Thanks.

lgolowich commented 5 years ago

Adding to _get_range_clause: "Returns an SQL clause specifying that column is in the range specified by value. Uses bucket_interval to avoid potentially ambiguous ranges such as 1.0B-1.9B, which really means [1B, 2B)."

DataBiosphere / data-explorer

Generated cohort SQL sometimes has incorrect ranges #343