DataBiosphere / data-explorer

BSD 3-Clause "New" or "Revised" License
10 stars 6 forks source link

Generated cohort SQL sometimes has incorrect ranges #343

Closed melissachang closed 5 years ago

melissachang commented 5 years ago

I noticed this during manual testing of #341. This bug is unrelated to #341.

Louis, do you mind fixing this? Let's do it after #341 is merged. It'll be easier for me to suggest a comment if I know exactly where the fix is.

Consider this facet from 1000 Genomes Data Explorer:

1

For simplicity, we label the bottom facet value 0B-9B. In reality this is 0B-10B where 0B is inclusive and 10B is exclusive.

Alternatives considered:

The above cohort selection is 1554 participants in Data Explorer:

1

The generated cohort SQL is:

SELECT DISTINCT t1.participant_id FROM (SELECT participant_id FROM `verily-public-data.human_genome_variants.1000_genomes_sample_info` WHERE ((Total_Exome_Sequence >= 0 AND Total_Exome_Sequence < 9000000000))) t1

Because the generated SQL uses an incorrect range, it returns 1194 participants.

If 9000000000 is changed to 10000000000, then the SQL correctly returns 1554 participants.

lgolowich commented 5 years ago

Are you experiencing this issue with the most recent commits in #341 ? Running the latest data explorer locally for me gives the query

SELECT DISTINCT t1.participant_id FROM ((SELECT participant_id FROM `verily-public-data.human_genome_variants.1000_genomes_sample_info` WHERE (Total_Exome_Sequence >= 0 AND Total_Exome_Sequence < 10000000000))) t1

(I'd noticed this behavior when adding time series support, and edited export_url_controller.py accordingly.)

melissachang commented 5 years ago

You're right, I see this is fixed on your branch. Thanks for fixing. If there isn't one already, do you mind adding a comment to the relevant code? Thanks.

lgolowich commented 5 years ago

Adding to _get_range_clause: "Returns an SQL clause specifying that column is in the range specified by value. Uses bucket_interval to avoid potentially ambiguous ranges such as 1.0B-1.9B, which really means [1B, 2B)."