Closed melissachang closed 5 years ago
Are you experiencing this issue with the most recent commits in #341 ? Running the latest data explorer locally for me gives the query
SELECT DISTINCT t1.participant_id FROM ((SELECT participant_id FROM `verily-public-data.human_genome_variants.1000_genomes_sample_info` WHERE (Total_Exome_Sequence >= 0 AND Total_Exome_Sequence < 10000000000))) t1
(I'd noticed this behavior when adding time series support, and edited export_url_controller.py
accordingly.)
You're right, I see this is fixed on your branch. Thanks for fixing. If there isn't one already, do you mind adding a comment to the relevant code? Thanks.
Adding to _get_range_clause: "Returns an SQL clause specifying that column is in the range specified by value. Uses bucket_interval to avoid potentially ambiguous ranges such as 1.0B-1.9B, which really means [1B, 2B)."
I noticed this during manual testing of #341. This bug is unrelated to #341.
Louis, do you mind fixing this? Let's do it after #341 is merged. It'll be easier for me to suggest a comment if I know exactly where the fix is.
Consider this facet from 1000 Genomes Data Explorer:
For simplicity, we label the bottom facet value
0B-9B
. In reality this is0B-10B
where 0B is inclusive and 10B is exclusive.Alternatives considered:
[0, 10B)
but most people don't know that. Even if they did, it's not as easy to read as0B-9B
.0B-10B
,10B-20B
. The nice thing about this is it's clear where 9.5B goes. However, it's not clear where 10B goes.The above cohort selection is 1554 participants in Data Explorer:
The generated cohort SQL is:
Because the generated SQL uses an incorrect range, it returns 1194 participants.
If
9000000000
is changed to10000000000
, then the SQL correctly returns 1554 participants.