DataBiosphere / azul

Metadata indexer and query service used for AnVIL, HCA, LungMAP, and CGP
Apache License 2.0

Queries sometimes return no rows #6667

Open nadove-ucsc opened 5 days ago

nadove-ucsc commented 5 days ago

During the reindex of the nadove6 deployment on 10/28/2024, this query returned no rows, raising a RequirementError:

@requestId | cb0ccc8b-df5e-5d6a-a7db-9311068e629e
@timestamp | 1730205653743
job_id | a23a055e-eed3-4fd2-b567-7ad1878b3cb1
query | SELECT anatomical_site, apriori_cell_type, biosample_id, biosample_type, datarepo_row_id, disease, donor_age_at_collection_lower_bound, donor_age_at_collection_unit, donor_age_at_collection_upper_bound, donor_id, part_of_dataset_id, source_datarepo_row_ids                FROM `datarepo-7268c3a0.ANVIL_ccdg_broad_ai_ibd_daly_kupcinskas_gsa_20240311_ANV5_202403121627.anvil_biosample`                WHERE biosample_id IN ('2d4fd521-b5a2-315c-e269-d44ccd845faf')
stats.searchStatistics.indexUnusedReasons.0.code | NOT_SUPPORTED_IN_STANDARD_EDITION
stats.searchStatistics.indexUnusedReasons.0.message | Index can not be used for query with Standard edition reservation. See https://cloud.google.com/bigquery/docs/editions-intro for more information.
stats.searchStatistics.indexUsageMode | UNUSED
total_rows | 0

The subsequent retry succeeded. It was reported as a cache hit, despite the result being different:

@requestId | 0a40bf45-815b-5ef3-88f2-e0df5ba5f012
@timestamp | 1730205930681
job_id | a785ecdc-d637-4dd0-a552-22ca650d50e9
query | SELECT anatomical_site, apriori_cell_type, biosample_id, biosample_type, datarepo_row_id, disease, donor_age_at_collection_lower_bound, donor_age_at_collection_unit, donor_age_at_collection_upper_bound, donor_id, part_of_dataset_id, source_datarepo_row_ids                FROM `datarepo-7268c3a0.ANVIL_ccdg_broad_ai_ibd_daly_kupcinskas_gsa_20240311_ANV5_202403121627.anvil_biosample`                WHERE biosample_id IN ('2d4fd521-b5a2-315c-e269-d44ccd845faf')
stats.cacheHit | 1
stats.searchStatistics.indexUnusedReasons.0.code | QUERY_CACHE_HIT
stats.searchStatistics.indexUnusedReasons.0.message | Search indexes are not used because the query was cached.
stats.searchStatistics.indexUsageMode | UNUSED
stats.totalBytesBilled | 0
stats.totalBytesProcessed | 0
total_rows | 1

This error occurred 64 times during the reindex. Every time the exception was raised, the row count was zero (there were no occurrences of an incomplete but nonempty result).
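Since the failures were always completely empty results and the retry succeeded, one possible stopgap is to re-run the query whenever it returns fewer rows than the number of requested IDs. The following is a minimal sketch of that idea, not Azul's actual retry logic. `fetch_with_retry`, `fetch_rows`, `IncompleteResult` and `flaky_fetch` are hypothetical names, and the check assumes each requested `biosample_id` matches exactly one row.

```python
import time
from typing import Callable, Sequence


class IncompleteResult(Exception):
    """Hypothetical stand-in for the error raised when rows are missing."""


def fetch_with_retry(fetch_rows: Callable[[Sequence[str]], list],
                     ids: Sequence[str],
                     attempts: int = 3,
                     delay_s: float = 1.0) -> list:
    """Call fetch_rows(ids) until it returns one row per requested ID.

    Assumes IDs are unique and each ID matches exactly one row.
    """
    rows = []
    for attempt in range(1, attempts + 1):
        rows = fetch_rows(ids)
        if len(rows) == len(ids):
            return rows
        if attempt < attempts:
            # Linear backoff between attempts
            time.sleep(delay_s * attempt)
    raise IncompleteResult(
        f'expected {len(ids)} rows, got {len(rows)} after {attempts} attempts'
    )


# Simulate the behaviour seen in the logs: empty at first, complete on retry.
calls = []


def flaky_fetch(ids):
    calls.append(list(ids))
    if len(calls) == 1:
        return []
    return [{'biosample_id': i} for i in ids]


rows = fetch_with_retry(flaky_fetch,
                        ['2d4fd521-b5a2-315c-e269-d44ccd845faf'],
                        delay_s=0)
print(len(rows), len(calls))
```

A retry like this only masks the symptom. It would not explain why the first attempt came back empty.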

nadove-ucsc commented 5 days ago

Note that the first query (which returned no rows) is lacking the stats.totalBytesBilled and stats.totalBytesProcessed fields. This might indicate that we're reading the rows too soon, before the query has completed.
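The difference between the two log records can be checked mechanically. The sketch below parses abbreviated copies of the records quoted above (query text and message fields omitted) and lists the fields that appear only in the successful retry, along with the time between the two requests. `parse_log_record` is a hypothetical helper written for this comparison, not part of Azul.

```python
def parse_log_record(text: str) -> dict:
    """Parse 'field | value' lines into a dict, splitting on the first ' | '."""
    record = {}
    for line in text.strip().splitlines():
        key, sep, value = line.partition(' | ')
        if sep:
            record[key.strip()] = value.strip()
    return record


# Abbreviated copies of the two records from this issue
FAILED = """
@requestId | cb0ccc8b-df5e-5d6a-a7db-9311068e629e
@timestamp | 1730205653743
job_id | a23a055e-eed3-4fd2-b567-7ad1878b3cb1
stats.searchStatistics.indexUnusedReasons.0.code | NOT_SUPPORTED_IN_STANDARD_EDITION
stats.searchStatistics.indexUsageMode | UNUSED
total_rows | 0
"""

RETRY = """
@requestId | 0a40bf45-815b-5ef3-88f2-e0df5ba5f012
@timestamp | 1730205930681
job_id | a785ecdc-d637-4dd0-a552-22ca650d50e9
stats.cacheHit | 1
stats.searchStatistics.indexUnusedReasons.0.code | QUERY_CACHE_HIT
stats.searchStatistics.indexUsageMode | UNUSED
stats.totalBytesBilled | 0
stats.totalBytesProcessed | 0
total_rows | 1
"""

failed = parse_log_record(FAILED)
retry = parse_log_record(RETRY)

# Fields present in the successful retry but absent from the failed attempt
missing_from_failed = sorted(retry.keys() - failed.keys())
# @timestamp is in milliseconds since the epoch
elapsed_s = (int(retry['@timestamp']) - int(failed['@timestamp'])) / 1000

print(missing_from_failed)
print(elapsed_s)
```

Besides the two byte-count fields, stats.cacheHit is also absent from the failed record, and the retry was logged roughly 277 seconds after the failed attempt.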