This is something we can add in Sprint 8, or possibly Sprint 7 if we have time.
We can also consider looking into BigQuery BI engine as a mechanism to speed up BQ queries, if those turn out to be too slow. https://cloud.google.com/bi-engine/docs
Per the current documentation, BI Engine is only available for use with Data Studio, and so wouldn't be usable (I think? but we need to dig into it) for returning query results to our front end. But perhaps they'll change that in time.
Is this issue in the roadmap for Sprint 8?
Per discussion, @s-paquette will also add documents describing the process of populating Solr with content to a shared team folder.
The document is available in the Developer Documentation folder, which is in IDC Team/IDC Deliverables.
Thank you @s-paquette! I moved the "Developer documentation" into the "Team operational documents" here. "Deliverables" is the folder we share with Todd and Keyvan, and is for external consumption.
@s-paquette can you also add discussion in that document of where Solr queries are defined in the webapp, and how the front end is connected to the queries?
So the queries aren't defined in the WebApp anywhere; we have a query-builder method which takes a set of filters in JSON format and constructs the queries (for Solr and for BQ). These are the two helper modules we referenced some time ago in another ticket--did you want some documentation on them? (Though, I'm not sure what I would document, as they're methods that go through the filter list and then construct a query for the appropriate search backend; there's not much to describe.)
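To make the filter-to-query pattern concrete, here is a minimal, hypothetical sketch (not the actual IDC-Common helpers; the function names and filter shape are invented for illustration) of turning a JSON filter set into a Solr fq string and a parameterized BQ WHERE clause:

def build_solr_fq(filters):
    # filters, e.g.: {"BodyPartExamined": ["LUNG"], "Modality": ["CT", "MR"]}
    clauses = []
    for field, values in filters.items():
        quoted = " OR ".join('"{}"'.format(v) for v in values)
        clauses.append("{}:({})".format(field, quoted))
    return " AND ".join(clauses)

def build_bq_where(filters):
    # Produces a parameterized WHERE clause plus a dict of parameter values,
    # mirroring the IN UNNEST(@param) style shown later in this thread.
    clauses, params = [], {}
    for field, values in filters.items():
        pname = "{}_0".format(field)
        clauses.append("LOWER({}) IN UNNEST(@{})".format(field, pname))
        params[pname] = [v.lower() for v in values]
    return " AND ".join(clauses), params

filters = {"BodyPartExamined": ["LUNG"], "Modality": ["CT", "MR"]}
print(build_solr_fq(filters))      # BodyPartExamined:("LUNG") AND Modality:("CT" OR "MR")
print(build_bq_where(filters)[0])  # LOWER(BodyPartExamined) IN UNNEST(@BodyPartExamined_0) AND ...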
The reason I am asking these questions is that I want to understand how new facets added to the webapp GUI should be propagated, and where they need to be propagated to. I also would like to understand the limitations of using Solr (e.g., right now a subset of columns is exported - can we export the whole BQ table with 500 or however many columns? How will it handle content that is stored in STRUCTs?). And it all goes back to the original request on this issue - if we could experiment with querying BQ directly and establish current performance, perhaps all of those questions about Solr could become irrelevant.
Understood--that process is somewhat separate from how things are queried, and more closely related to the web framework, Django, which controls the UI. We can export the whole set of columns; it's just a matter of cost, since Solr runs on VM disks, which are somewhat expensive compared to BQ, in exchange for faster responses. It's better to think of Solr as a text searcher than as any form of database--i.e., data stored in anything beyond a simple field-value format has to be converted into JSON text before it can be searched.
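As a rough illustration of that last point (the input record shape and field names here are invented, and this is not the actual IDC ETL code), nested BQ STRUCT/ARRAY content has to be flattened into plain field-value pairs before Solr can index it:

def flatten(record, prefix=""):
    """Flatten nested dicts/lists (e.g. a BQ STRUCT/ARRAY row exported as JSON)
    into the flat field:value form a Solr document expects."""
    flat = {}
    if isinstance(record, dict):
        for key, value in record.items():
            name = "{}_{}".format(prefix, key) if prefix else key
            flat.update(flatten(value, name))
    elif isinstance(record, list):
        if all(not isinstance(v, (dict, list)) for v in record):
            flat[prefix] = record  # Solr supports multi-valued fields
        else:
            for i, value in enumerate(record):
                flat.update(flatten(value, "{}_{}".format(prefix, i)))
    else:
        flat[prefix] = record
    return flat

row = {"PatientID": "TCGA-XX-0001", "clinical": {"bmi": 24.3, "age_at_diagnosis": 61}}
print(flatten(row))
# {'PatientID': 'TCGA-XX-0001', 'clinical_bmi': 24.3, 'clinical_age_at_diagnosis': 61}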
@G-White-ISB can you point me to the location where BQ queries are defined?
@fedorov The queries aren't defined; they're built based on the selected filters and available tables. The code that does this for faceted counts is here: https://github.com/ImagingDataCommons/IDC-Common/blob/master/idc_collections/collex_metadata_utils.py#L78
And the code which does it for the list of files is here: https://github.com/ImagingDataCommons/IDC-Common/blob/master/idc_collections/collex_metadata_utils.py#L268
I'm going to test adjusting the faceted count query to see if it can be sped up at all, but because we're parallelizing the requests already, the single longest job (and the time it takes to get the data response from Google) is the rate-limiting step. So I don't expect to see an improvement much below about a 25s turnaround. (It's about 30s right now, though some of that will be API call overhead, so if I can shrink the call count we could save a little.)
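To spell out why the slowest job bounds the turnaround, here is a toy sketch only (the delays and the run_facet_count stand-in are made up, not measurements of the real system):

import time
from concurrent.futures import ThreadPoolExecutor

def run_facet_count(facet):
    # Stand-in for a per-facet BQ count query; delays are purely illustrative.
    delays = {"Modality": 5, "BodyPartExamined": 8, "collection_id": 25}
    time.sleep(delays[facet])
    return facet, delays[facet]

start = time.time()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_facet_count, ["Modality", "BodyPartExamined", "collection_id"]))
print(results, "elapsed ~", round(time.time() - start), "s")  # ~25s: the slowest job dominates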
Could one track down the actual queries from a BigQuery log on Google's end? Right now we can assume there are not too many people using the dev site at one time. George
Yes, though it'd require using the BigQuery API. I don't think that you can pull up API-submitted query information in the console.
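For reference, a sketch (assuming the google-cloud-bigquery Python client and sufficient IAM permissions on whichever project actually runs the WebApp's queries; the project ID below is just the one whose tables appear in this thread) of listing recent jobs and their query text through the API:

from google.cloud import bigquery

client = bigquery.Client(project="idc-dev-etl")

# all_users=True requires permission to view other principals' jobs,
# including those submitted by the WebApp's Service Account.
for job in client.list_jobs(all_users=True, max_results=20):
    if job.job_type == "query":
        print(job.job_id, job.user_email)
        print(job.query)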
Thank you for the pointers Suzanne, that's what I was looking for.
> Yes, though it'd require using the BigQuery API. I don't think that you can pull up API-submitted query information in the console.
I actually think you can - as an example, I can see the Data Studio-issued queries in the log. But the issue is that I do not have permission to see queries at the project level. Maybe we can check with @wlongabaugh if he can see those queries?
Data Studio is recording personal queries, I think, not queries run by a Service Account. If I log in to my owner account on the project, no project queries are listed.
It's possible there's a way to view Service Account jobs in the console, but I've not found it.
@s-paquette would it be difficult, and would it make sense, to log the BQ query string to the console every time a query is issued? I can't think of how this could hurt, and I think it could be quite helpful for debugging and at least trying to optimize the query. What do you think?
By console do you mean the logs? At the moment, that would be a fairly large amount of logging--we run a separate count query for each facet, plus the document listing query. I can probably collapse it into just two (still investigating that), but it would still be a rather large log output every time someone changes a filter.
We do print them when we need to debug, but there's not much optimizing to really do; SQL count queries amount to:
#standardSQL
SELECT dicom_metadata.Modality AS Modality, COUNT(DISTINCT dicom_metadata.PatientID) AS count
FROM `idc-dev-etl.tcia.dicom_metadata` dicom_metadata
WHERE ((LOWER(dicom_metadata.BodyPartExamined) IN UNNEST(@BodyPartExamined_0)) OR @BodyPartExamined_0_filtering = 'not_filtering')
  AND ((LOWER(dicom_metadata.collection_id) IN UNNEST(@collection_id_0)) OR @collection_id_0_filtering = 'not_filtering')
GROUP BY Modality
In the case of a continuous numeric range, like BMI or age_at_diagnosis:
#standardSQL
SELECT (CASE WHEN clinical.bmi < 18.5 THEN 'underweight'
             WHEN clinical.bmi > 30 THEN 'obese'
             WHEN clinical.bmi BETWEEN 18.5 AND 25 THEN 'normal weight'
             WHEN clinical.bmi BETWEEN 25 AND 30 THEN 'overweight'
             WHEN clinical.bmi IS NULL THEN 'none' END) AS bmi,
       COUNT(DISTINCT dicom_metadata.PatientID) AS count
FROM `idc-dev-etl.tcia.dicom_metadata` dicom_metadata
JOIN `isb-cgc.TCGA_bioclin_v0.Clinical` clinical
ON clinical.case_barcode = dicom_metadata.PatientID
WHERE ((LOWER(dicom_metadata.BodyPartExamined) LIKE LOWER(@BodyPartExamined_0)) OR @BodyPartExamined_0_filtering = 'not_filtering')
  AND ((LOWER(dicom_metadata.collection_id) IN UNNEST(@collection_id_0)) OR @collection_id_0_filtering = 'not_filtering')
GROUP BY bmi
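For completeness, a sketch of how a parameterized count query like the first one above could be submitted through the google-cloud-bigquery client (the parameter values here are made up, and the collection_id filter is omitted for brevity):

from google.cloud import bigquery

client = bigquery.Client()

query = """
#standardSQL
SELECT dicom_metadata.Modality AS Modality, COUNT(DISTINCT dicom_metadata.PatientID) AS count
FROM `idc-dev-etl.tcia.dicom_metadata` dicom_metadata
WHERE ((LOWER(dicom_metadata.BodyPartExamined) IN UNNEST(@BodyPartExamined_0))
       OR @BodyPartExamined_0_filtering = 'not_filtering')
GROUP BY Modality
"""

job_config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ArrayQueryParameter("BodyPartExamined_0", "STRING", ["lung", "chest"]),
    bigquery.ScalarQueryParameter("BodyPartExamined_0_filtering", "STRING", "filtering"),
])

for row in client.query(query, job_config=job_config).result():
    print(row["Modality"], row["count"])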
Thank you @s-paquette! This is very helpful. I do think there may be opportunities to optimize though.
First, you are querying the metadata table in the idc-dev-etl:tcia dataset. This is overkill, since the dicom_metadata table in that dataset contains all of the TCIA collections that were hosted by GHC at the time the table was populated. We unnecessarily penalize performance by querying data that is not going to be included in the MVP. Should we switch over to the tables discussed here sometime soon: https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/492#issuecomment-621470566 ?
Second, I think it might make sense to exploit the organization of the data and pre-generate tables that summarize the facets of interest at the patient/study/series level. This could reduce the amount of data that needs to be queried by orders of magnitude - right now the data is extremely redundant for the purposes of our queries: we have one row for each instance, which means a chest CT scan can have hundreds of rows in which all of the facets we care about have identical values, while our cohorts are built at the case level.
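For illustration, one way such a pre-aggregated table could be produced (a sketch only; the summary table name is hypothetical, and only columns already mentioned in this thread are used):

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE OR REPLACE TABLE `idc-dev-etl.tcia.dicom_case_summary` AS
SELECT PatientID,
       collection_id,
       ARRAY_AGG(DISTINCT Modality IGNORE NULLS) AS Modality,
       ARRAY_AGG(DISTINCT BodyPartExamined IGNORE NULLS) AS BodyPartExamined
FROM `idc-dev-etl.tcia.dicom_metadata`
GROUP BY PatientID, collection_id
"""

client.query(ddl).result()  # one row per case/collection instead of one per instance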
Can you give me a complete set of queries that are being run for some sample search? I just would like to have the exact queries to compare performance. For the UI, maybe we could have some kind of a developer toggle to get the BQ query? I think the end users might also be interested in that if they want to reproduce their search outside of the IDC cohort builder.
I am sorry if this sounds like I am trying to be a PIA, and I don't mean to add a lot of extra work for you. I just wanted to try a bit to look into what we can actually squeeze out of BQ in a meaningful way. I hope this is helpful. Let me know if it is easier to have a call to discuss this.
We can switch over to one of the other tables, absolutely. That's a simple change as long as the table has the columns we want. Note, though, that additional columns don't impact BQ performance, because BQ is column-based in its distribution. This means that at query time it grabs only the columns you care about--which are very few, in our case, compared to the whole table--and then proceeds to the next part of the query. As a result, table width has a minimal performance impact.
We can get an improvement out of, as you said, case-collapsed tables. This is how it's set up in ISB-CGC. So if we can set up some abbreviated tables, that would reduce the amount of data to process. However, the improvement is likely to be only a couple of seconds at most. The real latency comes from submitting the jobs and waiting for them to come back, and BQ is inherently latent. That's simply how it's designed.
A toggle to print the queries into the logs would be easy enough; to return them to the UI would require a fair amount more effort. They also might not be useful to the users, as they're parametrized queries, meant to be run via an API call, and the BQ parameter structure is non-trivial JSON which can't be submitted via the console.
Ultimately, the real issue is that faceted counting for a UI isn't performant in a SQL system, and BigQuery wasn't designed to back a responsive UI on top of that. It's a good analytical engine, which is really where we want to use it; we can provide tables, views, and queries to users which will help them get used to using it on their own. Google might get it there eventually for UI purposes, but for now it's not the best option in terms of user experience.
And I'm happy to meet to go over more of this; let me know some good times.
> That's a simple change as long as the table has the columns we want. Note, though, that additional columns don't impact BQ performance, because BQ is column-based in its distribution.
Yes, I understand. The table I am talking about has far fewer rows. The one you use right now has 9,803,590 rows, and the one I suggest we switch to has 2,647,993.
> A toggle to print the queries into the logs would be easy enough; to return them to the UI would require a fair amount more effort. They also might not be useful to the users, as they're parametrized queries, meant to be run via an API call, and the BQ parameter structure is non-trivial JSON which can't be submitted via the console.
Yes, I noticed that, and I was wondering about it. I thought (since you use UNNEST) that those are arrays, but I definitely need to learn more.
I get your arguments, but even if (very likely) we won't get to comparable performance with BQ, at the very least we will have a fair comparison, and we might be able to optimize the Solr data organization as well. The interface can't ever be too fast, I think.
Can you slack me to try to discuss a bit more and see if it makes sense to have a call?
As discussed with @s-paquette, it should be quite easy to add a button to the portal that would allow the user to get the non-parameterized BQ query corresponding to the current cohort. @s-paquette, should this be in a separate issue, or should we keep the current one open?
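A hedged sketch of what producing that non-parameterized query could look like (not the actual WebApp implementation; the helper below simply inlines parameter values into the SQL text so it can be pasted into the BQ console):

def inline_parameters(query, params):
    """params: {name: value}, where value is a string or a list of strings.
    Names are replaced longest-first so @X_0 does not clobber @X_0_filtering."""
    for name in sorted(params, key=len, reverse=True):
        value = params[name]
        if isinstance(value, list):
            literal = "[{}]".format(", ".join("'{}'".format(v) for v in value))
        else:
            literal = "'{}'".format(value)
        query = query.replace("@{}".format(name), literal)
    return query

query = ("SELECT Modality FROM `idc-dev-etl.tcia.dicom_metadata` "
         "WHERE LOWER(BodyPartExamined) IN UNNEST(@BodyPartExamined_0)")
print(inline_parameters(query, {"BodyPartExamined_0": ["lung", "chest"]}))
# -> ... WHERE LOWER(BodyPartExamined) IN UNNEST(['lung', 'chest'])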
We discussed this before, and I think we agreed that it would be helpful, at least while we experiment with pre-MVP, to be able to test query performance of Solr vs BigQuery. I think it would be great to have a toggle to choose the search route. Would this be possible to add?
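As a final sketch of what such a toggle might look like (placeholder functions only, assuming something like a Django setting or request parameter choosing the backend; this is not the real IDC-Common API):

def query_solr(filters):
    return {"backend": "solr", "filters": filters}  # placeholder for the Solr path

def query_bigquery(filters):
    return {"backend": "bq", "filters": filters}    # placeholder for the BQ path

def get_metadata(filters, backend="solr"):
    # backend could come from a setting or e.g. request.GET.get("search_backend", "solr")
    return query_bigquery(filters) if backend == "bq" else query_solr(filters)

print(get_metadata({"Modality": ["CT"]}, backend="bq"))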