ImagingDataCommons / IDC-WebApp

Web Application front end for IDC (CORE REPO)
Apache License 2.0

Add a toggle to use BigQuery instead of Solr for searching #54

Closed · fedorov closed this issue 4 years ago

fedorov commented 4 years ago

We discussed this before, and I think we agreed that it would be helpful, at least while we experiment with pre-MVP, to be able to test query performance of Solr vs BigQuery. I think it would be great to have a toggle to choose the search route. Would this be possible to add?

s-paquette commented 4 years ago

This is something we can add in Sprint 8, or possibly Sprint 7 if we have time.

fedorov commented 4 years ago

We can also consider looking into BigQuery BI Engine as a mechanism to speed up BQ queries, if those turn out to be too slow: https://cloud.google.com/bi-engine/docs

s-paquette commented 4 years ago

Per the current documentation, BI Engine is only available for use with Data Studio, and so wouldn't be usable (I think? but we need to dig into it) for returning query results to our front end. But perhaps they'll change that in time.

fedorov commented 4 years ago

Is this issue in the roadmap for Sprint 8?

fedorov commented 4 years ago

Per discussion, @s-paquette will also add to a shared team folder documents describing the process of populating Solr with content.

s-paquette commented 4 years ago

The document is available in the Developer Documentation folder, which is in IDC Team/IDC Deliverables.

fedorov commented 4 years ago

Thank you @s-paquette! I moved the "Developer documentation" into the "Team operational documents" here. "Deliverables" is the folder we share with Todd and Keyvan, and is for external consumption.

fedorov commented 4 years ago

@s-paquette can you also add discussion in that document of where Solr queries are defined in the webapp, and how the front end is connected to the queries?

s-paquette commented 4 years ago

So the queries aren't defined in the WebApp anywhere; we have a query-builder method which takes a set of filters in JSON format and constructs the queries (for Solr and for BQ). These are the two helper modules we referenced some time ago in another ticket--did you want some documentation on them? (Though, I'm not sure what I would document, as they're methods that go through the filter list and then construct a query for the appropriate search backend; there's not much to describe.)
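To make the query-builder idea concrete, here is a minimal sketch of translating a JSON filter set into both a Solr filter query and a parameterized BQ WHERE clause. The function and field names are illustrative only, not the actual IDC-Common code:

```python
# Hypothetical sketch of the filter-to-query translation described above.
# Function names and the @<field>_0 parameter convention are assumptions
# modeled on the query examples later in this thread.
def build_solr_query(filters):
    """Translate a {field: [values]} filter dict into Solr fq clauses."""
    return [
        "{}:({})".format(field, " OR ".join('"{}"'.format(v) for v in values))
        for field, values in sorted(filters.items())
    ]

def build_bq_where(filters):
    """Translate the same filters into a parameterized BQ WHERE clause."""
    clauses = [
        "(LOWER({f}) IN UNNEST(@{f}_0))".format(f=field)
        for field in sorted(filters)
    ]
    return " AND ".join(clauses)

filters = {"BodyPartExamined": ["chest"], "Modality": ["CT", "MR"]}
print(build_solr_query(filters))
print(build_bq_where(filters))
```

The point is that a single filter structure drives both backends, so the UI never needs to know which search engine is answering.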

fedorov commented 4 years ago

The reason I am asking these questions is that I want to understand how new facets added to the webapp GUI should be propagated, and where they should be propagated to. I would also like to understand the limitations of using Solr (e.g., right now a subset of columns is exported -- can we export the whole BQ table with 500 or however many columns? How will it handle content that is stored in STRUCTs?) And it all goes back to the original request in this issue: if we could experiment with querying BQ directly and establish current performance, perhaps all of those questions about Solr would become irrelevant.

s-paquette commented 4 years ago

Understood--that process is somewhat separate from how things are queried, and more closely related to the web framework, Django, which controls the UI. We can export the whole set of columns; it's just a matter of cost: Solr runs on VM disks, which are somewhat expensive compared to BQ, in exchange for faster responses. It's better to think of Solr as a text searcher than as any form of database--i.e., data stored in anything beyond a simple field-value format has to be converted into JSON text before it can be searched.
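As an illustration of the flattening requirement just described, a nested record (e.g., a STRUCT-like column coming out of BQ) has to be collapsed into the simple field:value document shape Solr can index. The record shape and dotted-name convention below are purely illustrative:

```python
import json

# Illustrative only: a nested record flattened into a flat Solr-style
# document. The field names here are made up for the example.
nested = {
    "PatientID": "TCGA-XX-0001",
    "dimensions": {"rows": 512, "cols": 512},   # nested structure
}

def flatten(record, prefix=""):
    """Collapse nested dicts into dotted field names for a flat doc."""
    flat = {}
    for key, value in record.items():
        name = prefix + key
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value
    return flat

solr_doc = flatten(nested)
print(json.dumps(solr_doc, sort_keys=True))
```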

fedorov commented 4 years ago

@G-White-ISB can you point me to the location where BQ queries are defined?

s-paquette commented 4 years ago

@fedorov The queries aren't defined, they're built based on the selected filters and available tables. The code that does this for faceted counts is here: https://github.com/ImagingDataCommons/IDC-Common/blob/master/idc_collections/collex_metadata_utils.py#L78

And the code which does it for the list of files is here: https://github.com/ImagingDataCommons/IDC-Common/blob/master/idc_collections/collex_metadata_utils.py#L268

I'm going to test adjusting the faceted count query to see if it can be sped up at all, but because we're already parallelizing the requests, the single longest job (and the time it takes to get the data response from Google) is the rate-limiting step. So I don't expect to see an improvement much below about a 25s turnaround. (It's about 30s right now, though some of that will be API call overhead, so if I can shrink the call count we could save a little.)
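A toy model of the parallel pattern described above: one count query per facet runs concurrently, so total wall time is roughly the slowest single job rather than the sum. The query function here is a sleep-based stub standing in for a BQ job round trip; none of this is the actual IDC code:

```python
import concurrent.futures
import time

# Stub standing in for a per-facet BQ count query; the sleep models the
# job-submission round trip that dominates latency.
def run_facet_count(facet, seconds):
    time.sleep(seconds)
    return facet, {"CT": 10, "MR": 4}

jobs = {"Modality": 0.3, "BodyPartExamined": 0.1, "collection_id": 0.2}

start = time.time()
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = [pool.submit(run_facet_count, f, s) for f, s in jobs.items()]
    results = dict(f.result() for f in futures)
elapsed = time.time() - start

# elapsed is close to max(jobs.values()), not sum(jobs.values()),
# which is why the slowest job is the rate-limiting step.
print(results, round(elapsed, 1))
```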

G-White-ISB commented 4 years ago

Could one track down the actual queries from the BigQuery logs on Google's end? Right now we can assume there are not too many people using the dev site at one time. -- George


s-paquette commented 4 years ago

Yes, though it'd require using the BigQuery API. I don't think that you can pull up API-submitted query information in the console.

fedorov commented 4 years ago

Thank you for the pointers Suzanne, that's what I was looking for.

> Yes, though it'd require using the BigQuery API. I don't think that you can pull up API-submitted query information in the console.

I actually think you can - as an example, I can see the Data Studio issued queries in the log. But the issue is that I do not have the permissions to see queries at the project level. Maybe we can check with @wlongabaugh if he can see those queries?

[screenshot]

s-paquette commented 4 years ago

Data Studio is recording personal queries, I think, not queries run by a Service Account. If I log in to my owner account on the project, no project queries are listed:

[screenshot]

It's possible there's a way to view Service Account jobs in the console, but I've not found it.

fedorov commented 4 years ago

@s-paquette would it be difficult, and would it make sense, to log the BQ query string in the console every time a query is issued? I can't think of how this could hurt, and it could be quite helpful, I think, for debugging and at least trying to optimize the query. What do you think?

s-paquette commented 4 years ago

By console do you mean the logs? At the moment, that would be a fairly large amount of logging--we run a separate count query for each facet, plus the document listing query. I can probably collapse it into just two (still investigating that), but it would still be a rather large log output every time someone changes a filter.

We do print them when we need to debug, but there's not much optimizing to really do; the SQL count queries amount to:

    #standardSQL
    SELECT dicom_metadata.Modality AS Modality, COUNT(DISTINCT dicom_metadata.PatientID) AS count
    FROM `idc-dev-etl.tcia.dicom_metadata` dicom_metadata 

    WHERE ((LOWER(dicom_metadata.BodyPartExamined) IN UNNEST(@BodyPartExamined_0)) OR @BodyPartExamined_0_filtering = 'not_filtering') AND ((LOWER(dicom_metadata.collection_id) IN UNNEST(@collection_id_0)) OR @collection_id_0_filtering = 'not_filtering')
    GROUP BY Modality

In the case of a continuous numeric range, like BMI or age_at_diagnosis:

    #standardSQL
    SELECT (CASE WHEN clinical.bmi < 18.5 THEN 'underweight' WHEN clinical.bmi > 30 THEN 'obese' WHEN clinical.bmi BETWEEN 18.5 AND 25 THEN 'normal weight' WHEN clinical.bmi BETWEEN 25 AND 30 THEN 'overweight' WHEN clinical.bmi IS NULL THEN 'none' END) AS bmi, COUNT(DISTINCT dicom_metadata.PatientID) AS count
    FROM `idc-dev-etl.tcia.dicom_metadata` dicom_metadata 

    JOIN `isb-cgc.TCGA_bioclin_v0.Clinical` clinical
    ON clinical.case_barcode = dicom_metadata.PatientID

    WHERE ((LOWER(dicom_metadata.BodyPartExamined) LIKE LOWER(@BodyPartExamined_0)) OR @BodyPartExamined_0_filtering = 'not_filtering') AND ((LOWER(dicom_metadata.collection_id) IN UNNEST(@collection_id_0)) OR @collection_id_0_filtering = 'not_filtering')
    GROUP BY bmi
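The `UNNEST(@...)` pattern in the queries above takes array parameters, which must be supplied alongside the query in BigQuery's structured parameter format rather than inline in the SQL. A sketch of that payload, in the shape of the `queryParameters` field of BigQuery's REST API (the parameter names follow the queries above; the values are made up):

```python
import json

# Sketch of the array-parameter payload the UNNEST(@...) pattern needs,
# shaped like BigQuery's REST API queryParameters field. This is why the
# parameterized queries can't simply be pasted into the BQ console.
def array_param(name, values):
    return {
        "name": name,
        "parameterType": {"type": "ARRAY", "arrayType": {"type": "STRING"}},
        "parameterValue": {"arrayValues": [{"value": v} for v in values]},
    }

params = [
    array_param("BodyPartExamined_0", ["chest"]),
    array_param("collection_id_0", ["tcga_luad", "tcga_lusc"]),
]
print(json.dumps(params, indent=2))
```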
fedorov commented 4 years ago

Thank you @s-paquette! This is very helpful. I do think there may be opportunities to optimize though.

First, you are querying the metadata table in the idc-dev-etl:tcia dataset. This is overkill, since the dicom_metadata table in that dataset contains all of the TCIA collections that were hosted by GHC at the time the table was populated. We unnecessarily penalize performance by querying data that is not going to be included in the MVP. Should we switch over to the tables discussed here sometime soon: https://github.com/ImagingDataCommons/IDC-ProjectManagement/issues/492#issuecomment-621470566 ?

Second, I think it might make sense to exploit the organization of the data and pre-generate tables that summarize facets of our interest at the level of patient/study/series. This could reduce the amount of data that would need to be queried by orders of magnitude - right now the data is extremely redundant for the purposes of our queries: we have one row for each instance, which means for a CT chest scan we can have hundreds of rows, where all of the facets we care about will have identical values, while our cohorts are built at the case level.
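One way the pre-aggregation suggested above could look, sketched in the style of the queries earlier in this thread (the summary table name is hypothetical, and the column set would need to cover whichever facets the UI counts on):

```sql
-- Hypothetical series-level summary table: collapse the per-instance
-- rows down to one row per series for the facets of interest.
CREATE TABLE `idc-dev-etl.tcia.dicom_series_summary` AS
SELECT
  PatientID,
  StudyInstanceUID,
  SeriesInstanceUID,
  ANY_VALUE(Modality) AS Modality,
  ANY_VALUE(BodyPartExamined) AS BodyPartExamined,
  ANY_VALUE(collection_id) AS collection_id
FROM `idc-dev-etl.tcia.dicom_metadata`
GROUP BY PatientID, StudyInstanceUID, SeriesInstanceUID;
```

Since the facet values are identical across all instances of a series, `ANY_VALUE` is sufficient here, and the count queries could then run against a table with far fewer rows.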

Can you give me a complete set of queries that are being run for some sample search? I just would like to have the exact queries to compare performance. For the UI, maybe we could have some kind of a developer toggle to get the BQ query? I think the end users might also be interested in that if they want to reproduce their search outside of the IDC cohort builder.

I am sorry if this sounds like I am trying to be a PIA, and I don't mean to add a lot of extra work for you. I just wanted to try a bit to look into what we can actually squeeze out of BQ in a meaningful way. I hope this is helpful. Let me know if it is easier to have a call to discuss this.

s-paquette commented 4 years ago

We can switch over to one of the other tables, absolutely. That's a simple change as long as the table has the columns we want. Note, though, that additional columns don't impact BQ performance, because BQ is column-based in its distribution. This means that at query time it grabs only the columns you care about--which are very few, in our case, compared to the whole table--and then proceeds to the next part of the query. As a result, table width has a minimal performance impact.

We can get an improvement out of, as you said, case-collapsed tables. This is how it's set up in ISB-CGC, so if we can set up some abbreviated tables, that would reduce the amount of data to process. However, the improvement is likely to be only a couple of seconds at most. The real latency comes from submitting the jobs and waiting for them to come back, and BQ is inherently latent. That's simply how it's designed.

A toggle to print the queries into the logs would be easy enough; to return them to the UI would require a fair amount more effort. They also might not be useful to the users, as they're parametrized queries, meant to be run via an API call, and the BQ parameter structure is non-trivial JSON which can't be submitted via the console.

Ultimately the real issue is faceted counting for a UI isn't performant in a SQL system, and BigQuery wasn't designed to back a responsive UI on top of that. It's a good analytical engine, which is really where we want to use it; we can provide tables, views, and queries to users which will help them get used to using it on their own. Google might get it there eventually for UI purposes, but for now, it's not the best option in terms of user experience.

And I'm happy to do a meet to go over more of this, let me know when some good times are.

fedorov commented 4 years ago

> That's a simple change as long as the table has the columns we want. Note, though, that additional columns don't impact BQ performance, because BQ is column-based in its distribution.

Yes, I understand. The table I am talking about has far fewer rows: the one you use right now has 9,803,590 rows, and the one I suggest we switch to has 2,647,993.

> A toggle to print the queries into the logs would be easy enough; to return them to the UI would require a fair amount more effort. They also might not be useful to the users, as they're parametrized queries, meant to be run via an API call, and the BQ parameter structure is non-trivial JSON which can't be submitted via the console.

Yes, I noticed that, and I was wondering about it. I assumed (since you use UNNEST) that those are arrays, but I definitely need to learn more.

I get your arguments, but even if (very likely) we won't get comparable performance with BQ, at the very least we will have a fair comparison, and we might be able to optimize the Solr data organization as well. An interface can never be too fast, I think.

Can you slack me to try to discuss a bit more and see if it makes sense to have a call?

fedorov commented 4 years ago

As discussed with @s-paquette, it should be quite easy to add a button to the portal that would allow the user to get the non-parameterized BQ query corresponding to the current cohort. @s-paquette, should this be in a separate issue, or should we keep the current one open?