DataBiosphere / data-explorer


Export to Saturn changes requested by Matt #289

Closed melissachang closed 5 years ago

melissachang commented 5 years ago

Matt requested some changes to make the experience smoother for AMP PD researchers.

Don't export samples and sample sets

Samples are confusing: why are we exporting all samples when I selected a cohort?

Sample sets are confusing: why are sample sets materialized when cohorts are not (a cohort is just a SQL query)?

We materialized sample sets so that users can run workflows in Saturn. Initially, there will not be an emphasis on running workflows for AMP PD. Later on, if AMP PD researchers want to run workflows, we can revisit. We could add a checkbox to the cohort name dialog like "Export to a sample set for running workflows?"

Continue to export BigQuery tables. Shorten BigQuery_table_id

The long-term plan for cohorts: a cohort represents a table with specific rows (selected by WHERE clauses) and columns (drawn from the tables in the FROM clauses). The SQL query returns these rows and columns. A cohort is associated with the set of BigQuery tables in its FROM clauses. For example, a cohort might contain only columns from a Demographics table.

For now, a cohort is simply a set of participant ids: WHERE clauses with no FROM clauses. The SQL query returns only a list of participants, with no other columns. In a notebook, users will join the set of participant ids with whatever tables and columns they're interested in. The BigQuery table entities list the available tables to join against.
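As a rough illustration of the notebook join described above, here is a minimal sketch that wraps a cohort's SQL query in a join against one of the exported tables. The helper name `build_join_query`, the `participant_id` column, and the `Gender` filter are assumptions made for this example; they are not taken from the Data Explorer code.

```python
def build_join_query(cohort_sql: str, table_name: str,
                     id_column: str = "participant_id") -> str:
    """Return SQL that joins a cohort (a query yielding participant ids)
    with one fully qualified BigQuery table.

    id_column is an assumed name; use whatever id column the dataset defines.
    """
    return (
        f"SELECT t.*\n"
        f"FROM `{table_name}` AS t\n"
        f"JOIN ({cohort_sql}) AS cohort\n"
        f"  ON t.{id_column} = cohort.{id_column}"
    )


# Hypothetical cohort query: the filter column is a placeholder.
cohort_sql = (
    "SELECT DISTINCT participant_id "
    "FROM `verily-public-data.human_genome_variants.1000_genomes_participant_info` "
    "WHERE Gender = 'female'"
)

query = build_join_query(
    cohort_sql,
    "verily-public-data.human_genome_variants.1000_genomes_sample_info",
)
print(query)
```

The resulting string could then be passed to whichever BigQuery client the notebook already uses.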

Change BigQuery_table_id from verily-public-data_human_genome_variants_1000_genomes_participant_info to 1000_genomes_participant_info.

Background: Currently we export:

BigQuery_table_id                                                       table_name
verily-public-data_human_genome_variants_1000_genomes_participant_info  verily-public-data.human_genome_variants.1000_genomes_participant_info
verily-public-data_human_genome_variants_1000_genomes_sample_info       verily-public-data.human_genome_variants.1000_genomes_sample_info

The two columns are redundant. Ideally there would just be one column:

BigQuery_table_id
verily-public-data.human_genome_variants.1000_genomes_participant_info
verily-public-data.human_genome_variants.1000_genomes_sample_info

But that wasn't possible because entity names can't contain ".". (I believe this will be possible with the new entity service.) Entity attributes can contain ".", so the table_name attribute holds the correct fully qualified BigQuery table name.

Matt suggested this, which looks cleaner:

BigQuery_table_id              table_name
1000_genomes_participant_info  verily-public-data.human_genome_variants.1000_genomes_participant_info
1000_genomes_sample_info       verily-public-data.human_genome_variants.1000_genomes_sample_info
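The shortened id is just the last component of the fully qualified table name. A minimal sketch of that derivation follows; the function names are hypothetical and this is not the actual export code.

```python
def short_table_id(table_name: str) -> str:
    """Return the table part of a 'project.dataset.table' name.

    Raises ValueError if the name does not have exactly three parts.
    """
    _project, _dataset, table = table_name.split(".")
    return table


def table_entity_rows(table_names):
    """Build (BigQuery_table_id, table_name) pairs for export."""
    return [(short_table_id(name), name) for name in table_names]


rows = table_entity_rows([
    "verily-public-data.human_genome_variants.1000_genomes_participant_info",
    "verily-public-data.human_genome_variants.1000_genomes_sample_info",
])
for entity_id, table_name in rows:
    print(entity_id, table_name, sep="\t")
```

One thing to watch with this scheme: two tables with the same name in different BigQuery datasets would map to the same entity id, so an export covering several datasets may need a uniqueness check.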

Add dataset_name column to BigQuery table and cohort entities

In the future, one could export multiple datasets into the same workspace. It would be nice to have a dataset_name column so that, for example, one knows which BigQuery tables are the AMP PD tables to join with an AMP PD cohort. Let's use the dataset name from dataset.json.
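A sketch of how the dataset_name column might be filled in, assuming dataset.json carries the dataset name under a top-level "name" key. That key, the helper names, and the row layout are assumptions for illustration only.

```python
import json


def read_dataset_name(dataset_json_text: str) -> str:
    """Read the dataset name, assuming a top-level "name" key."""
    return json.loads(dataset_json_text)["name"]


def add_dataset_name(rows, dataset_name):
    """Return copies of the entity rows with a dataset_name column added."""
    return [dict(row, dataset_name=dataset_name) for row in rows]


# Hypothetical dataset.json content and entity rows.
dataset_json_text = '{"name": "1000 Genomes"}'
table_rows = [
    {
        "BigQuery_table_id": "1000_genomes_participant_info",
        "table_name": "verily-public-data.human_genome_variants.1000_genomes_participant_info",
    },
]

tagged_rows = add_dataset_name(table_rows, read_dataset_name(dataset_json_text))
print(tagged_rows)
```

The same helper would apply to cohort entities, so that a cohort and the tables it can be joined with share one dataset_name value.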

Always show cohort name dialog

Current: we only show the cohort name dialog if a cohort was selected.

New: If no cohort is selected, show the cohort name dialog with the name "all participants". The user can edit this name if they want. This cohort's SQL query returns all participants.
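The new behavior could be sketched roughly as follows. The function names, the empty initial name when a cohort is selected, the `participant_id` column, and the filter strings are all assumptions; the actual UI and query builder may differ.

```python
DEFAULT_COHORT_NAME = "all participants"


def initial_cohort_name(filters) -> str:
    """Name shown when the dialog opens.

    With no filters selected, prefill the default name; otherwise leave the
    field empty for the user to fill in (the empty value is an assumption).
    """
    return DEFAULT_COHORT_NAME if not filters else ""


def cohort_query(participant_table: str, filters,
                 id_column: str = "participant_id") -> str:
    """Build the cohort SQL; with no filters it returns all participants."""
    query = f"SELECT DISTINCT {id_column} FROM `{participant_table}`"
    if filters:
        query += " WHERE " + " AND ".join(filters)
    return query


table = "verily-public-data.human_genome_variants.1000_genomes_participant_info"
print(initial_cohort_name([]))
print(cohort_query(table, []))
print(cohort_query(table, ["Gender = 'female'"]))  # hypothetical filter
```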

melissachang commented 5 years ago

The changes have been deployed. @mbookman, let us know if you have any other feedback.