Need functionality that facilitates cross-study analysis

satagopam7 commented 4 years ago

We need some functionality both frontend and backend to support cross-study data pooling and/or comparison. I can provide more details if needed.

sherzinger commented 4 years ago

I thought about this a bit yesterday evening. Given the current tools and structure of Ada, I can propose this approach:

We give the user the possibility to merge two datasets into a third one. This could be done via a simple form where the user can select DataSet1, DataSet2, and the field name to use for merging (e.g. sampleID). Of course this needs to be properly designed so the user understands "why" and "how", without additional training.
In the background we use an existing or new (should be easy to implement) data transformation to achieve this. It is however important that we create a new field "source_data_set".
The user now has the option to create views, charts, or analyses using the "source_data_set" field if separation is needed (e.g. an age box plot).

This is technically the most straightforward approach I can think of.
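
A minimal sketch of this pooling step, written in Python rather than against Ada's code base, and assuming each data set is simply a list of dicts. The function names `pool_data_sets` and `group_values` and the sample records are hypothetical; only the "source_data_set" field comes from the proposal above. For simplicity it stacks records instead of joining them on a key such as sampleID.

```python
from collections import defaultdict
from statistics import median


def pool_data_sets(data_sets):
    """Stack records from several data sets, tagging each with its origin."""
    pooled = []
    for name, records in data_sets.items():
        for record in records:
            # Copy the record so the source data sets stay untouched.
            pooled.append({**record, "source_data_set": name})
    return pooled


def group_values(records, field, by="source_data_set"):
    """Collect the values of `field` per group, skipping missing values."""
    groups = defaultdict(list)
    for record in records:
        if record.get(field) is not None:
            groups[record[by]].append(record[field])
    return dict(groups)


# Hypothetical example data.
study_a = [{"sampleID": "A1", "age": 61}, {"sampleID": "A2", "age": 70}]
study_b = [
    {"sampleID": "B1", "age": 55},
    {"sampleID": "B2", "age": 58},
    {"sampleID": "B3", "age": 49},
]

pooled = pool_data_sets({"study_a": study_a, "study_b": study_b})
ages = group_values(pooled, "age")
medians = {source: median(values) for source, values in ages.items()}
```

Grouping by "source_data_set" is what would let a box plot show one box per study while everything else keeps operating on a single data set.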

satagopam7 commented 4 years ago

Yes, this goes more or less in the same direction. It is very similar to views in an RDBMS (Oracle, PostgreSQL): users can create different pooled datasets (analogous to views) derived from two or more source datasets. This is 'cross-study data pooling'. The other case that also needs to be addressed is 'cross-study comparison': no pooling here, but the studies need to be compared.

— Dr. Venkata Satagopam Bioinformatics Core Luxembourg Centre For Systems Biomedicine (LCSB) University of Luxembourg Campus Belval, House of Biomedicine II 6, avenue du Swing L-4367 Belvaux

T +352-466-644-6421 F +352-466-644-36421 venkata.satagopam@uni.lu or satagopam@gmail.com http://lcsb.uni.lu

sherzinger commented 4 years ago

Hey Venkata,

Given the current design of Ada, comparing datasets without merging them into a third one could take considerable effort. The views, tree, analyses, dictionaries, filters, etc. are all centred around the currently selected data set. Pulling (part of) another dataset into these features would require design (and probably architectural) changes to all of them.

We could, however, think about making the merging process invisible to the user. I was thinking about a tab/button "Compare Datasets" that, when clicked, lets you pick two datasets. We could also add a "Stop comparison" button, which would delete the merged dataset. The user would not even know that they are operating on a new data set.
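
The lifecycle I have in mind could be sketched roughly as follows, assuming an in-memory dict as a stand-in for the data set store. The names `STORE` and `comparison` and the temporary naming scheme are hypothetical, and a real web UI would tie creation and deletion to the two buttons rather than to a `with` block.

```python
from contextlib import contextmanager

# Hypothetical in-memory stand-in for the data set store.
STORE = {
    "study_a": [{"age": 61}],
    "study_b": [{"age": 55}, {"age": 58}],
}


@contextmanager
def comparison(store, first, second):
    """Create a temporary merged data set and delete it when done."""
    temp_name = f"__compare__{first}__{second}"
    store[temp_name] = [
        {**record, "source_data_set": name}
        for name in (first, second)
        for record in store[name]
    ]
    try:
        yield temp_name  # corresponds to clicking "Compare Datasets"
    finally:
        del store[temp_name]  # corresponds to clicking "Stop comparison"


with comparison(STORE, "study_a", "study_b") as merged:
    merged_size = len(STORE[merged])
```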

satagopam7 commented 4 years ago

Hi Sascha

I'm aware that this needs quite some effort. It needs some discussion. Let's catch up.

peterbanda commented 4 years ago

For cross-study comparison there are essentially three options:

1. Merge transformation - This is exactly the option Sascha pointed out. It is currently supported and works out of the box (although it is allowed only for admins). The transformation can be applied to any number of data sets (not just two), with compatibility checked by field type. There are two flavors: a) all fields are merged automatically by name (not every field needs to be present in each data set; some may be unique to one), or b) the fields are linked manually by name. Regardless of the flavor, the result is a new data set with its own dictionary, views, filters, etc.
2. Data source virtualization - This is something I have presented a couple of times. It would combine several repo sources into one and effectively hide them. The resulting union would again implement the CRUD repo interface, so it would be pluggable wherever a single data source is currently supported. This has already been reported at #4. The main implementation problem would be sorting (e.g. for box plots); the offset and limit operators could also be tricky. For Akka streaming, the best order-preserving approach would be to employ the mergeSorted function, which is currently used for the optimized linking transformation (not yet released).
3. Multi-source visualization - We could of course allow different widgets (charts) in a single view to be generated from different sources, which would mean integration at the visualization level (as was done by Fractalis). This could be supported rather quickly (some ad-hoc experiments along these lines worked), but the resulting artifacts/views would not fit into the single data set abstraction. Therefore new metadata, tree nodes/types, controllers, and permissions would need to be introduced (somewhat ad hoc). Also, to allow a closer comparison of field values from different data sets (studies), a multi-field distribution widget would be quite handy (low-hanging fruit). Already reported at #50.
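
To illustrate why sorting, offset, and limit are the tricky parts of data source virtualization: if every underlying source can return its records already sorted, the union can be produced lazily with a k-way sorted merge, and offset and limit are then applied to the merged stream rather than to each source. The sketch below uses Python's `heapq.merge` as a stand-in for Akka Streams' `mergeSorted`; the `VirtualSource` class and its `find` method are hypothetical and are not Ada's actual repo interface.

```python
import heapq
from itertools import islice
from operator import itemgetter


class VirtualSource:
    """Presents several sources as one (hypothetical interface)."""

    def __init__(self, sources):
        self.sources = sources  # name -> list of records

    def find(self, sort_by, offset=0, limit=None):
        key = itemgetter(sort_by)
        # Each source sorts its own records; a real repo would push this
        # down to the database instead of sorting in memory.
        streams = [sorted(records, key=key) for records in self.sources.values()]
        merged = heapq.merge(*streams, key=key)  # lazy k-way sorted merge
        # Offset and limit must be applied after merging, not per source.
        stop = None if limit is None else offset + limit
        return list(islice(merged, offset, stop))


union = VirtualSource({
    "study_a": [{"age": 70}, {"age": 61}],
    "study_b": [{"age": 58}, {"age": 49}, {"age": 55}],
})
page = union.find("age", offset=1, limit=3)
```

Applying the offset per source instead would skip the wrong records, which is why these operators cannot simply be forwarded to the underlying repos.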

Naturally, as is probably implied, proper harmonization is expected for all the presented options. Moreover, solutions 1 and 3 don't necessarily require matching fields to have the same names, whereas solution 2 most likely would (to keep the implementation clean).
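
A rough sketch of how a merge with manually linked fields and a type compatibility check might look, assuming each data set carries a dictionary that maps field names to types. All names here (`merge_linked`, the `links` mapping, the type labels, the sample data) are hypothetical and do not reflect Ada's actual transformation API.

```python
def merge_linked(data_sets, links):
    """Merge data sets whose matching fields may be named differently.

    data_sets: name -> {"dictionary": {field: type}, "records": [dict, ...]}
    links:     target field -> {data set name: source field}
    """
    # Compatibility check: all fields linked to one target must share a type.
    merged_dictionary = {}
    for target, sources in links.items():
        types = {data_sets[ds]["dictionary"][field] for ds, field in sources.items()}
        if len(types) != 1:
            raise TypeError(f"incompatible types for '{target}': {sorted(types)}")
        merged_dictionary[target] = types.pop()

    merged_records = []
    for ds_name, ds in data_sets.items():
        for record in ds["records"]:
            row = {"source_data_set": ds_name}
            for target, sources in links.items():
                # A field may be absent from a data set; leave it as None.
                source_field = sources.get(ds_name)
                row[target] = record.get(source_field) if source_field else None
            merged_records.append(row)
    return merged_dictionary, merged_records


# Hypothetical example: "age" is named differently in the two studies.
data_sets = {
    "study_a": {"dictionary": {"age": "Integer", "sex": "Enum"},
                "records": [{"age": 61, "sex": "F"}]},
    "study_b": {"dictionary": {"age_years": "Integer"},
                "records": [{"age_years": 55}]},
}
links = {
    "age": {"study_a": "age", "study_b": "age_years"},
    "gender": {"study_a": "sex"},
}
dictionary, records = merge_linked(data_sets, links)
```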

sherzinger commented 4 years ago

Hi @peterbanda,

Could you show us (maybe in the meeting next week?) what you did with 2.? I don't think I've seen that yet.

Regarding option 3: this is exactly the type of issue I was referring to further up, albeit in less technical language. Technically, injecting some data into a widget is relatively easy, as you mentioned. The problems come from everything else:

How do filters work in this case?
Do we have to prefix every single field with the source dataset, in order to show to the user where the field came from?
What happens when you update the view by clicking on the widgets?
What happens when you update the view by clicking on the foreign study field e.g. in the pie chart?
Will it still be possible to save a view?
How do we indicate everywhere in the UI which studies the filters, fields, and views belong to?
What happens if a multi-dataset view is saved (somehow) and access to the dataset is removed?
How do you specify filters for the fields you pull in from the other dataset?
Do we need to modify every single widget, such that they can account for the new dimension "source study" alongside e.g. "gender"?

These are just some of the questions that came to my mind, and this is largely just UI design. As you correctly pointed out, this would also need to be addressed on an architectural level in many places.

Maybe limiting option 3 to single analyses/charts (not within a view!) would be doable in a reasonable amount of time if that satisfies the requirements?

And just to underline the fact: option 1 is already there. We can compare datasets. It just needs to be wrapped in a user-friendly interface.

sherzinger commented 4 years ago

Note: I discussed the issue with Venkata and I think we came to an agreement. I'll prepare a mockup for the next meeting, so we can talk about it in detail.
