MuSiC Tutorial - Githubissues

mtekman commented 2 years ago

This is a work-in-progress of a MuSiC tutorial for bulk RNA deconvolution

@nomadscientist

TODO:

[x] Update tool links to latest music_deconvolution version after tool upgrade

mtekman commented 2 years ago

@nomadscientist First full version of tutorial is finished. Please review and improve!

nomadscientist commented 2 years ago

> @nomadscientist First full version of tutorial is finished. Please review and improve!

You're amazing!

nomadscientist commented 2 years ago

@mtekman Do you think we should keep the 'import data via data library' thing? I generally delete those on my tutorials. Is there a reason to keep it?

nomadscientist commented 2 years ago

How were the datasets initially created? i.e. are they in a format that is an output of a Galaxy tutorial, or something we could pull directly from EBI, for instance?

nomadscientist commented 2 years ago

[ ] Workflow for generating scRNA input
[ ] Workflow for generating bulk RNA input

nomadscientist commented 2 years ago

Beautiful question / inspecting data section, really guides the user from AHH OVERWHELMED to OH THIS IS FINE. Nice work there

nomadscientist commented 2 years ago

@mtekman Construct Expression Set Object not working - https://humancellatlas.usegalaxy.eu/u/wendi.bacon/h/deconvolution-cell-type-inference-of-human-pancreas-data

mtekman commented 2 years ago

> @mtekman Do you think we should keep the 'import data via data library' thing? I generally delete those on my tutorials. Is there a reason to keep it?

I think they're automatically there because I guess different Galaxy instances might have the data hosted locally (e.g. for workshops). I say leave it, since it's pretty much in all tutorials.

mtekman commented 2 years ago

> How were the datasets initially created?

The input datasets are the raw count matrices and phenotype table pulled out of the RData objects from the example datasets (https://zenodo.org/record/5554814).

There's a section in the tutorial exploring the datasets where we discuss how the input data is structured, and how the count matrices are related to the phenotype tables.

This could maybe be made a little clearer -- all that really needs to be said there is that we just require two tables: expression counts, and a phenotype table describing the samples.
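
To make that relationship concrete, here is a minimal Python sketch with invented values: the count matrix has one column per sample, and the phenotype table has one row per sample, keyed by the same sample names. The gene names, sample names and the `check_pair` helper are hypothetical and are not part of the MuSiC tools.

```python
# Toy expression counts: genes as rows, samples as columns (values are invented).
counts = {
    "GeneA": {"Sub1": 12, "Sub2": 0, "Sub3": 7},
    "GeneB": {"Sub1": 3, "Sub2": 25, "Sub3": 1},
}

# Toy phenotype table: one row per sample, describing that sample.
phenotypes = {
    "Sub1": {"age": 57, "hba1c": 5.8},
    "Sub2": {"age": 64, "hba1c": 7.1},
    "Sub3": {"age": 49, "hba1c": 6.0},
}

def check_pair(counts, phenotypes):
    """Return True when every gene row covers exactly the samples in the phenotype table."""
    samples = set(phenotypes)
    return all(set(row) == samples for row in counts.values())

print(check_pair(counts, phenotypes))
```

If a sample is missing from either table, `check_pair` returns `False`, which is the kind of mismatch worth catching before building the R objects.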

> i.e. are they in a format that is an output of a Galaxy tutorial, or something we could pull directly from EBI, for instance?

Yes, and yes. Both the count matrices and the phenotypes should be able to be extracted from the AnnData objects using Galaxy tools (in the case of the single cell data). For the bulk data, I think this is already in tabular format.

mtekman commented 2 years ago

> Beautiful question / inspecting data section, really guides the user from AHH OVERWHELMED to OH THIS IS FINE. Nice work there

Hehe! If you know of a way to ease this transition more, then please make the required changes. I was describing it in a way that made sense to me, but my thinking is often at complete odds to others :P

nomadscientist commented 2 years ago

> > Beautiful question / inspecting data section, really guides the user from AHH OVERWHELMED to OH THIS IS FINE. Nice work there
>
> Hehe! If you know of a way to ease this transition more, then please make the required changes. I was describing it in a way that made sense to me, but my thinking is often at complete odds to others :P

LOL - this was nice and academic of you, I give you a compliment and you turn it into a critique

mtekman commented 2 years ago

> @mtekman Construct Expression Set Object not working - https://humancellatlas.usegalaxy.eu/u/wendi.bacon/h/deconvolution-cell-type-inference-of-human-pancreas-data

It looks like it's still using the old version of the tool (0.1.1+galaxy0), which had some issues, instead of the new version (0.1.1+galaxy1), so this is expected.

We need to refresh this. It was hosted under the "Testing Tools" subsection of the tool list, but it doesn't seem to be there anymore.

@bgruening Any chance of a refresh for the MuSiC tools? They run fine locally on my machine, and they're ready to replace the current broken version.

bgruening commented 2 years ago

It's installed now; normal updates of all tools are running this weekend.

nomadscientist commented 2 years ago

This bit is not letting me select a dataset... nor is it automatically picking up the Rdatasets...

nomadscientist commented 2 years ago

Also the tools look doubled?

nomadscientist commented 2 years ago

Same issue here - even new tool fails to let me select inputs

nomadscientist commented 2 years ago

OK so I got stuck on the tutorial tools not working, so I turned to looking at the dataset generation workflows

With the scRNA-seq data production, I see how you can download it from ArrayExpress and it works fine. BUT, what if they are using their own dataset? Then it would be processed (i.e. not integer counts), and I can't figure out how to create an integer-count matrix from a scRNA-seq AnnData object. Do you know how to do this? Here's the answer key scRNA-seq history: https://docs.galaxyproject.org/en/latest/releases/20.01_announce_user.html
With the bulk RNA-seq data production, I looked at its repository here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50244. But none of those files have integers. How was the dataset for the tutorial created? The MUSIC original vignette doesn't actually say. The good news here, however, is that the Galaxy RNA-seq reads to counts tutorial gives the data in the exact correct format, (with some minor adding of columns to the sample information table), so that's a massive win!

mtekman commented 2 years ago

> This bit is not letting me select a dataset... nor is it automatically picking up the Rdatasets...
>
> Same issue here - even new tool fails to let me select inputs

Yep - for now you need to physically drag and drop it from the history into the input slot. It's a known bug that should be fixed in the next release when the rdata.eset datatype is included in Galaxy.

mtekman commented 2 years ago

> Also the tools look doubled?

Yep, also temporary - some issues with Galaxy at the moment. One of the doubled tools should be the correct version, and one should be the older version.

It's nothing to worry about, it'll be gone by the next workshop

mtekman commented 2 years ago

> OK so I got stuck on the tutorial tools not working, so I turned to looking at the dataset generation workflows
>
> 1. With the scRNA-seq data production, I see how you can download it from ArrayExpress and it works fine. BUT, what if they are using their own dataset? Then it would be processed (i.e. not integer counts), and I can't figure out how to create an integer-count matrix from a scRNA-seq AnnData object. Do you know how to do this? Here's the answer key scRNA-seq history: https://docs.galaxyproject.org/en/latest/releases/20.01_announce_user.html

Integer or decimal counts can both be used AFAIK. You can extract the full raw integer count matrix from an AnnData object in Galaxy via the Inspect AnnData tool

![Screenshot 2021-12-22 at 12-08-19 Galaxy Europe](https://user-images.githubusercontent.com/20641402/147083999-2d9f92df-c1d6-4e55-8332-4d1a8cf58f61.

You can also extract the normalised matrix using the same tool (I think)

> 2. With the bulk RNA-seq data production, I looked at its repository here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE50244. But none of those files have integers. How was the dataset for the tutorial created?
>    <img alt="Screenshot 2021-12-22 at 09 52 30" width="481" src="https://user-images.githubusercontent.com/44605769/147073320-d54d2111-b25d-4c13-b725-a405a24e7ed4.png">
> The MUSIC original vignette doesn't actually say. The good news here, however, is that the Galaxy RNA-seq reads to counts tutorial gives the data in the exact correct format, (with some minor adding of columns to the sample information table), so that's a massive win!

I'm not sure the integer counts are necessary, I think MuSiC can work on both normalised and unnormalised matrices. All the RData datasets in the tutorials were generated using a count matrix and phenotype table pair to create the R objects for the rest of the tutorial, but it's literally just those two plain-text tabular inputs in a compact form.

MMmm maybe we might need another tutorial on how to structure phenotype data for MuSiC to work, but for the most part MuSiC just expects a count matrix and some accompanying table describing the samples. I think it should just work with everything

nomadscientist commented 2 years ago

> > This bit is not letting me select a dataset... nor is it automatically picking up the Rdatasets...
>
> > Same issue here - even new tool fails to let me select inputs
>
> Yep - for now you need to physically drag and drop it from the history into the input slot. It's a known bug that should be fixed in the next release when the rdata.eset datatype is included in Galaxy.

Drag and drop is not working for me - does it work for you?

nomadscientist commented 2 years ago

Inspect AnnData only lets you extract the full data matrix, which I believe does not default to raw.

However, this matters less if non-integers work, that's brilliant news!!! I will have a play with this when the tools are working to verify for my own brain :)

I think that's great, my issue is, we can't actually say how the data was generated (which is a problem in the original vignette, they must have done SOMETHING to the raw data, because what is currently available is not in the integer form used in the tutorial). I might make a proof of principle history (in January) to show how all 3 tutorials (RNA-seq, scRNA-seq, and MUSIC) can work together... But I can add this after the tutorial is published, this concern is minor and definitely not necessary to solve before publishing this tutorial.

mtekman commented 2 years ago

> Drag and drop is not working for me - does it work for you?

Yep, works for me!

> Inspect AnnData only lets you extract the full data matrix, which I believe does not default to raw.

Oh whoops you're right, I guess the raw is only saved in some backup slot, but I think that can be accessed via the uns selector in the Inspect AnnData tool

> However, this matters less if non-integers works, that's brilliant news!!! I will have a play with this when the tools are working to verify for my own brain :)

Okay cool!

> I think that's great, my issue is, we can't actually say how the data was generated (which is a problem in the original vignette, they must have done SOMETHING to the raw data, because what is currently available is not in the integer form used in the tutorial).

Oh I see the problem.... but is that really an issue? This is a tool that will be used well after pre-processing and after the downstream analysis, so the data will likely be modified many times over before it gets to the MuSiC stage, methinks

> I might make a proof of principle history (in January) to show how all 3 tutorials (RNA-seq, scRNA-seq, and MUSIC) can work together... But I can add this after the tutorial is published, this concern is minor and definitely not necessary to solve before publishing this tutorial.

Yep - that would be a fantastic cradle-to-grave workflow, could be used as a standalone workshop!

nomadscientist commented 2 years ago

Such a great demonstration of why tags are awesome

nomadscientist commented 2 years ago

[ ] The tool should be able to run without selecting a phenotype target.

nomadscientist commented 2 years ago

I can already see how it would be very (very) important for users to be able to specify components of the charts (for example, setting all of their y axes to 1).

[ ] Can these parameters be included?

nomadscientist commented 2 years ago

I don't understand how this chart is generated - where do we tick the box that says 'look in the beta cell proportions'?

nomadscientist commented 2 years ago

On that same avenue, researchers will want to be able to input a phenotype (so imagine they were analysing two sets of samples, one with disease, and one without)

[ ] Rather than stipulating a 'phenotype target', can they stipulate a metadata column?
[ ] Where does 'T2D' come from? I thought all the bulk RNA-seq samples were from healthy participants... were some of them labelled T2D?
[ ] Change label of dataset from 'healthy' (as the #scrna contains both healthy and T2D)

mtekman commented 2 years ago

> * [ ]  The tool should be able to run without selecting a phenotype target.

I'm not sure if that's possible.... but I'll dig into the code, maybe I can split it out.

> I can already see how it would be very (very) important for users to be able to specify components of the charts (for example, setting all of their y axes to 1).
> * [ ]  Can these parameters be included?

Likely!

> I don't understand how this chart is generated - where do we tick the box that says 'look in the beta cell proportions'?

This is the phenotype factor (or metadata column) from the rna-seq data given in the first image you shared above. The user would write hba1c and this factor would be treated as the X-axis in the cell type proportion plot.

> On that same avenue, researchers will want to be able to input a phenotype (so imagine they were analysing two sets of samples, one with disease, and one without)
> * [ ]  Rather than stipulating a 'phenotype target', can they stipulate a metadata column?

So I could likely change it so that they can just specify a column that specifically labels a sample as "diseased" or "healthy", but I like the idea of specifying a factor for it (like hba1c), because it changes the designation from a binary classification into a more linear one, where you can effectively set the threshold at which someone becomes diseased (in this case, when they have a hba1c level greater than 6.5%).

Also, not changing it means I don't need to mess with the internal code too much ;-P
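
As a minimal sketch of that idea (in Python, with invented sample values; `disease_label` and `HBA1C_THRESHOLD` are hypothetical names, not anything in the tool code), the continuous factor only becomes a binary label once a threshold is applied:

```python
HBA1C_THRESHOLD = 6.5  # percent; the cut-off mentioned above

def disease_label(hba1c, threshold=HBA1C_THRESHOLD):
    """Label a sample as diseased when its hba1c value is strictly above the threshold."""
    return "diseased" if hba1c > threshold else "healthy"

# Invented hba1c values for three hypothetical bulk samples.
samples = {"Sub1": 5.8, "Sub2": 7.1, "Sub3": 6.5}
labels = {name: disease_label(value) for name, value in samples.items()}
print(labels)
```

Moving the threshold changes which samples count as diseased without touching the metadata itself, which is the flexibility a fixed "diseased"/"healthy" column would lose.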

> * [ ]  Where does 'T2D' come from? I thought all the bulk RNA-seq samples were from healthy participants... were some of them labelled T2D?
>
> * [ ]  Change label of dataset from 'healthy' (as the #scrna contains both healthy and T2D)

T2D is the disease factor, this time taken from the metadata column of the scRNA-seq data.

https://github.com/galaxyproject/training-material/blob/3e1552b0be5291fa429068e6f4b67b47edce68fa/topics/transcriptomics/tutorials/bulk-music/tutorial.md?plain=1#L218

This I believe is a binary classifier, but it could be anything the user wishes to use to compare a disease status in an scRNA dataset against another disease status in the RNA-seq dataset.

As I type this out, I realise how unclear that is

nomadscientist commented 2 years ago

SO CLOSE!

The last set of 'construct expression objects' isn't working. URL: https://humancellatlas.usegalaxy.eu/u/wendi.bacon/h/deconvolution-dendrogram-of-mouse-data

mtekman commented 2 years ago

That large dataset shouldn't be used in the tutorial, so not sure why you downloaded it :P

That being said, good that you tested it because it shows either a weakness in the code or a weirdness to the dataset.

The error message says:

Number of Samples between phenotypes and assays are not the same

and when I download the large files and test the first 6 sample names in each set, I see:

    > head(rownames(phenotypedata), 6)
    [1] "TGGTTCCGTCGGCTCA-2" "CGAGCCAAGCGTCAAG-4" "GAATGAAGTTTGGGCC-5"
    [4] "CTCGTACGTTGCCTCT-7" "TTCTCAATCCACGCAG-5" "CCTTCCCCATACCATG-4"
    > head(colnames(expressiondata), 6)
    [1] "TGGTTCCGTCGGCTCA.2" "CGAGCCAAGCGTCAAG.4" "GAATGAAGTTTGGGCC.5"
    [4] "CTCGTACGTTGCCTCT.7" "TTCTCAATCCACGCAG.5" "CCTTCCCCATACCATG.4"

so the difference is that phenotypes use a - and the expression set uses a . for the sample names, which is hilarious.

Here was me adding checks to the code for stringency, and it fails on their own major datasets.

I guess I could add a tickbox to the tool that disables the sample stringency test
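
An alternative to disabling the check would be to normalise the names before comparing them. A minimal Python sketch, assuming the only difference really is `-` versus `.` (the `normalise` helper is hypothetical, and this is not how the tool's check is currently written):

```python
# First three sample names from each table, as shown in the R output above.
phenotype_names = ["TGGTTCCGTCGGCTCA-2", "CGAGCCAAGCGTCAAG-4", "GAATGAAGTTTGGGCC-5"]
expression_names = ["TGGTTCCGTCGGCTCA.2", "CGAGCCAAGCGTCAAG.4", "GAATGAAGTTTGGGCC.5"]

def normalise(name):
    """Replace '-' with '.' so both naming styles compare equal."""
    return name.replace("-", ".")

# Strict comparison fails, comparison after normalisation succeeds.
strict_match = phenotype_names == expression_names
relaxed_match = [normalise(n) for n in phenotype_names] == [normalise(n) for n in expression_names]
print(strict_match, relaxed_match)
```

This keeps the stringency test useful for genuinely mismatched samples, at the cost of treating `-` and `.` as interchangeable in sample names.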

mtekman commented 2 years ago

Okay, so I have no time this week to do anything - but I've made a checklist for next week based on your review (and thanks!):

[ ] Add y-axis scale fix
[ ] Labelling fix of "healthy" in tutorial
[ ] Remove the need to specify a disease metacolumn phenotype in the bulk
- Need to understand exactly whether the deconvolution can operate without it
[ ] Add "disable sample stringency check" for large datasets with slightly differing sample names

if I've missed anything, let me know!

mtekman commented 2 years ago

The problem we're likely to run into is that this MuSiC tool appears to be a specific RNA deconvolution tool to help understand a specific phenotype, and we're trying to generalise it as much as we can.

I'm not sure how easy this will be, and it took a lot of head-scratching to detangle it from the hba1c phenotype target. Future users might just need to create some kind of disease factor in their metadata for it to work (I realise how dumb that sounds)

nomadscientist commented 2 years ago

That is HILARIOUS! Ok, but also, I am using the data linked in the tutorial -

Is this the wrong dataset?

nomadscientist commented 2 years ago

> > I don't understand how this chart is generated - where do we tick the box that says 'look in the beta cell proportions'?
>
> This is the phenotype factor (or metadata column) from the rna-seq data given in the first image you shared above. The user would write hba1c and this factor would be treated as the X-axis in the cell type proportion plot.

Yes, but why beta cells? There are 6 different cell types, how was 'beta' specified? What if I wanted to know about the alpha cell types? In fact it's weird it doesn't spit out all the cell types, quite frankly.

> > On that same avenue, researchers will want to be able to input a phenotype (so imagine they were analysing two sets of samples, one with disease, and one without)
> > * [ ]  Rather than stipulating a 'phenotype target', can they stipulate a metadata column?
>
> So I could likely change it so that they can just specify a column that specifically labels a sample as "diseased" or "healthy", but I like the idea of specifying a factor for it (like hba1c), because it changes the designation from a binary classification into a more linear one, where you can effectively set the threshold at which someone becomes diseased (in this case, when they have a hba1c level greater than 6.5%).
>
> Also, not changing it means I don't need to mess with the internal code too much ;-P

Don't hate me, but it would be very valuable to do both...

> > * [ ]  Where does 'T2D' come from? I thought all the bulk RNA-seq samples were from healthy participants... were some of them labelled T2D?
>
> > * [ ]  Change label of dataset from 'healthy' (as the #scrna contains both healthy and T2D)
>
> T2D is the disease factor given from the metadata column of the scrna-seq this time.

So that makes sense, but why is it appearing in the bulk-deconvolution graph outputs? We never even labelled T2D anywhere, so it must be within the code to automate that. Was that 'sample disease group', in which case, is it just doing the phenotype target threshold? Or is it taking into account the scRNA T2D vs non-T2D data? What if I wanted the triangles to be male vs female in a dataset, for instance? Would I type 'male' into 'Sample Disease Group'?

mtekman commented 2 years ago

> That is HILARIOUS! Ok, but also, I am using the data linked in the tutorial - Is this the wrong dataset?

Ah I see. No, it's correct, I just apparently never tested it

> Yes, but why beta cells? There are 6 different cell types, how was 'beta' specified? What if I wanted to know about the alpha cell types? In fact it's weird it doesn't spit out all the cell types, quite frankly.

Ah... Hmm. Yes. Good question....

I'm looking through the code now, and I genuinely don't know.... it could just be that the title is misleading, and that it's showing UserSetFactorX vs UserSetFactorY and I've misinterpreted it.... Hmmm.... I'm not sure and will have to dig a bit

> Don't hate me, but it would be very valuable to do both...

Lol no, I've already torn the code apart once, I can do it again, and yes I can see how it would be more practical.

> So that makes sense, but why is it appearing in the bulk-deconvolution graph outputs? We never even labelled T2D anywhere, so it must be within the code to automate that. Was that 'sample disease group', in which case, is it just doing the phenotype target threshold?

Exactly, it's applying the phenotype target threshold and determining what is healthy and what is not.

> Or is it taking into account the scRNA T2D vs non-T2D data?

This I don't know....

> What if I wanted the triangles to be male vs female in a dataset, for instance? Would I type 'male' into 'Sample Disease Group'?

Yes, and instead of hba1c you would give a column from the phenotype data, maybe named gender, as the disease factor. (Though I should change the label "Disease" to something less accusatory).

I'm actually presenting this tomorrow morning in a lab meeting, and I think I'll take a screenshot of this convo as a prime example of how the revision process works, and what seems completely normal to the devs is not bounded in reality by the scientists :laughing:

nomadscientist commented 2 years ago

Nice! Tell them I'm a useful member of society too

mtekman commented 2 years ago

You shall be introduced as the ranting lunatic I met on the web :P

nomadscientist commented 2 years ago

Woo! Awesome, ping me when these bits are sorted and then I'll finish testing the tutorial and I think I can even add myself as a reviewer on Github, how exciting!

mtekman commented 2 years ago

Will do! ETA: sometime towards the end of next week

nomadscientist commented 2 years ago

> The problem we're likely to run into is that this MuSiC tool appears to be a specific RNA deconvolution tool to help understand a specific phenotype, and we're trying to generalise it as much as we can.
>
> I'm not sure how easy this will be, and it took a lot of head-scratching to detangle it from the hba1c phenotype target. Future users might just need to create some kind of disease factor in their metadata for it to work (I realise how dumb that sounds)

You're bang on with this, and I think this is also just underlying a general problem of 'Wow isn't my tool awesome' 'But it needs major rewrites to work for anyone else?' 'That's fine, they can figure it out...'

Maybe the scale of rewrites is good, because then this counts as even more of an output for you and can generate even more impact. MMusic :)

mtekman commented 2 years ago

@nomadscientist I rebased everything just to keep this PR up to date with the main branch, since it's a long WIP.

All your edits and commits are still there, but if you make any future edits -- please pull everything first

bgruening commented 2 years ago

@mtekman @nomadscientist what an epic work, very cool. Not sure if you are aware, but we can get this merged but hidden by default. This way you can already use a link to point to the tutorial, but people browsing the GTN would not see it. Maybe that is useful for you here.

This one for example is such a tutorial: https://training.galaxyproject.org/training-material/topics/assembly/tutorials/vgp_genome_assembly/tutorial.html

mtekman commented 2 years ago

@bgruening nice - that looks optimal for us, since we're including it in the GTN and will need to link to it soon. How can we get this merged and hidden?

@nomadscientist - so I've updated the tools and trainings, but because I re-based it's hard to see exactly what I changed unless you click on each and every commit after yours(!) Instead, I'm hosting both the tools and the training on a public server so you can go through it without having to reset and rebase your local history (I've shared the link in the whatsapp).

Things that have changed:

- Added a fixed y-axis option to the scatter plots
- Made the "disease" stuff now optional for the deconvolution. People can now just supply a bulk set and a scRNA set and hit execute and it should just work without any other fiddling.
- Fixed the "where is the beta cell specified?" issue you mentioned before - which I previously hand-waved away with some nonsense about factors and levels due to me being misled by the code - but after looking more deeply at the code, yes you should definitely be able to specify exactly which cell type you want to explore when measuring a bulk disease phenotype.

Hopefully these changes are reflected both in the tools and the updated training text. Please can you run through it again and see if it makes sense still?

bgruening commented 2 years ago

enable: false is your friend, see https://github.com/galaxyproject/training-material/blob/main/topics/assembly/tutorials/vgp_genome_assembly/tutorial.md

nomadscientist commented 2 years ago

Wait, how can I edit it now???

mtekman commented 2 years ago

@nomadscientist Wew, back at a computer. So the live site is at that website I shared, but to review it you need to do inline comments on Github here:

1. Go here: https://github.com/galaxyproject/training-material/pull/2790/files
2. Scroll down to the file: "topics/transcriptomics/tutorials/bulk-music/tutorial.md" (it might say "Load Diff" and show nothing for the file; click on "Load Diff" to load the whole file).
3. Every line in that file will start with a "+" symbol, which you can click on to add a comment.
4. If the comment is part of a larger review, then start a review, add comments, and when finished submit the review with a judgement ("accepted", "needs changes", etc.)

nomadscientist commented 2 years ago

1) OK, I think there are a few (very minor) text things that I can fix when this is merged - they in no way take away from the learning experience or quality of the tutorial, so I think it should just be merged!

2) I LOVE how you've rejigged the MuSiC tool. It's much clearer and can now be adapted in a heap of ways, very cool.

3) The drag & drop is no longer necessary, yay! I hope this works all the time and not just on the Galaxy you shared...

4) For the scRNA T2D - is this comparing the expression in the T2D cells with the bulk, or is it just saying anything above 6.5 we're counting as T2D? I'm not sure how the relationship works between these metadata and how it analyses the cell types. OH or is it saying that it identified bulk samples that likely correspond to the T2D samples, AND AS WELL plotted this along hba1c, and shows a direct correlation, so therefore clearly the deconvolution is working well at identifying 'likely disease'? Because if that's how it works, THAT is cool.

5) If somebody had a bunch of samples from a knockout and a wildtype and wanted to compare cell proportions in the two samples, how would they do this? Would they select this as the phenotype factor? I get that it would be calculated in the log of MuSiC fitting, but I'm trying to figure out how they'd display this.

5b) Actually yeah, we need to be able to distinguish this in the jitter plots side-by-side

6) Can we input multiple into the 'scRNA phenotype cell target'? i.e. beta,gamma (I tried this and we can't, but damn it would be helpful)

7) There should be a toggle to tick 'DO NOT SHOW NNLS' because it's unnecessary for use. Much more useful would be for the plot to show wildtype where MuSiC is and (knockout or similar group) where NNLS is. [I know this is a big change :( 5b)

8) Other than that, I finished the tutorial and it's awesome! @mtekman

mtekman commented 2 years ago

Edit: Formatting

> For the scRNA T2D - is this comparing the expression in the T2D cells with the bulk, or is it just saying anything above 6.5 we're counting as T2D?

So the 6.5 threshold is measured against the bulk RNA dataset which contains this hba1c factor (see Dataset 2 in history), and this bulk dataset has 89 samples, all of which are plotted in the graph (Figure 8).

MuSiC, using scRNA data to work out cell proportions, then tells us in that graph what the percentage of beta cells is for each of the 89 bulk RNA samples, and plots this value against the hba1c factor.

The expectation from the literature/caption is that we see a decrease in beta cells as the hba1c bulk RNA factor increases. Before using MuSiC, we had no idea what the proportion of beta cells in that sample was, but now we do!
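
As a rough illustration of what that plot conveys, the Python sketch below computes a Pearson correlation between estimated beta cell proportions and hba1c. The five data points are invented for the example and are not taken from the tutorial data; `pearson` is a hypothetical helper, not part of MuSiC.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Invented values: hba1c (%) and estimated beta cell proportion per bulk sample.
hba1c = [5.2, 5.9, 6.4, 7.0, 8.1]
beta_proportion = [0.30, 0.27, 0.22, 0.18, 0.10]

r = pearson(hba1c, beta_proportion)
# A negative r means the beta cell proportion falls as hba1c rises.
print(r < 0)
```

With real data the relationship would be noisier than in this toy example, so the sign and rough size of the trend matter more than the exact value.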

> OH or is it saying that it identified bulk samples that likely correspond to the T2D samples

So that is shown in Figure 9 with the heatmap, with RNA samples as rows and scRNA-derived cell types as columns.

I realise that if someone asked "what is the cell composition of my bulk data?" a pie chart might be easier to visualize.... but as we see in that heatmap, there are a lot (89) of samples in this case and so it's not clear whether we need to generate 89 separate pie charts or not.

> AND AS WELL plotted this along hba1c, and shows a direct correlation, so therefore clearly the deconvolution is working well at identifying 'likely disease'? Because if that's how it works, THAT is cool.

Yes and Yes :D

> If somebody had a bunch of samples from a knockout and a wildtype and wanted to compare cell proportions in the two samples, how would they do this?

I believe they would run the program twice, both times using the same scRNA dataset (to have the same cell profiling) but different bulk RNA dataset samples.

> Would they select this as the phenotype factor? I get that it would be calculated in the log of MuSiC fitting.

There would be no need to set a phenotype factor, since a disease would not be the thing being compared there. Under, "Show proportions of a disease factor?", "No" should be selected

> but I'm trying to figure out how they'd display this.

They'd literally compare two separate heatmaps, methinks. If they have mutant and WT in the same bulk RNA dataset, then they need to split this dataset into two.
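
A minimal Python sketch of that split, assuming the phenotype table has a column giving each sample's genotype (the `genotype` mapping, the sample names and the `subset_samples` helper are all hypothetical, and the counts are invented):

```python
# Toy bulk count matrix: genes as rows, samples as columns (values are invented).
bulk_counts = {
    "GeneA": {"S1": 10, "S2": 4, "S3": 8, "S4": 1},
    "GeneB": {"S1": 0, "S2": 9, "S3": 2, "S4": 6},
}
# Hypothetical phenotype column giving the genotype of each sample.
genotype = {"S1": "WT", "S2": "mutant", "S3": "WT", "S4": "mutant"}

def subset_samples(counts, genotype, group):
    """Keep only the columns (samples) whose genotype equals `group`."""
    keep = [s for s, g in genotype.items() if g == group]
    return {gene: {s: row[s] for s in keep} for gene, row in counts.items()}

wt_counts = subset_samples(bulk_counts, genotype, "WT")
mutant_counts = subset_samples(bulk_counts, genotype, "mutant")
print(sorted(wt_counts["GeneA"]), sorted(mutant_counts["GeneA"]))
```

Each subset would then go through the deconvolution separately, against the same scRNA reference, to give the two heatmaps being compared.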

> Can we input multiple into the 'scRNA phenotype cell target'? i.e. beta,gamma (I tried this and we can't, but damn it would be helpful)

I believe so, it would take a few days though to implement.

> There should be a toggle to tick 'DO NOT SHOW NNLS' because it's unnecessary for use.

This I can easily remove, but I keep it just so that users can validate their cell type proportions (via MuSiC) with another older method (NNLS). It's part of the original training, so I figured I should not change it. What do you think? Give users the option?

> Much more useful would be for the plot to show wildtype where MuSiC is and (knockout or similar group) where NNLS is. [I know this is a big change :( 5b)

So I know how I would go about doing this, but this will definitely be tricky to implement for two reasons:

- The plots will definitely require some TLC to get right
- It would deviate heavily from the original training material "vignette" that this training was based on. We would have to write a completely new one for the use case of comparing mutant and wildtype. And maybe we should?

If so, we would use the Mutant vs WT training first, and then lead them onto this training where we look at how a disease factor that correlates with a single cell type can be seen in bulk data.

Also, we're running out of time!

mtekman commented 2 years ago

More thoughts:

If we were to do a Mutant vs WildType comparison, maybe I should make another tool called "MuSiC Compare" which takes three (or four) files as input: WT bulk, Mutant bulk, WT scRNA, (and optionally a Mutant scRNA).
- The output of the tool would just be two heatmaps side by side, and two pie charts which summarize the cell compositions. Users can then use the main MuSiC tool if they want to probe disease phenotypes further.
However, the above tool limits the analysis to a pairwise comparison. What if there are multiple condition groups and we wish to see the cell composition of each group?
- It gets tricky, do we do yet another tool for this use case or do we try to merge it with the above tool that takes 4 inputs?

I'm not complaining here btw, I really think we could flesh this thing out. We would just need a new training for it, and that likely won't be presented in the GTN

galaxyproject / training-material

MuSiC Tutorial #2790