VEuPathDB / web-monorepo

A monorepo that contains all frontend code for VEuPathDB websites
Apache License 2.0
2 stars 0 forks source link

Maaslin method for diff abund #526

Closed d-callan closed 8 months ago

d-callan commented 8 months ago

this turned into more than i thought it would be... the idea was to add a new input for the diff abund compute, to support choosing a method explicitly. backend that meant what i thought was a minor api change, which was something less minor here than i had assumed. so ive gone and basically added more explicit support for variable collections as well.

Still need to make the new method input prettier and have the available method differ by the collection chosen. im considering leaving the available methods unconstrained for the time being, until some backend work is done to better support that.

d-callan commented 8 months ago

oh. i also need to figure yet what to do about saved analyses and this api change breaking them.

asizemore commented 8 months ago

If the collection work is in separate commits from the volcano work, it'd be helpful to break them into separate PRs. One i can't think about two things at once lol and two Dave is off this week and will be able to provide better feedback on the collection changes than i can.

For the maaslin changes, i didn't test them just read through. In general i think they look good, but we should have the visualization split the data from the effectSizeLabel. That way the volcano component is getting just general volcano data in the data prop, and the visualization handles the specifics of our backend. Specifically, i think VolcanoPlot could gain an effectSizeLabel prop, and then VolcanoPlotData can just be the array of points again. In the visualization, we can snatch the effectSizeLabel off of the response and not worry about it during the data cleaning or anything.

d-callan commented 8 months ago

when i first started looking at this things spiraled pretty quickly, and im not sure the collections stuff actually is in separate commits or trivial to separate at this point. thats on me :( but the collections stuff is also required to test the volcano changes, and i dont really want to wait a week for that if im real honest.. maybe @jernestmyers could take a look if hes around?

i have mixed feelings about splitting the label for the stats from the data, if im honest. i get where youre coming from............... i do. but idk, i dont even like that the adjusted p values dont have explicit labels attached to them.. theres more than one way to adjust p values. it doesnt feel like a coincidence or a technical detail of how the backend does its work to me. is there any other valid place to get a label for a statistic, other than the same place you got the statistic?

asizemore commented 8 months ago

That's true it is nice to have the label right there with the data. But I'd argue that even if a volcano plot showed a general "effect size" label on the x axis, that it would still be a valid volcano plot. The visualization plugin in my head holds all the context, and the component takes data that is as basic and general as possible. For example, what if genomics wanted a volcano plot but it's in a place that always uses deseq and the genomics folks call that effect size something gene-specific? Like log2(fold expression change) or something? We'd have a genomics-specific visualization that could hold all of that context while the component can stay light.

d-callan commented 8 months ago

so i dont think id be ok w a volcano plot that wasnt explicit about what the stats actually represented. if i saw 'effect size' as an axis label, id not trust that without knowing more. though i recognize im maybe an outlier. as for where the label comes from and whether it depends on context.. idk to me its the choice of deseq that determines the label, not a genomics vs mbio context. i can appreciate that mbio folks might like 'Fold Change' and genomics might like 'FC' or something, but the statistic from deseq will always be fold change regardless and im uncomfortable logically separating that label from the values. its too easy to write 'Fold Change' into some viz, then add some method where thats not accurate or appropriate, and miss it. its the statistics + its label that constitutes data.

asizemore commented 8 months ago

if i saw 'effect size' as an axis label, id not trust that without knowing more

I agree! But the visualization plugin is the context that gives us the information. If the plot said "effect size" as the label but the whole page was talking about the log fold change as the effect size, then it'd be clear, no?

i can appreciate that mbio folks might like 'Fold Change' and genomics might like 'FC' or something, but the statistic from deseq will always be fold change regardless

Right exactly i agree! But we shouldn't put all the genomics and mbio logic into VolcanoPlot, that all is supposed to live in VolcanoPlotVisualization. So then we could have the effectSize as part of data like in the PR and have an optional prop that gives a label for the effect size if we don't want to use whatever comes with the data. Another option is to have the data be just an array of data points and require an effect size label prop. That way anyone using the component has to think about it at least.

Both solve the problem, but i'm in favor of the latter because it makes the component easier to use when the inputs are more basic objects, and also easier to refactor if we wanted to change out the volcano component in the future (we have this problem with scatter where there's a ton of processing of the data to make it perfect for plotly but that causes a headache). @d-callan would this approach quell your concerns?

d-callan commented 8 months ago

im not willing to assume that these plots will always be used in the context of some eda viz plugin and believe that what the plot components provide should be able to stand on their own. im not too worried about differences between mbio and genomics, bc i think theyll be minor things that we could resolve as they come. my point there was more to do w the fact that mbio and genomics actually CANT choose completely different labels, bc the label is inherently tied to the statistic/ analytical method providing the statistic. i know its a slightly involved shape for the data now i guess, but its not bc of some specific lib were using, its bc of the data themselves. if we made a required additional prop for effectSizeLabel, anybody could pass anything as a label, easily and even accidentally..

d-callan commented 8 months ago

id like to merge this, as well as the backend changes to compute, rserve and data, asap. i dont like having that many threads lingering. this is now functional in the sense it does what i set out to make it do. we still need:

  1. to make the ui prettier for the compute config. id like @jernestmyers help w that, and can happen after merge.
  2. to sort out breaking the saved analyses. i should talk again w @dmfalke i think for that when hes back from vacation. again i dont want to wait for that.
  3. constraining the compute config. i also dont think we need to do this before merge, and id like to wait for the backend changes from @dmgaldi .
asizemore commented 8 months ago

Sounds good to me. Also happy to help if needed.