VEuPathDB / microbiomeComputations

1 stars 0 forks source link

How do we setup correlation for handling non-Abundance data? #81

Closed asizemore closed 7 months ago

asizemore commented 7 months ago

Found while testing the wgcna data. There are negative values in the data, which the Abundance class doesn't like.

Worth making a new class for general eda data? The Abudnance class could be a subclass. Then we swap the correlation signaturs to work wiht the eda data class, since at this point we have nothing in there that is related to the data being Abundance data iirc. In the future we can add options for Abundance data specifically....

Other thoughts?

d-callan commented 7 months ago

Can you explain the data to me a bit better? What do the negative values represent? I'm a bit disinclined to making a very vague class unless it's essentially abstract.

I'm also not sure these data should be coming through this package really.. the methods here are sometimes rather specific or optimized for mbio data. It's possible we need to introduce flavors of plugins or something in the compute service and either a new R package or refactor some things out of this package to do this right, though I get that might be something we punt as tech debt for now.

asizemore commented 7 months ago

A new package is a good idea. Agreed maybe for later but worth thinking about. Also for now somethin ggeneral could exist in veupathUtils

The collections are sets of eigengene expressions, which unintuitively can be negative becuase they're really the first principle component of each module. So negative values mean genes in that module were lowly expressed for the given sample.

In general though, our correlation app should be happy taking any reasonable eda data. In this case i was imagining a class that's basically just wahtever data and then it would need the appropriate ids.

asizemore commented 7 months ago

I also wonder if the compute service should get a plugin for non-abundance data or whatever type of data we decide the eigengene collections should be. Seems like it shouldn't care in the general case and the R can figure it out?

d-callan commented 7 months ago

the more specific classes let us do things like know its appropriate to use maaslin for the mbio stuff but not the eigengenes, which is very valuable. i think this error about negative values is a feature, not a bug. im trying to reframe the question from 'how do we get around this error' to 'how do we properly support wgcna data'..

if we choose to leave that question for later just to get something to demo.. it seems like we should do the simplest thing we can that doesnt disturb this package on account of data that probably doesnt belong here. is it crazy to put a hack in the compute plugin for now that checks if we have negative values and rescales the data to above 0 if so? then open a ticket to figure out what to do w these data before this could go into production, w a reminder to remove the hack before then too?

d-callan commented 7 months ago

also ill note some thoughts for the future here so i dont forget, and we can move it to another ticket if needed:

  1. i think the base correlation methods that have signatures like (data.frame, data.frame) etc could move to veupathUtils. the AbundanceData specific ones could stay here and find the generic from veupathUtils.
  2. the compute plugins could learn to parse collection annotations to know when its appropriate to build an AbundanceData obj, vs something else maybe.
  3. The ComputeResult class should probably also be moved to veupathUtils from here. It could maybe also gain some custom constructors so its easier to build one w semi-arbitrary data? that last i want to think about more yet..
asizemore commented 7 months ago

im trying to reframe the question from 'how do we get around this error' to 'how do we properly support wgcna data'..

Yes i agree. I didn't mean to suggest we were doing anything wrong, just intending to highlight it's time to grow into new data types. I'll update the title to hopefully clarify the intention.

I really, really like the reorg idea. Especially realizing that we'll likely need a genomics-focused package. It could be not so far away if folks like the demo. The backup plan of the compute service update is good, but if possible lets do the big reorg we should go for it 🚀

d-callan commented 7 months ago

i think ill probably do the compute plugin hack first, so people can play as needed and hopefully find other issues early. then start a reorg.

d-callan commented 7 months ago

im going to close this as completed w the previously referenced compute service pr. ive opened some issues in more appropriate places relating to long term support of this data:

  1. https://github.com/VEuPathDB/veupathUtils/issues/36
  2. https://github.com/VEuPathDB/service-eda-compute/issues/85

more tickets might get filed as a path forward for the compute service issue becomes more clear.