VEuPathDB / microbiomeComputations


Correlations handle missing metadata #59

Closed: asizemore closed this issue 8 months ago

asizemore commented 10 months ago

Samples may not all have data for assays (not the case for mbio, but it's possible in general), and they definitely don't all have data for each metadata var. For example, in bonus 8 part rep measures are missing data for recumbent length. Currently, if there's missing data in var A, then all the correlation coefs for var A vs all other vars will be NA. We filter all empty edges out in the viz, so those vars just get dropped entirely.

Ideally we'd compute the correlation in an elegant way that handles missing data. One idea I had was to return the number of samples with data (n) along with the correlation coef. That's nice for a table, but not great for the bipartite viz because we don't want to show tooltips for edges (the natural place to put n). On the other hand, if we show corr results in a heatmap, then hovering over a cell would be a good place for n. So yeah, maybe sending back n for each link is the best strategy...
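To make the "send back n with each coefficient" idea concrete, here is a minimal sketch in Python. It is not the package's R code: the function name `pearson_with_n` and the choice to treat `None`/NaN as missing are assumptions made for illustration. It drops any sample missing either value and reports how many samples actually contributed.

```python
import math
from typing import Optional, Sequence, Tuple


def is_missing(value) -> bool:
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def pearson_with_n(x: Sequence, y: Sequence) -> Tuple[Optional[float], int]:
    """Pearson correlation over samples where both x and y are present.

    Returns (coefficient, n), where n is the number of contributing samples.
    The coefficient is None when it is undefined (fewer than 2 pairs, or
    zero variance in either variable).
    """
    pairs = [(a, b) for a, b in zip(x, y) if not is_missing(a) and not is_missing(b)]
    n = len(pairs)
    if n < 2:
        return None, n

    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return None, n
    return sxy / math.sqrt(sxx * syy), n


# Two of the five samples are missing a value, so only three pairs contribute.
coef, n = pearson_with_n([1, 2, 3, None, 5], [2, 4, 7, 8, None])
```

With this shape, a table or heatmap cell could show `n` next to the coefficient, while the bipartite viz could simply ignore it.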

d-callan commented 10 months ago

I'd think for each edge we'd want to know:

  1. How many points we have contributing to the correlation coefficient
  2. How many rows were missing data for variable 1
  3. How many rows were missing data for variable 2
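The three per-edge counts listed above could be computed along these lines. This is a hedged Python sketch, not the project's implementation: `edge_completeness` and the dictionary keys are hypothetical names, and it assumes missing values arrive as `None` or NaN.

```python
import math
from typing import Dict, Sequence


def is_missing(value) -> bool:
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def edge_completeness(var1: Sequence, var2: Sequence) -> Dict[str, int]:
    """Count contributing and missing rows for one edge (pair of variables).

    A row missing both values is counted in both missing totals, so the
    three numbers do not necessarily sum to the number of rows.
    """
    n_contributing = 0
    n_missing_var1 = 0
    n_missing_var2 = 0
    for a, b in zip(var1, var2):
        missing_a = is_missing(a)
        missing_b = is_missing(b)
        if missing_a:
            n_missing_var1 += 1
        if missing_b:
            n_missing_var2 += 1
        if not missing_a and not missing_b:
            n_contributing += 1
    return {
        "n_contributing": n_contributing,
        "n_missing_var1": n_missing_var1,
        "n_missing_var2": n_missing_var2,
    }


# Four rows: only the third has both values present.
counts = edge_completeness([1, None, 3, None], [None, 2, 3, None])
```

Whether a row missing both values should be reported once or twice is a display decision that would need settling before reusing a completeness table.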

Then, once we introduce the scatterplot, this data can go in a table near the scatterplot, similar to how we handle missingness in our non-compute stats chart vizs.

d-callan commented 10 months ago

Here's an image from live ClinEpi as an example. Our scatter could have the number of contributing points in the title, and hopefully we could reuse the completeness table back-end and front-end code as well. Ignore the bird's-eye bar, I think.

[image: scatter-missingness]

d-callan commented 10 months ago

Hmmm. Sorry, I'm also realizing that for clarity I should probably say that I think whatever request we make for the scatterplot should get this info for free. So I'm inclined to hold off on this ticket until we integrate the scatterplot into the correlation viz.

asizemore commented 10 months ago

To make sure I understand, is your suggestion to not handle NAs until we do the scatterplot work, which would mean we can only compute correlation with metadata that has all values present? Or is your proposal to not worry about sending back all of the extra data about missingness until we do the scatterplot?

d-callan commented 9 months ago

I modified the R to use complete cases to find the correlation coefficients as a first step, but I don't think we want to report how many samples were complete etc. yet.
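For readers unfamiliar with the term, a "complete cases" filter keeps only the rows in which every variable of interest has a value, and correlations are then computed on what remains. The Python sketch below shows the row-wise (listwise) version of that idea. It is an illustration only: the thread does not show the actual R change, so whether it filters across all variables at once or per pair is not known here, and `complete_cases` is a hypothetical name.

```python
import math
from typing import Dict, List


def is_missing(value) -> bool:
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def complete_cases(columns: Dict[str, List]) -> Dict[str, List]:
    """Keep only the rows where every column has a value (listwise deletion).

    `columns` maps a variable name to its values, one entry per sample.
    All columns are assumed to have the same length.
    """
    names = list(columns)
    n_rows = len(columns[names[0]]) if names else 0
    keep = [
        i
        for i in range(n_rows)
        if not any(is_missing(columns[name][i]) for name in names)
    ]
    return {name: [columns[name][i] for i in keep] for name in names}


table = {
    "a": [1, 2, None, 4],
    "b": [1, None, 3, 4],
    "c": [1, 2, 3, 4],
}
# Rows 1 and 2 each miss one value, so only rows 0 and 3 survive.
filtered = complete_cases(table)
```

One trade-off worth noting: filtering across all variables at once can discard rows that would still be usable for a particular pair (here, `a` vs `c` has three usable rows but only two survive), whereas filtering per pair keeps more data at the cost of each coefficient being based on a different set of samples.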

Also, for assay-vs-assay correlation that may get complicated. It's possible for two assays on different entities to be in the same viz together even though they're on different branches of the dataset diagram. They do, however, have to have the same parent, and both be 1:1 with that parent. So we could report counts for the parent? Not sure what's the least confusing thing.

d-callan commented 8 months ago

Is this still a must-have?

asizemore commented 8 months ago

I think this was solved by adding the complete cases fn. I'm going to close the issue, and see if any bug bubbles up again that makes us reopen it.