statistical test of left censorship

rmflight commented 6 months ago

We think we can statistically test whether missingness is due to left censorship, using this idea:

1) For any feature that has missing values in one or more samples, take the remaining non-missing values in a group.
2) Calculate whether each of them is over or under the median of the non-missing values.
3) Across all features with missing values, combine the results into a vector of 1's and 0's (success and fail).
4) Do a binomial test of successes / failures.
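
The steps above can be sketched in Python (standard library only; the linked implementation is R and is not reproduced here). This is a minimal sketch under two assumptions the thread does not spell out: the reference point is each sample's own median of non-missing values, and a "success" is a value below that median. The function name `censorship_outcomes` and the toy data are hypothetical.

```python
from statistics import median


def censorship_outcomes(values):
    """Build the success / failure vector for one group of samples.

    values: dict mapping sample name -> list of feature values, with None
    marking a missing value. Every list has one entry per feature.
    Returns a list of 1's and 0's, where 1 means a non-missing value lies
    below the median of its sample (assumed definition of success).
    """
    samples = list(values)
    n_features = len(values[samples[0]])
    # Assumed reference point: median of the non-missing values per sample.
    sample_median = {
        s: median(v for v in values[s] if v is not None) for s in samples
    }
    outcomes = []
    for i in range(n_features):
        column = [values[s][i] for s in samples]
        if all(v is not None for v in column):
            continue  # feature is never missing in this group, so skip it
        for s, v in zip(samples, column):
            if v is not None:
                outcomes.append(1 if v < sample_median[s] else 0)
    return outcomes


# Toy data: 3 samples x 4 features, made up for illustration.
values = {
    "s1": [1.0, 5.0, 9.0, None],
    "s2": [2.0, 6.0, 10.0, 7.0],
    "s3": [None, 4.0, 8.0, 2.0],
}
outcomes = censorship_outcomes(values)
successes = sum(outcomes)
failures = len(outcomes) - successes
print(outcomes, successes, failures)
```

The success and failure counts are what would then go into the binomial test.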

rmflight commented 6 months ago

OK, so either I'm not understanding something, or the idea is a bit flawed.

There is a problem here. By definition, the median is the middle of the remaining distribution. So on average, we actually do expect the remaining non-missing values to split 50 / 50 over and under the median of the non-missing values.

For example, for the Yeast dataset, I see this binomial result:

number of successes = 19017, number of trials = 38760, p-value = 0.0002308

95 percent confidence interval:
 0.4856458 0.4956249

The only reason the p-value is this small is the huge number of values we are aggregating over. I need to test, but I suspect that if I took smaller subsets of the success / failure vector, the result wouldn't be statistically significant.

The code I'm using for this test is here: https://github.com/MoseleyBioinformaticsLab/visualizationQualityControl/blob/31-statistical-test-of-left-censorship/R/left_censorship.R#L13

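Just a reminder, here is the plot of the yeast dataset for number of present vs median minimum value: Figure_2-lod-1.png (https://github.com/MoseleyBioinformaticsLab/visualizationQualityControl/assets/1509626/c2c259aa-fd79-4fa1-b787-e14b54b04eaa)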
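
To illustrate the sample-size concern in this comment, here is a small Python sketch (standard library only, not the linked R code). It computes an exact two-sided binomial p-value against 0.5 for the counts quoted above, and for a smaller sample with nearly the same success fraction. The smaller sample (491 of 1000) is made up for illustration and is not a subset of the yeast data; the function names are hypothetical.

```python
def lower_tail_half(k, n):
    """Exact P(X <= k) for X ~ Binomial(n, 0.5), using integer arithmetic."""
    if k < 0:
        return 0.0
    k = min(k, n)
    term = 1   # C(n, 0)
    total = 1  # running sum of C(n, 0) ... C(n, i)
    for i in range(1, k + 1):
        term = term * (n - i + 1) // i  # C(n, i) from C(n, i - 1)
        total += term
    return total / 2**n


def two_sided_p_half(successes, trials):
    """Exact two-sided p-value against p = 0.5.

    The null distribution is symmetric at p = 0.5, so the two-sided
    p-value is twice the smaller tail, capped at 1.
    """
    tail = lower_tail_half(min(successes, trials - successes), trials)
    return min(1.0, 2 * tail)


# Counts quoted in this comment.
p_full = two_sided_p_half(19017, 38760)
# Hypothetical smaller sample with nearly the same success fraction.
p_small = two_sided_p_half(491, 1000)
print(p_full, p_small)
```

With nearly the same success fraction, the smaller sample is nowhere near a 0.05 threshold, which is consistent with the suspicion that the aggregate size, not the effect, drives the small p-value.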

hunter-moseley commented 6 months ago

Two caveats here: 1) The test needs to be one-sided. If a success is a non-missing value below the median, then the alternative hypothesis is that the fraction of successes is greater than 0.5.

2) We can add an odds-ratio cutoff to ensure a certain level of left censorship is present.
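
Both caveats can be sketched together in Python (standard library only, not the package's R code). This is a sketch under stated assumptions: a success is a non-missing value below the median, the odds are taken as successes / failures (that is, relative to null odds of 1), and the cutoff `min_odds=1.5` and `alpha=0.05` are placeholder values that the thread does not specify. The function names are hypothetical.

```python
def upper_tail_half(k, n):
    """Exact P(X >= k) for X ~ Binomial(n, 0.5)."""
    # By symmetry at p = 0.5, P(X >= k) equals P(X <= n - k).
    m = n - k
    if m < 0:
        return 0.0
    term = 1   # C(n, 0)
    total = 1  # running sum of C(n, 0) ... C(n, i)
    for i in range(1, min(m, n) + 1):
        term = term * (n - i + 1) // i  # C(n, i) from C(n, i - 1)
        total += term
    return total / 2**n


def left_censorship_call(below, trials, alpha=0.05, min_odds=1.5):
    """One-sided binomial test plus an odds cutoff (assumed thresholds).

    below: number of non-missing values below the median (successes).
    Returns (p_value, odds, flag); flag is True only when the one-sided
    test is significant AND the odds reach the cutoff.
    """
    p_value = upper_tail_half(below, trials)
    above = trials - below
    odds = below / above if above else float("inf")
    return p_value, odds, (p_value < alpha and odds >= min_odds)


# The counts quoted earlier, IF the 19017 successes were values below
# the median (the thread does not say which direction was counted).
p_yeast, odds_yeast, flag_yeast = left_censorship_call(19017, 38760)
# Made-up counts with a clear excess below the median.
p_demo, odds_demo, flag_demo = left_censorship_call(700, 1000)
print(p_yeast, odds_yeast, flag_yeast)
print(p_demo, odds_demo, flag_demo)
```

Requiring both conditions means that a very large number of trials cannot produce a positive call on its own; the excess below the median also has to be large enough.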


-- Hunter Moseley, Ph.D. -- Univ. of Kentucky Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center / Institute for Biomedical Informatics / UK Superfund Research Center Not just a scientist, but a fencer as well. My foil is sharp, but my mind sharper still.


rmflight commented 6 months ago

We ended up moving this over to the ICIKendallTau package, so guess we don't need it here anymore ...

