calculate ranks of things for visualization of missingness

rmflight commented 6 months ago

In addition to the new test of cause of missingness, it might also be really helpful to visualize the missingness patterns across samples using the naniar package (which shows location and percent missingness).

However, this is potentially more powerful if the features are ordered in some way with respect to the median value in each sample. But we can't re-order each sample independently, or we lose sight of features that are missing in common across samples.

What if we calculate the median rank of the feature across samples, and then reorder them by the median rank, and then visualize them? This should help inform whether ICI-Kt is appropriate, or if something else might be better.
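
A minimal sketch of the median-rank ordering, in Python for illustration only (the package itself is R, and this is not its implementation). The names `sample_ranks` and `median_rank_order` are hypothetical, and two choices are assumptions: missing values are left out of the within-sample ranking, and a feature with no observed values gets a median rank of 0.

```python
from statistics import median


def sample_ranks(values):
    """Rank the observed values of one sample (1 = smallest).

    Missing values (None) stay None. Ties are broken by position,
    which is a simplification for this sketch.
    """
    observed = sorted((v, i) for i, v in enumerate(values) if v is not None)
    ranks = [None] * len(values)
    for rank, (_, i) in enumerate(observed, start=1):
        ranks[i] = rank
    return ranks


def median_rank_order(data):
    """data maps feature name -> list of values, one per sample.

    Returns (features sorted by median rank, highest first; median ranks).
    """
    features = list(data)
    n_samples = len(next(iter(data.values())))
    per_feature = {f: [] for f in features}
    for s in range(n_samples):
        column = [data[f][s] for f in features]
        for f, r in zip(features, sample_ranks(column)):
            if r is not None:
                per_feature[f].append(r)
    # Assumption: a feature never observed gets median rank 0 (the lowest).
    med = {f: (median(r) if r else 0) for f, r in per_feature.items()}
    order = sorted(features, key=lambda f: med[f], reverse=True)
    return order, med


# Toy data: 4 features x 3 samples, None marks a missing value.
data = {
    "f1": [10.0, 11.0, 9.5],
    "f2": [5.0, None, 5.5],
    "f3": [1.0, None, None],
    "f4": [7.0, 6.0, 8.0],
}
order, med = median_rank_order(data)
print(order, med)
```

Every sample keeps the same feature order, so features missing in common across samples still line up in the plot.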

hunter-moseley commented 6 months ago

What are the x and y axes of this plot?

(In reply to Robert M Flight's comment of Thu, Apr 11, 2024 at 9:31 AM.)

-- Hunter Moseley, Ph.D. -- Univ. of Kentucky Professor, Dept. of Molec. & Cell. Biochemistry / Markey Cancer Center / Institute for Biomedical Informatics / UK Superfund Research Center Not just a scientist, but a fencer as well. My foil is sharp, but my mind sharper still.

Email: @. (work) @. (personal) Phone: 859-218-2964 (office) 859-218-2965 (lab) 859-257-7715 (fax) Web: http://bioinformatics.cesb.uky.edu/ Address: CC434 Roach Building, 800 Rose Street, Lexington, KY 40536-0093

rmflight commented 6 months ago

Sorry Hunter, I should have given an example.

The x-axis is the samples, the y-axis is the features, and each cell is colored by whether the value is missing or not. It's essentially an overview of all of the values in the dataset.
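
To make the layout concrete, here is a small text-only sketch of such a map in Python (illustrative only; it does not use or reproduce the naniar API). Columns are samples, rows are features, `#` marks an observed value and `.` a missing one. The function names and the toy data are hypothetical.

```python
def missingness_map(data, features=None):
    """Render one text row per feature; each character is one sample."""
    features = features or list(data)
    lines = []
    for f in features:
        cells = "".join("." if v is None else "#" for v in data[f])
        lines.append(f"{f:>4} {cells}")
    return "\n".join(lines)


def percent_missing(data):
    """Overall percentage of missing entries in the whole matrix."""
    values = [v for row in data.values() for v in row]
    return 100.0 * sum(v is None for v in values) / len(values)


# Toy data: 4 features (rows) x 3 samples (columns).
toy = {
    "f1": [10.0, 11.0, 9.5],
    "f2": [5.0, None, 5.5],
    "f3": [1.0, None, None],
    "f4": [7.0, 6.0, 8.0],
}
print(missingness_map(toy))
print(f"{percent_missing(toy):.1f}% missing")
```

Passing a different `features` list changes only the row order, which is where a rank-based ordering would plug in.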

Here is a fake one I did for the testing-left-censorship vignette in the package. This one I purposely started with ordered data, then added a little bit of noise to the values for replicates, and then introduced the majority of the missingness in the lower order features to force them to be below the median. This basically acts like a visual representation of the missingness in the data. If it's ordered by rank of the feature, then it's almost a visual of the binomial test.

There are only 100 missing values in this example, with 80 of them below the median.

examine-missingness-1

   trials success class
 1   1900    1520     A

 $binomial_test

  Exact binomial test

 data:  total_success and total_trials
 number of successes = 1520, number of trials = 1900, p-value < 2.2e-16
 alternative hypothesis: true probability of success is greater than 0.5
 95 percent confidence interval:
  0.7843033 1.0000000
 sample estimates:
 probability of success 
                    0.8
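
The numbers in this output can be checked with a short standard-library Python sketch (illustrative only, not the package's code). It computes the exact one-sided p-value for 1520 successes in 1900 trials under a null probability of 0.5, and finds a one-sided 95% lower bound by bisection on the binomial upper tail (a Clopper-Pearson style bound, which should correspond to the interval reported above). Function names are hypothetical.

```python
from fractions import Fraction
from math import comb, exp, lgamma, log


def binom_p_greater(successes, trials):
    """Exact P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return float(Fraction(tail, 2 ** trials))


def upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p), terms built in log space."""
    total = 0.0
    for k in range(successes, trials + 1):
        log_pmf = (lgamma(trials + 1) - lgamma(k + 1) - lgamma(trials - k + 1)
                   + k * log(p) + (trials - k) * log(1 - p))
        total += exp(log_pmf)
    return total


def lower_bound(successes, trials, alpha=0.05):
    """Smallest p whose upper tail reaches alpha, found by bisection."""
    lo, hi = 0.0, successes / trials
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(successes, trials, mid) < alpha:
            lo = mid  # tail too small, the bound lies higher
        else:
            hi = mid
    return (lo + hi) / 2


p_value = binom_p_greater(1520, 1900)
print(p_value < 2.2e-16)           # consistent with "p-value < 2.2e-16"
print(1520 / 1900)                 # the sample estimate
print(lower_bound(1520, 1900))     # compare with 0.7843033 in the output above
```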

hunter-moseley commented 6 months ago

Is the example showing features ordered by median normalized rank across samples? The zero should be the lowest rank, which means the x-axis order should be reversed.

(In reply to Robert M Flight's comment of Thu, Apr 11, 2024 at 1:11 PM. The figure in that comment is examine-missingness-1.png: https://github.com/MoseleyBioinformaticsLab/ICIKendallTau/assets/1509626/8652934e-d5c2-407f-9d61-cf88ebf77389)

rmflight commented 6 months ago

OK, after our discussion, and working with a real dataset (yeast from Barton), here are the two missing data plots:

No ordering: yeast_unordered

Rank ordering: yeast_rank_order

hunter-moseley commented 6 months ago

The second graph is just rank order of the features. Correct?

(In reply to Robert M Flight's comment of Thu, Apr 11, 2024 at 7:32 PM. The figures in that comment are yeast_unordered.png: https://github.com/MoseleyBioinformaticsLab/ICIKendallTau/assets/1509626/fdc10fc4-6e68-422e-a5a7-1aa599f60b5b and yeast_rank_order.png: https://github.com/MoseleyBioinformaticsLab/ICIKendallTau/assets/1509626/93a73fda-c063-4163-af70-d0ea53c86b3d)

rmflight commented 6 months ago

Rank order of features, and then the samples are ordered by percentage missing.
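
The sample ordering can be sketched as follows, in Python for illustration only (not the package's code). The function name `order_samples_by_missing`, the sample names, and the ascending direction are assumptions; the thread does not say which direction the plot uses.

```python
def order_samples_by_missing(data, sample_names):
    """data maps feature -> list of values, one per sample (None = missing).

    Returns (sample names sorted by percent missing, percent per sample).
    """
    n_features = len(data)
    pct = {}
    for s, name in enumerate(sample_names):
        n_missing = sum(row[s] is None for row in data.values())
        pct[name] = 100.0 * n_missing / n_features
    # Assumption: least-missing samples first.
    order = sorted(sample_names, key=lambda name: pct[name])
    return order, pct


values = {
    "f1": [10.0, 11.0, 9.5],
    "f2": [5.0, None, 5.5],
    "f3": [1.0, None, None],
    "f4": [7.0, 6.0, 8.0],
}
sample_order, pct_missing = order_samples_by_missing(values, ["s1", "s2", "s3"])
print(sample_order, pct_missing)
```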

rmflight commented 6 months ago

Final addition to this function: it now also spits out the median rank and number of missing entries for each feature (row), so we can easily create a plot like this one, where we can see that the median rank is directly a function of the number of missing entries! This is for the yeast dataset.
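
A rough Python sketch of that per-feature summary, for illustration only (the real function is in the R package and its output format is not shown here). It assumes the within-sample ranks are already computed, with `None` where a value is missing; `feature_summary` and the toy ranks are hypothetical.

```python
from statistics import median


def feature_summary(ranks):
    """ranks maps feature -> list of within-sample ranks (None = missing).

    Returns one row per feature with its median rank and missing count,
    sorted by number of missing entries.
    """
    rows = []
    for feature, r in ranks.items():
        observed = [x for x in r if x is not None]
        rows.append({
            "feature": feature,
            # Assumption: a feature never observed gets median rank 0.
            "median_rank": median(observed) if observed else 0,
            "n_missing": len(r) - len(observed),
        })
    return sorted(rows, key=lambda row: row["n_missing"])


toy_ranks = {
    "f1": [4, 2, 3],
    "f2": [2, None, 1],
    "f3": [1, None, None],
    "f4": [3, 1, 2],
}
for row in feature_summary(toy_ranks):
    print(row)
```

Plotting `median_rank` against `n_missing` from rows like these gives the kind of figure described in the comment.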

fig-yeast-nna-1

hunter-moseley commented 6 months ago

Very nice graph and functionality!

(In reply to Robert M Flight's comment of Fri, Apr 12, 2024 at 10:37 AM. The figure in that comment is fig-yeast-nna-1.png: https://github.com/MoseleyBioinformaticsLab/ICIKendallTau/assets/1509626/daf0bb9f-771e-4931-917a-771a5885bb2b)
