Unusual number of ties

dkopasker commented 1 year ago

Hi @vkhodygo,

The R code used to aggregate the results from 1,000 runs of the simulation produces an unusual number of ties, identical up to the eighth decimal place. For example, two observations for out_ghqcase_baseline in grp_age25 in 2020 have a value of 0.54180604. This happens multiple times across various outcomes, groups, and years. Could you please review the code to make sure there is no error?

vkhodygo commented 1 year ago

@dkopasker I'm working on it, no result so far. Do you want me to prioritize this?

vkhodygo commented 1 year ago

@dkopasker OK, I have a bit more information now.

I feel like you should make the columns wider. I failed to find any record with the value 0.54180604, so I assume you are seeing truncated versions of the actual numbers. Nothing happened to the data itself; I observe exactly the same behaviour when I use LibreOffice. There is a chance Excel rounds off the values somehow, but I'm not an expert and can't say for sure. Using R should be safer.
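The display-rounding hypothesis above can be illustrated with a minimal Python sketch (Python is used here only for illustration; the project code is in R). The two values are hypothetical and chosen so that they differ only beyond the eighth decimal place; they are not taken from the data.

```python
# Hypothetical values (not from the data set) that differ only
# beyond the eighth decimal place.
a = 0.541806041234
b = 0.541806043987

# What a spreadsheet cell limited to eight decimals would display.
shown_a = f"{a:.8f}"
shown_b = f"{b:.8f}"

print(shown_a, shown_b)  # both display as 0.54180604
print(a == b)            # the stored values are still different
```

If the values are compared at full precision rather than as displayed text, a tie caused only by rounding disappears.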

Another problem with this data is that it has little to no variance. Well, that's a very bold statement, but still, out_ghqcase_baseline in grp_age25 in 2020 has the following statistics:

Mean                0.522342406859302
Standard Error      0.000316238516741248
Mode                0.518830885560564
Median              0.522029954922205
First Quartile      0.515561372891216
Third Quartile      0.528864330376618
Variance            0.000100006799470704
Standard Deviation  0.0100003399677563
Kurtosis           -0.0989715777778066
Skewness            0.0915798025087873
Range               0.065516780649057
Minimum             0.493383742911153
Maximum             0.55890052356021
Sum                 522.342406859302
Count               1000

This increases the total number of values that look identical when truncated. I could build another table for you with these metrics for each subgroup if you think that's a good idea.

If that deviates greatly from your expectations just let me know and I'll investigate the issue even further.

dkopasker commented 1 year ago

Hi @vkhodygo,

To be more specific, runs 797 and 803 produce identical values for the variable out_ghqcase_baseline when time == 2020 and grp_age25 == TRUE. This is just one example of many. This is a problem because your R code gives a tied rank. In this case, the tie is a rank_ghqcase_baseline of 974.5. This means the upper limit of our confidence interval, rank 975, does not appear for this group. You will find many tied ranks around 25, 500, and 975.
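To show how a single tie removes rank 975, here is a small Python sketch of average-rank tie handling, which is the default behaviour of R's rank() (ties.method = "average"). Whether the aggregation script uses rank() in exactly this way is an assumption. The function average_ranks and the data are hypothetical.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Extend `end` over the block of values equal to the one at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of the 1-based positions
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Hypothetical data: 1000 distinct values, then one forced tie between
# the observations that would otherwise hold ranks 974 and 975.
values = [i / 1000 for i in range(1, 1001)]
values[974] = values[973]

ranks = average_ranks(values)
print(ranks[973], ranks[974])  # both tied observations get 974.5
print(975.0 in ranks)          # False: no observation holds rank 975
```

A lookup such as "the row whose rank equals 975" then returns nothing for this group, which matches the symptom described above.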

The tied rank is being identified using the original csv file in R. I am using a csv of the output generated by your R file. Perhaps truncation has occurred at some point in generating that file?

dkopasker commented 1 year ago

Statistics using the file eff.csv are very similar to what you report above. One exception is that 1,001 observations exist. Perhaps this should be a separate issue?

vkhodygo commented 1 year ago

> To be more specific, runs 797 and 803 produce identical values for the variable out_ghqcase_baseline when time == 2020 and grp_age25 == TRUE

That is a completely different story: only identical inputs can produce identical outputs. The chances of a collision here are negligibly small, but we should not exclude this possibility either. Thanks for the clarification, I'm already working on that.

Just some funny numbers: out of 126126 values in the out_ghqcase_baseline column only 111450 are actually unique.

vkhodygo commented 1 year ago

@dkopasker

> One exception is that 1,001 observations exist.

That's because we generate metrics for every run individually and for all of them at once. The latter should not be included in the resulting statistics.

> Perhaps this should be a separate issue?

Yes, but only if you are not sure these numbers are valid, i.e., you expected more spread or something like that.

vkhodygo commented 1 year ago

@dkopasker

I'm afraid you'll have to make do with this data until we find a way to scale the simulations. Please note that everything discussed below relates to out_ghq_baseline where run == 797, run == 803 and year == 2020. However, it seems to apply to other (run, year) pairs as well.

When you extract data for a given run and for a given year from baseline.csv and filter it to have people aged [25, 45) you end up with 6877 values in total.

Switching to the discrete representation, which is 1 when out_ghq_baseline <= 24 and 0 otherwise, results in a variety of possible combinations of values (10001..., 10010..., etc.). However, and this point is very important, the group mean can take no more than 6878 distinct values (k/6877 for k = 0, ..., 6877), since a mean is permutation invariant as a consequence of the commutativity of addition. Thus, you have duplicates here.
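A rough Python simulation makes the argument concrete: a mean of 6877 binary values can only land on a multiple of 1/6877, so 1000 runs drawn from a narrow range are bound to collide. The sketch below is a toy model, not the real simulation. It assumes independent draws with a case probability of 0.52 (close to the reported mean), and the seed is arbitrary.

```python
import random

N_PEOPLE = 6877  # group size quoted in the thread
N_RUNS = 1000    # number of simulation runs
P_CASE = 0.52    # assumed case probability, close to the reported mean

rng = random.Random(0)  # arbitrary seed, only for reproducibility

# Each run: count how many of the N_PEOPLE binary indicators equal 1.
counts = [
    sum(rng.random() < P_CASE for _ in range(N_PEOPLE))
    for _ in range(N_RUNS)
]
means = [c / N_PEOPLE for c in counts]

distinct = len(set(means))
print(f"{distinct} distinct means among {N_RUNS} runs")
```

Runs with the same count of cases give means that agree to every digit, even though the individuals behind the counts differ, so the ties are exact rather than a rounding effect.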

You might have noticed that values in other columns are not the same; for example, out_ghq_baseline = 23.4037282157896 when run == 797 and out_ghq_baseline = 23.3710022511753 when run == 803. This is not solid proof, but my explanation seems plausible.

797 803

dkopasker commented 1 year ago

This does seem plausible, but we can get stronger proof by extracting only these two runs from the output data. They should have identical counts of positive cases in this year, but the case IDs should be different.
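The check proposed here could look roughly like the following Python sketch. The records are made-up toy data in a hypothetical (person_id, is_case) layout, not the real columns of the output files. The helper case_ids is also hypothetical.

```python
# Hypothetical person-level records for two runs: (person_id, is_case).
run_a = [(1, 1), (2, 0), (3, 1), (4, 0), (5, 1)]
run_b = [(1, 0), (2, 1), (3, 1), (4, 1), (5, 0)]


def case_ids(records):
    """Return the set of person IDs flagged as a case."""
    return {pid for pid, is_case in records if is_case == 1}


ids_a = case_ids(run_a)
ids_b = case_ids(run_b)

same_count = len(ids_a) == len(ids_b)  # equal counts give equal group means
same_people = ids_a == ids_b           # identical IDs would suggest duplicated runs

print(same_count, same_people)  # True False for this toy data
```

Equal counts with different IDs would support the explanation that the tie is a coincidence of the discrete mean. Equal counts with identical IDs would instead suggest that the two runs shared their inputs.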

In future, we can reduce the number of ties by increasing the sample size.

vkhodygo commented 1 year ago

@dkopasker Here you go, the data extracted from the baseline scenario: 797_clear.csv 803_clear.csv

The counts are the same, but the ids and dhm values are different.

vkhodygo commented 1 year ago

@dkopasker

Do you think we can close this issue?

MRC-CSO-SPHSU / effect_estimates

Unusual number of ties #8