Closed dkopasker closed 1 year ago
@dkopasker I'm working on it, no result so far. Do you want me to prioritize this?
@dkopasker OK, I have a bit more information now.
I feel like you should make the columns wider. I failed to find any record with the value 0.54180604, so I assume you see truncated versions of the actual numbers. Nothing happened to the data itself; I observe exactly the same behaviour when I use LibreOffice. There is a chance Excel rounds off the values somehow, but I'm not an expert and can't say for sure. Using R should be safer.
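To illustrate the point about display truncation (a hedged Python sketch; the long value below is made up for illustration, not taken from the data):

```python
# A narrow spreadsheet column may display a rounded value while the
# stored number is intact. Illustrative only, not Excel-specific.
x = 0.541806042911153        # hypothetical full-precision stored value
shown = f"{x:.8f}"           # what a truncated display might show
print(shown)                 # 0.54180604
print(repr(x))               # the full stored value survives
```

Two records that look identical at eight decimal places can therefore still differ in the underlying data.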
Another problem with this data is that it has little to no variance. Well, that's a very bold statement, but still, `out_ghqcase_baseline` in `grp_age25` in 2020 has the following statistics:
| Metric | Value |
| --- | --- |
| Mean | 0.522342406859302 |
| Standard Error | 0.000316238516741248 |
| Mode | 0.518830885560564 |
| Median | 0.522029954922205 |
| First Quartile | 0.515561372891216 |
| Third Quartile | 0.528864330376618 |
| Variance | 0.000100006799470704 |
| Standard Deviation | 0.0100003399677563 |
| Kurtosis | -0.0989715777778066 |
| Skewness | 0.0915798025087873 |
| Range | 0.065516780649057 |
| Minimum | 0.493383742911153 |
| Maximum | 0.55890052356021 |
| Sum | 522.342406859302 |
| Count | 1000 |
This increases the number of total values that look identical when truncated. I could build another table for you with these metrics for each subgroup if you think that's a good idea.
If that deviates greatly from your expectations just let me know and I'll investigate the issue even further.
Hi @vkhodygo,
To be more specific, runs `797` and `803` produce identical values for the variable `out_ghqcase_baseline` when `time == 2020` and `grp_age25 == TRUE`. This is just one example of many. This is a problem because your R code gives a tied rank; in this case, the tie is a `rank_ghqcase_baseline` of 974.5. This means the upper level for our confidence interval, rank 975, does not appear for this group. You will find many tied ranks around 25, 500, and 975.
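The averaged ("midrank") tie behaviour can be sketched as follows (a hedged Python sketch of what R's `rank()` does with its default `ties.method = "average"`; the values are made up, not taken from the data):

```python
def midranks(xs):
    """Rank values 1..n, averaging ranks within tied groups,
    mirroring R's rank(..., ties.method = "average")."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        # extend j over the run of values tied with position i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

values = [0.51, 0.52, 0.54180604, 0.54180604, 0.56]
result = midranks(values)
print(result)  # [1.0, 2.0, 3.5, 3.5, 5.0]
```

The two tied values share rank 3.5, so the integer ranks 3 and 4 never appear; the same mechanism makes a confidence-interval bound such as rank 975 vanish from a group.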
The tied rank is being identified using the original csv file in R. I am using a csv of the output generated by your R file. Perhaps truncation has occurred at some point in generating that file?
Statistics using the file eff.csv are very similar to what you report above. One exception is that 1,001 observations exist. Perhaps this should be a separate issue?
> To be more specific, runs `797` and `803` produce identical values for the variable `out_ghqcase_baseline` when `time == 2020` and `grp_age25 == TRUE`
That is a completely different story: only identical inputs can produce identical outputs. The chances of a collision here are negligibly small, but we should not exclude that possibility either. Thanks for the clarification, I'm already working on it.
Just some funny numbers: out of 126126 values in the `out_ghqcase_baseline` column, only 111450 are actually unique.
@dkopasker
> One exception is that 1,001 observations exist.
That's because we generate metrics for every run individually and for all of them at once. The latter should not be included in the resulting statistics.
> Perhaps this should be a separate issue?
Yes, but only if you are not sure these numbers are valid, i.e., you expected more spread or something like that.
@dkopasker
I'm afraid you'll have to make do with this data until we find a way to scale the simulations. Please note that everything discussed below relates to `out_ghq_baseline` where `run == 797`, `run == 803`, and `year == 2020`. However, it seems to be applicable to other pairs of `run, year` as well.
When you extract data for a given `run` and a given `year` from `baseline.csv` and filter it to people aged [25, 45), you end up with 6877 values in total.
Switching to the discrete representation that is 1 when `out_ghq_baseline <= 24` and 0 otherwise results in a variety of possible combinations of values (10001..., 10010..., etc.). However, and this point is very important, the group mean can take no more than 6878 distinct values (one for each possible count of ones, 0 through 6877), since the mean is permutation invariant as a consequence of the commutativity of the sum. Thus, you have duplicates here.
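The permutation-invariance argument can be demonstrated with a small Python sketch (run labels and the count of positive cases are illustrative, not extracted from the data):

```python
# With n binary indicators, the group mean depends only on the count of
# ones, so two runs with different individuals can share the same mean.
n = 6877
a = [1] * 100 + [0] * (n - 100)  # hypothetical run "797": 100 cases
b = [0] * (n - 100) + [1] * 100  # hypothetical run "803": different people,
                                 # same number of cases
assert a != b                    # the case patterns differ
assert sum(a) / n == sum(b) / n  # yet the group means are identical
print(sum(a) / n)
```

Any of the `C(6877, 100)` arrangements of 100 ones collapses onto the same mean, which is why collisions in the aggregated column are unsurprising.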
You might have noticed that values in other columns are not the same, for example, `out_ghq_baseline = 23.4037282157896` when `run == 797` and `out_ghq_baseline = 23.3710022511753` when `run == 803`. This is not solid proof, but my explanation seems plausible.
This does seem plausible, but we can construct a stronger proof: extract only these two runs from the output data. They should have identical counts of positive cases in this year, but the case IDs should be different.
In the future, we can reduce the number of ties by increasing the sample size.
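The proposed check can be sketched in Python (all row data below is made up; in practice the rows would come from `baseline.csv` filtered by run and year):

```python
# Compare two hypothetical runs: same count of positive cases,
# but no overlap between the IDs of the positive individuals.
run_797 = {"id": [11, 12, 13], "case": [1, 0, 1]}
run_803 = {"id": [21, 22, 23], "case": [0, 1, 1]}

same_count = sum(run_797["case"]) == sum(run_803["case"])
positive_797 = {i for i, c in zip(run_797["id"], run_797["case"]) if c}
positive_803 = {i for i, c in zip(run_803["id"], run_803["case"]) if c}

print(same_count)                    # True: identical case counts
print(positive_797 & positive_803)   # set(): disjoint case IDs
```

Equal counts with disjoint IDs would confirm the means collide through permutation invariance rather than through duplicated inputs.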
@dkopasker
Here you go, the data extracted from the `baseline` scenario:
797_clear.csv
803_clear.csv
The counts are the same, but the `id`s and `dhm` values are different.
@dkopasker
Do you think we can close this issue?
Hi @vkhodygo,
The R code used to aggregate the results from 1,000 runs of the simulation produces an unusual number of ties up to the eighth decimal place. For example, two observations for `out_ghqcase_baseline` in `grp_age25` in 2020 have a value of 0.54180604. This happens multiple times across various outcomes, groups, and years. Could you please review the code to ensure there is not an error?