RAPLER / dst-1

Combine basic probability assignments with Dempster's rule of combination
6 stars 3 forks source link

Feature request #11

Closed garyzhubc closed 1 year ago

garyzhubc commented 1 year ago

Better than truncation: merge subsets with small bca into one bigger subset.

Reason: conflict occur quickly without normalizing after every iteration. Not sure if it's because of numerical instability or actual conflict. Need much more testing than it is already been done.

RAPLER commented 1 year ago

I reread the code of the "bcaTrunc" function. The disjunction of small mass subsets is what I effectively programmed, as you can see with the small example given in the help page of the function. Seems I had forgot this detail!

garyzhubc commented 1 year ago

I see. If this is the case then conflict is what's causing the problem. Maybe I could look into whether there's a conflict-less way of constructing beliefs?

garyzhubc commented 1 year ago

I printed a list of iteration number and conflict so as you can see conflict start to increase after 27-th belief that's combined.

[1] 2 [1] 0 [1] 3 [1] 0 [1] 4 [1] 0 [1] 5 [1] 0 [1] 6 [1] 0 [1] 7 [1] 0 [1] 8 [1] 0 [1] 9 [1] 0 [1] 10 [1] 0 [1] 11 [1] 0 [1] 12 [1] 0 [1] 13 [1] 0 [1] 14 [1] 0 [1] 15 [1] 0 [1] 16 [1] 0 [1] 17 [1] 0 [1] 18 [1] 0 [1] 19 [1] 0 [1] 20 [1] 0 [1] 21 [1] 0 [1] 22 [1] 0 [1] 23 [1] 0 [1] 24 [1] 0 [1] 25 [1] 0 [1] 26 [1] 0 [1] 27 [1] 0.7680963 [1] 28 [1] 0.9462207 [1] 29 [1] 0.9875284 [1] 30 [1] 0.9971652 [1] 31 [1] 0.9997388 [1] 32 [1] 0.9999759 [1] 33 [1] 0.9999978 [1] 34 [1] 0.9999998 [1] 35 [1] 1

garyzhubc commented 1 year ago

Right now I'm trying "skip combination if there's a conflict" and I'll see if it works.

RAPLER commented 1 year ago

Here is a simple example where conflict rise rapidly A <- bca(tt=matrix(c(1,0,1,1), ncol = 2, byrow = TRUE), m = c(0.9, 0.1), cnames = c("yes", "no"), idvar=1, varnames = "x" ) bcaPrint(A) NotA <- bca(tt=matrix(c(0,1,1,1), ncol = 2, byrow = TRUE), m = c(0.9, 0.1), cnames = c("yes", "no"), idvar=1, varnames = "x" ) bcaPrint(NotA) z <- dsrwon(A, NotA) cat("con = :", z$con) n <- 10 for (i in 1:n ) { z <- dsrwon(A,z) cat("\ ") cat("i = :" ,i, "con = :", z$con) }

The result: i = : 1 con = : 0.97929 i = : 2 con = : 0.9979104 i = : 3 con = : 0.9997908 i = : 4 con = : 0.9999791 i = : 5 con = : 0.9999979 i = : 6 con = : 0.9999998 i = : 7 con = : 1 i = : 8 con = : 1 i = : 9 con = : 1 i = : 10 con = : 1

garyzhubc commented 1 year ago

image

[1] "conflict 0" [1] "436-th skip" [1] "iteration 1564" [1] "conflict 0" [1] "437-th skip" [1] "iteration 1565" [1] "conflict 0" [1] "438-th skip" Error in ztgo[, 1:ncol(x$tt)] : incorrect number of dimensions Calls: bcaTrunc Execution halted

Still a lot of problems being applied to large datasets.

RAPLER commented 1 year ago

Bug presumably because there was only one subset with mass less than treshold value. I have added checks to handle this case.

RAPLER commented 1 year ago

Seems to be a good idea to skip the contradictory cases. Maybe combine them together, then combine the two final results?

garyzhubc commented 1 year ago

Now the code is working fine by skipping. However I still cannot get the kind of result that I'm expecting to see. Due to known research, the middle row is expected to have either higher beliefs or plausibility.

image

RAPLER commented 1 year ago

Were these three subsets added with the addTobca function? If so, that means there is no subset of the last two in your resulting bca. Is this possible?

garyzhubc commented 1 year ago

No. I would expect the first and third to be close to nothing and only the second shows up something.Sent from my iPhoneOn Jun 11, 2023, at 6:19 PM, Claude Boivin @.***> wrote: Were these three subsets added with the addTobca function? If so, that means there is no subset of the last two in your resulting bca. Is this possible?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

RAPLER commented 1 year ago

Do you normalize at every step or only at the end?

garyzhubc commented 1 year ago

Only at the end.Sent from my iPhoneOn Jun 12, 2023, at 6:45 AM, Claude Boivin @.***> wrote: Do you normalize at every step or only at the end?

—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you authored the thread.Message ID: @.***>

RAPLER commented 1 year ago

Bugs corrections 1) In the nzdsr function: I deleted the test with the measure of conflict (lines 37-38 of nzdsr. This indice does not have to intervene in the combination by Dempster's rule. This is only a decision aid in the analysis conflicting evidence.

2) In the dsrwon function: I removed the measure of conflict as a warning of conflicting evidence. Not necessary.

I made changes to the computation of the measure of conflic. . The result is now the same whether we normalize only at the end of many calls to dsrwon or normalize after each call to dsrwon.

As it is now, the code is not clean, but with these changes you should be able to process your data file without skipping records.

Good luck.

garyzhubc commented 1 year ago

The new version doesn't give any conflict. That's a little strange.

[1] "iteration 13072" [1] "conflict 0" [1] "iteration 13073" [1] "conflict 0" [1] "iteration 13074" [1] "conflict 0" [1] "iteration 13075" [1] "conflict 0" [1] "iteration 13076" [1] "conflict 0" [1] "iteration 13077" [1] "conflict 0" [1] "iteration 13078" [1] "conflict 0" [1] "iteration 13079" [1] "conflict 0" [1] "iteration 13080" [1] "conflict 0" [1] "iteration 13081"

garyzhubc commented 1 year ago

That's very strange. Everything has zero conflict. Is it possible? Before it has a lot of conflict increasing gradually until it reaches one.

RAPLER commented 1 year ago

Indeed, it is strange. I did some tests with conflicting data on a small loop. I obtain a contradiction index at each step of the loop. I suggest you retain or print the sum of the first row of the "tt" matrix at each step, in addition to the "$con" value. If the value $con is zero, the sum of the first line must NOT be zero.

RAPLER commented 1 year ago

Also maybe it has to do with truncation, if you use it. I'll look into this tomorrow, see if I have taken into account the case where the empty set is present with a mass less than the threshold value.

garyzhubc commented 1 year ago

Here's how I did this:

# load cache
library(bigsnpr)

pcsk9_updown_bigsnp_obj <- snp_attach("../data/ukb22828_c1_b0_v3_pcsk9_updown.rds")
ukb_data_ldl <- readRDS("../data/ukb_data_ldl.rds")

library(dst)

# initialize loop
N <- nrow(pcsk9_updown_bigsnp_obj$genotypes)
M <- ncol(pcsk9_updown_bigsnp_obj$genotypes)
ukb_data_ldl$ldl_highlow[1:30]
# check index
# if high, use <= 1
#a <- 1e-5
a <- 0.9
trunc_thres_upper <- 1
trunc_thres_lower <- 0.1

bpa <- bca(rbind(as.integer(pcsk9_updown_bigsnp_obj$genotypes[ukb_data_ldl[1, 4], ] <= 2-trunc_thres_upper), rep(1, M)), c(a, 1-a), pcsk9_updown_bigsnp_obj$map$rsid)

# rowwise Dempster's rule of combination
K <- 30
nskip <- 0
#K <- nrow(ukb_data_ldl)
for (i in 2:K) {
  print(paste("iteration",i))
  if (ukb_data_ldl$ldl_highlow[i] == FALSE) {
    bpa_new <- bca(rbind(as.integer(pcsk9_updown_bigsnp_obj$genotypes[ukb_data_ldl[i, 4], ] >= trunc_thres_lower), rep(1, M)), c(a, 1-a), pcsk9_updown_bigsnp_obj$map$rsid)
    bpa_new <- dsrwon(bpa, bpa_new, mcores = "yes")
  } else if (ukb_data_ldl$ldl_highlow[i] == TRUE) {
    bpa_new <- bca(rbind(as.integer(pcsk9_updown_bigsnp_obj$genotypes[ukb_data_ldl[i, 4], ] <= 2-trunc_thres_upper), rep(1, M)), c(a, 1-a), pcsk9_updown_bigsnp_obj$map$rsid)
    bpa_new <- dsrwon(bpa, bpa_new, mcores = "yes")
  }
  print(paste("sum(bpa$tt)",sum(bpa$tt[1,])))
  print(paste("conflict",bpa$con))
  #if(bpa_new$con == 0) {
  #  bpa <- bpa_new
  #} else {
  #  nskip <- nskip + 1
  #  print(paste0(nskip,"-th skip"))
  #}
  if(i %% 5 == 0) {
    bpa <- bcaTrunc(bpa, 0.001)
  }
  saveRDS(bpa, "../data/bpa.rds")
}
garyzhubc commented 1 year ago
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 635"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 636"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 637"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 638"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 639"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 640"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 641"
[1] "sum(bpa$tt) 172"
[1] "conflict 0"
[1] "iteration 642"

So not in the way you suggested.

garyzhubc commented 1 year ago

And yes, I use truncation.

garyzhubc commented 1 year ago

If I don't use truncation, the combination progresses interestingly much faster than before, without any conflict. I think something is wrong.

RAPLER commented 1 year ago

Could you do a test with mcores = "no" on a few cases, to see if we have the same problem?

garyzhubc commented 1 year ago

I found some issues with my code, so no problem yet seen on the update. Now I'm able to run this successfully on large number of patients.

RAPLER commented 1 year ago

Glad to know things are starting to work better. I hope the results will meet your expectations.

Le jeu. 15 juin 2023 à 20:25, Peiyuan Zhu @.***> a écrit :

Closed #11 https://github.com/RAPLER/dst-1/issues/11 as completed.

— Reply to this email directly, view it on GitHub https://github.com/RAPLER/dst-1/issues/11#event-9545910957, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA5D5IP5AAHFXGZJE4EQMTDXLORXPANCNFSM6AAAAAAYY7A35Y . You are receiving this because you commented.Message ID: @.***>