baoxingsong / dCNS

conserved non-coding sequence
MIT License
14 stars 4 forks source link

A problem about the parameters optimize using non-linear regression #25

Closed shangguandong1996 closed 2 years ago

shangguandong1996 commented 2 years ago

Hi, Dr song.

I think the below code in setting parameters manual is using nlsLM to fit the curve produced by ranSco

data = data.frame(x=s, y=s)
for ( i in 1:length(s) ){
    data[i, 2] =  length(which(data1$V1 >= data$x[i])) / nrow(data1)
}
x=data$x
y=data$y
library(minpack.lm)
nonlin_mod=nlsLM(y~(1-exp(-1*k*1000000*exp(-1*l*x))), control=nls.lm.control(maxiter=550), start=list(k=0.3, l=0.2))
plot(x, y, xlab="s", ylab="p-value")
lines(x,predict(nonlin_mod),col="red")
nonlin_mod

But accoring to your equation E = Kmne^(-l*ambdaS) p-value = e^(-E)

so why the y~(1-exp(-1*k*1000000*exp(-1*l*x))) is 1000000(1e6) instead of 1e10(100kb 100kb) ?

Best wishes Guandong Shang

baoxingsong commented 2 years ago

The random scores were generated using the dCNS ranSco command. dCNS ranSco is generating scores for a pair of sequences with 1000bp by default (the -l parameter).

shangguandong1996 commented 2 years ago

Please forgive me if I misunderstand something The true mn is 100kb*100kb, why I can use the l and k paramter predicted by 1000*1000?

baoxingsong commented 2 years ago

l and k are parameters of the distribution, which could be calculated using shorter sequences.

baoxingsong commented 2 years ago

https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html Please read this for more details.

shangguandong1996 commented 2 years ago

Thanks !