aertslab / cisTopic

cisTopic: Probabilistic modelling of cis-regulatory topics from single cell epigenomics data
135 stars 29 forks source link

Errors when trying to fit models with less than three topic numbers #44

Closed wkopp closed 2 years ago

wkopp commented 4 years ago

Hi,

selectModel fails when fitting the model with runCGSModels using as the topic argument a single number (e.g. topic=c(30)).

The error message that I get is

Error in `$<-.data.frame`(`*tmp*`, "second_derivative", value = c(-Inf,  : 
  replacement has 2 rows, data has 1
Calls: selectModel -> $<- -> $<-.data.frame

I also get an error when running runCGSModels with only two topic numbers (e.g. topic=c(29,30)), but then the error is different:

Error in plot.window(...) : need finite 'xlim' values
Calls: selectModel ... plot -> plot -> plot.default -> localWindow -> plot.window
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf
Execution halted

When run the model for more then two topic numbers (e.g. topic=c(29,30,31)) it seems to work.

Best, Wolfgang

cbravo93 commented 4 years ago

Hi @wkopp !

Here, the second derivative method is not meant to work with less than three points. In the approach we use (based on the central difference, inspired by https://stackoverflow.com/questions/4471993/compute-the-elbow-for-a-curve-automatically-and-mathematically), the first derivative measures the slope of the line between two points in the likelihood curve (the change between two points), the second measures the difference between two consecutive slopes (or the change of the change, the point with the maximum curvature); so you need at least two slopes (or three points). I have added an error message for this.

Thanks for reporting!

C

wkopp commented 4 years ago

Hi,

would it be possible then to ignore the first and second derivative computation if there are too little numbers of topics? Because, in case the user already knows what number to chose, it requires to nevertheless compute other dummy topic numbers which requires time and resources.

Thank you.

cbravo93 commented 4 years ago

Sure! Just use method='maximum', select='Your number of topics'. Nevertheless, for a proper topic selection I would recommend to run models in a bigger topic space.

Cheers,

C

wkopp commented 4 years ago

I see. However, what is the purpose of having to specify method="maximum" if a specific topic number is selected anyways. In my opinion, it would be better not to have to specify this argument, because for a user this isn't intuitive and also it is not backward compatible, which would be nice and probably possible in this case.

Another issue with the second derivative computation seems to be that it is used by default with runCGSModels, right? However, from the documentation of the type parameter in the selectModel method, you recommend against using the derivative method with collapsed Gibbs sampling. So perhaps the selectModel method should be used in runCGSModels and runWarpLDAModels following the respective recommended model selection criteria.

Best, Wolfgang

cbravo93 commented 4 years ago

Hi!

Sometimes the differences between models are small (in likelihood, or 2nd derivative), so we recommend to select the less complex model. If we only allow the automatic selection, it wouldn't be possible to manually change to other proper models.

selectModel is generally run after running the models (despite they are CGS or WarpLDA), we use derivative as default to agree with the latest WarpLDA estimation. I will add a warning if the models are CGS based.

Cheers!

C