BatadaLab / scID

scID
GNU General Public License v3.0
12 stars 2 forks source link

mclust error #4

Closed adomingues closed 3 years ago

adomingues commented 4 years ago

Hi again,

I was trying to run the scID to identify cluster in a dataset based on a published list of marker genes:

scID_output_zhang <- scid_multiclass(
  target_gem = human_zhong_raw,
  reference_gem = human_zhong_norm,  ## log normalized, seurat 
  markers = zhong_markers,
  normalize_reference = TRUE,
  estimate_weights_from_target = TRUE,
  only_pos = FALSE
)

But I am running into this error, which appears to come from mclust:

|  79%Error in if (sigmasq > signif(.Machine$double.xmax, 6)) { :
  missing value where TRUE/FALSE needed

Any ideas what is going on?

Cheers, António


Package versions:

 mclust          * 5.4.6      2020-04-11 [1] CRAN (R 3.6.3)
 scID            * 2.1        2020-07-14 [1] local
kboufea commented 4 years ago

Hi António,

I haven't come across this issue before. mclust is used in both stage 2 and 3. Do you know in which stage this happens?

Are the data published? Would it be possible to share them with me so I can have a look and try to spot the error? I would only need the target_gem and the list of markers.

Also, just for simplicity, since you are providing a list of markers and you are calculating weights from the target you don't need a reference gem, so could simplify it to

scID_output_zhang <- scid_multiclass(
  target_gem = human_zhong_raw,
  markers = zhong_markers,
  estimate_weights_from_target = TRUE
)

Cheers, Katerina

adomingues commented 4 years ago

Hi Katerina,

First of all thank you for looking into this. I was also very surprised because until now, and I tested scID with a few published datasets, I have not encountered any issues. I also ran the command you suggested and the error persists.

Anyway, this is published data so I have put both objects in an Rdata file which you can find here: https://owncloud.gwdg.de/index.php/s/y1qMLgjDmaH5MOw Let me know if the link doesn't work.

Cheers, António

adomingues commented 4 years ago

Hi Katerina,

I forgot to answer this question:

mclust is used in both stage 2 and 3. Do you know in which stage this happens?

It seems to be at Stage 3. Stage 2 completes without errors:

Stage 2: Estimate weights of signature genes from target                                                                               
    __  ___________    __  _____________                                                                                               
   /  |/  / ____/ /   / / / / ___/_  __/                                                                                               
  / /|_/ / /   / /   / / / /\__ \ / /                                                                                                  
 / /  / / /___/ /___/ /_/ /___/ // /                                                                                                   
/_/  /_/\____/_____/\____//____//_/    version 5.4.6                                                                                   
Type 'citation("mclust")' for citing this R package in publications.                                                                   

Attaching package: ‘mclust’                                                                                                            

The following object is masked from ‘package:purrr’:                                                                                   

    map                                                                                                                                

Progress: 100%  Done!Stage 3.1-2: Calculate scores and find matching cells                                                             
fitting ..
adomingues commented 4 years ago

Quick update @kboufea:

I re-ran scid_multiclass with CPM normalized counts, and this time mclust worked without an issue:

===========| 100%
Done!Stage 3.3: Resolve multiclass assignments

class(scID_output_zhang)
[1] "list"
 head(scID_output_zhang$labels)
                    GW08_PFC1_sc1                     GW08_PFC1_sc2                     GW08_PFC1_sc3                     GW08_PFC1_sc4 
                   "Astrocytes-2"                    "Astrocytes-2"     "Excitatory_neurons_WEEK-GW8"                    "Astrocytes-2" 
                    GW08_PFC1_sc5                     GW08_PFC1_sc6 
"Excitatory_neurons_stage-GW8-12" "Excitatory_neurons_stage-GW8-12" 

This is how I created the CPM matrix:

## use CPM
seo_zhong <- NormalizeData(
  seo_zhong,
  normalization.method = "RC",
  scale.factor = 1e6
)

cpm <- GetAssayData(seo_zhong) %>%
  as.data.frame()
head(cpm)[, 1:5]

In case you are still interested I have re-uploaded the data, this time including the CPM matrix: https://owncloud.gwdg.de/index.php/s/UmQlNyvCzkb2QSr

As for the cause of the issue, basically my mistake because I was using raw counts in the target_gem? That said I have used raw counts for other test data and did not have an issue with scid_multiclass. Could this somehow be related to this comment?

Cheers, António

kboufea commented 4 years ago

Hi António,

Sorry, I didn't have time to look into it yet, but I'm glad you resolved it. target_gem should be indeed normalised. I have included a function counts_to_cpm in this package that is a simple wrapper around scater CPM normalisation function so you can get the normalised target data in a single line of code cpm <- counts_to_cpm(human_zhong_raw)

Lack of variance in gene expression could cause issues with mclust... I saw one of the signatures consists of a single gene and that could be a problem with raw counts... I will look into it to try catch the error for the future though!

Cheers, Katerina