Error in hclust(parallelDist::parDist(t(CNA_mtx), threads = par_cores, : size cannot be NA nor exceed 65536

Ilarius commented 1 year ago

Hello, if I try to run this in parallel on a cluster with Slurm, I get a "caught bus error", even if I allocate enough memory.

I tried with just one core, but I get the following error:

results <- SCEVAN::multiSampleComparisonClonalCN(listCountMtx, analysisName = "ovarian", organism = "human", par_cores = 1, plotTree = TRUE)

[1] " raw data - genes: 36601 cells: 71634"
[1] "1) Filter: cells > 200 genes"
[1] "low data quality"
[1] "2) Filter: genes > 5% of cells"
[1] "8286 genes past filtering"
[1] "3) Annotations gene coordinates"

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Loading required package: doParallel
Loading required package: foreach
Loading required package: iterators
Loading required package: parallel
[1] "found 30 confident non malignant cells"
[1] "7537 genes annotated"
[1] "4) Filter: genes involved in the cell cycle"
[1] "7123 genes past filtering "
[1] "5)  Filter: cells > 5genes per chromosome "
[1] "6) Log Freeman Turkey transformation"
[1] "A total of 67300 cells, 7123 genes after preprocessing"
[1] "7) Measuring baselines (confident normal cells)"
[1] "8) Smoothing data"
[1] "9) Segmentation (VegaMC)"
[1] "10) Adjust baseline"
Error in hclust(parallelDist::parDist(t(CNA_mtx), threads = par_cores,  : 
  size cannot be NA nor exceed 65536
Calls: <Anonymous> ... lapply -> FUN -> pipelineCNA -> classifyTumorCells -> hclust
Execution halted

Any clues?
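
For context: the message "size cannot be NA nor exceed 65536" is raised by base R's `stats::hclust`, which rejects dissimilarity objects with more than 65536 observations. The log above reports 67300 cells after preprocessing, which is over that limit. A minimal sketch of a pre-check follows; `check_hclust_size` and the toy matrices are hypothetical and not part of SCEVAN.

```r
# Hypothetical helper (not part of SCEVAN): flag samples whose number of
# cells (columns) is larger than the size limit enforced by stats::hclust.
hclust_max_size <- 65536L

check_hclust_size <- function(list_count_mtx, limit = hclust_max_size) {
  n_cells <- vapply(list_count_mtx, ncol, integer(1))
  data.frame(
    sample = names(list_count_mtx),
    n_cells = n_cells,
    too_large = n_cells > limit,
    row.names = NULL
  )
}

# Toy stand-ins for real count matrices (genes x cells).
toy_list <- list(
  sample1 = matrix(0L, nrow = 2, ncol = 67300),
  sample2 = matrix(0L, nrow = 2, ncol = 10000)
)
size_report <- check_hclust_size(toy_list)
print(size_report)
```

This counts cells in the input matrices, whereas the clustering step sees only the cells that survive preprocessing, so treat the check as a rough early warning rather than an exact predictor.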

AntonioDeFalco commented 1 year ago

Hi @Ilarius, help me understand what kind of data this happens with. I see that you are using the multi-sample analysis, but one of the samples in your listCountMtx starts with 71634 cells. How come?

Ilarius commented 1 year ago

It's ovarian cancer: the first sample has 71634 initial cells and the second one 73644. That's because I only load a matrix of cells with at least 200 features; otherwise I would have to allocate a 1 TB vector in R!
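
A minimal base R sketch of the kind of cutoff described here, keeping only cells with at least 200 detected features. The helper name `filter_cells_by_genes` is hypothetical, and counting a feature as "detected" when its count is above zero is an assumption.

```r
# Hypothetical helper: keep cells (columns) in which at least `min_genes`
# genes (rows) have a non-zero count.
filter_cells_by_genes <- function(count_mtx, min_genes = 200) {
  genes_per_cell <- colSums(count_mtx > 0)
  count_mtx[, genes_per_cell >= min_genes, drop = FALSE]
}

# Toy matrix: 5 genes x 4 cells, with 1, 3, 4 and 0 detected genes per cell.
toy <- matrix(
  c(1, 0, 0, 0, 0,
    1, 1, 1, 0, 0,
    2, 3, 1, 1, 0,
    0, 0, 0, 0, 0),
  nrow = 5,
  dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4))
)

# A threshold of 3 is used so the toy data shows the effect.
filtered <- filter_cells_by_genes(toy, min_genes = 3)
print(colnames(filtered))
```

For matrices of this thread's size a sparse representation would be needed in practice; the dense toy matrix only illustrates the logic.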

In the end I used the final filtered matrices (roughly 10k cells each) and the same code worked. The cells that I expected to be tumoral (based on some markers) are enriched among the cells your algorithm classifies as "tumoral". However, a significant proportion of the blood cells is also classified as tumoral, even though they are a minority of the cells in the experiment and should not be aneuploid, and this makes the results less reliable. Do you think using the filtered matrix could have caused this problem? How important is it to start with the unfiltered matrix?

AntonioDeFalco commented 1 year ago

I believe that using the filtered matrix is the correct procedure. To check for incorrectly classified cells, you can view the heatmap to see whether the separation was done correctly. Some errors can be caused by cells with a noisier signal. You can improve the final result by passing SCEVAN more cells that you are confident are normal, using the norm_cells parameter.

Regards

Ilarius commented 1 year ago

I did not use norm_cells because the documentation says: "norm_cells : Vector of normal cells if the classification is already known and you are only interested in the clonal structure (optional)".

Since it is a solid tumor, I know that the tumoral cells are in the epithelial cluster and not in the blood cell clusters.

PS. Is the code for the heatmaps and other visualizations shown in this vignette available somewhere?

http://htmlpreview.github.io/?https://github.com/AntonioDeFalco/SCEVAN/blob/main/vignettes/IntratumoralHeterogeneityInGlioblastoma.html

AntonioDeFalco commented 11 months ago

If there are cells in the count matrix that you are confident are normal, you can pass them as the norm_cells parameter. They will be used to create a reference and identify all diploid cells.

All the code is public; you can find it in this GitHub repository.

ahdee commented 8 months ago

@Ilarius just a random idea while reading through this: what about using your cell annotations (blood cells) as a source of "normal" cells? Maybe set a seed and randomly draw 2-3k cells? You mention that these cells are most likely not cancerous. Going even further, perhaps select only blood cells within a certain cell cycle phase and/or with low expression of genes particular to the cancer type you are looking at?
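
The seeded random draw suggested above could look roughly like this in base R. The annotation vector `cell_type`, the labels, and the helper `sample_reference_cells` are all hypothetical stand-ins for the user's own data.

```r
# Hypothetical annotation: one cell-type label per cell barcode.
cell_type <- c(rep("blood", 5000), rep("epithelial", 8000))
names(cell_type) <- paste0("cell", seq_along(cell_type))

# Hypothetical helper: draw up to `n` cell names with the given label.
sample_reference_cells <- function(cell_type, label, n, seed = 42) {
  candidates <- names(cell_type)[cell_type == label]
  set.seed(seed)  # fixed seed so the same cells are drawn on every run
  sample(candidates, size = min(n, length(candidates)))
}

norm_cells <- sample_reference_cells(cell_type, label = "blood", n = 2000)
print(length(norm_cells))
```

The resulting character vector of cell names is the kind of input the norm_cells parameter discussed in this thread expects; which SCEVAN functions accept it should be checked against the installed version.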

ahdee commented 8 months ago

> I believe that using the filtered matrix is the correct procedure. To check for incorrectly classified cells, you can view the heatmap to see whether the separation was done correctly. Some errors can be caused by cells with a noisier signal. You can improve the final result by passing SCEVAN more cells that you are confident are normal, using the norm_cells parameter.
>
> Regards

@AntonioDeFalco Hi, it doesn't look like the function multiSampleComparisonClonalCN has an option to pass norm_cell? I'm using version SCEVAN_1.0.1. Thanks!

AntonioDeFalco / SCEVAN

Error in hclust(parallelDist::parDist(t(CNA_mtx), threads = par_cores, : size cannot be NA nor exceed 65536 #69