Identifying thresholds for cancer cell detection with medoids clustering

RegnerM2015 commented 2 years ago

Hi @dlroden, @johnyaku, @sunnyzwu, and @gAleryani

In the methods, the following approach was taken to identify thresholds for calling cancer cells:

Cells were plotted with respect to both their genomic instability and correlation scores. Partitioning around medoids clustering was performed using the pamk function in the R package cluster v.2.0.7-1 to choose the optimum value for k (between 2 and 4) using silhouette scores and the pam function to apply the clustering. Thresholds defining normal and neoplastic cells were set at 2 cluster s.d. to the left and 1.5 s.d. below the first cancer cluster means. For tumors where partitioning around medoids could not define more than 1 cluster, the thresholds were set at 1 s.d. to the left and 1.25 s.d. below the cluster means.
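
To make the quoted procedure concrete, here is a minimal R sketch on simulated scores. It is not the authors' script. It replaces `pamk` (which is distributed in the `fpc` package) with a manual loop over `cluster::pam` that compares average silhouette widths. The following are all assumptions made for illustration: the simulated data, the scaling step, the `score_thresholds` helper, the axis assignment (instability on x, correlation on y), and the rule for picking the "first cancer cluster".

```r
library(cluster)  # provides pam(); a recommended package bundled with R

set.seed(1)
# Simulated per-cell scores (hypothetical values, not from the study)
scores <- rbind(
  cbind(rnorm(100, 0.01, 0.003), rnorm(100, 0.10, 0.05)),  # normal-like
  cbind(rnorm(100, 0.05, 0.008), rnorm(100, 0.60, 0.08))   # cancer-like
)
colnames(scores) <- c("instability", "correlation")

# Fit PAM for k = 2..4 and keep the k with the highest average silhouette width.
# Assumption: scores are scaled first so neither axis dominates the distance.
k_range <- 2:4
fits <- lapply(k_range, function(k) pam(scale(scores), k))
avg_sil <- sapply(fits, function(f) f$silinfo$avg.width)
best_k <- k_range[which.max(avg_sil)]
clustering <- fits[[which.max(avg_sil)]]$clustering

# Hypothetical helper: thresholds at x_mult s.d. left of the mean on x
# and y_mult s.d. below the mean on y.
score_thresholds <- function(m, x_mult, y_mult) {
  c(x = mean(m[, "instability"]) - x_mult * sd(m[, "instability"]),
    y = mean(m[, "correlation"]) - y_mult * sd(m[, "correlation"]))
}

# Assumption: ordering clusters by mean instability, the lowest is normal
# and the next one is the "first cancer cluster".
cluster_means <- tapply(scores[, "instability"], clustering, mean)
cancer_id <- as.integer(names(cluster_means)[order(cluster_means)][2])
cancer <- scores[clustering == cancer_id, , drop = FALSE]

# Multi-cluster case: 2 s.d. to the left, 1.5 s.d. below the cancer cluster means
multi_thr <- score_thresholds(cancer, x_mult = 2, y_mult = 1.5)
# Single-cluster case: 1 s.d. and 1.25 s.d., computed over all cells
# (shown here on the same data purely for illustration)
mono_thr <- score_thresholds(scores, x_mult = 1, y_mult = 1.25)

# Assumption: a cell is called neoplastic when it exceeds both thresholds
is_neoplastic <- scores[, "instability"] > multi_thr[["x"]] &
  scores[, "correlation"] > multi_thr[["y"]]
```

The quoted methods do not say how a tumor was judged to have only one cluster, so the sketch leaves that decision to the caller and simply shows both sets of multipliers.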

I was wondering what motivated your decision to apply partitioning to the scatter plot of genomic instability and correlation scores. The original inferCNV paper in glioblastoma used hard thresholds of 0.01 and 0.4, respectively, for calling cancer cells. In my experience, these hard thresholds work well for samples that contain a clear population of copy number high cells and a clear population of copy number low cells in the scatter plot. However, they do not work well for samples that contain only one large population, or 'blob', of cells.
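
For comparison, the hard-threshold approach can be sketched in a few lines of R. The cutoffs 0.01 and 0.4 are the ones mentioned above. The example cells, the column names, and the rule that a cell must exceed both cutoffs to be called cancer are assumptions made for illustration.

```r
# Hypothetical per-cell scores
cells <- data.frame(
  instability = c(0.004, 0.020, 0.015, 0.008),
  correlation = c(0.10, 0.65, 0.30, 0.55)
)

# Fixed cutoffs, identical for every sample
instability_cutoff <- 0.01
correlation_cutoff <- 0.4

# Assumption: a cell is called cancer only if it exceeds both cutoffs
cells$call <- ifelse(
  cells$instability > instability_cutoff &
    cells$correlation > correlation_cutoff,
  "cancer", "non-cancer"
)
```

Because the cutoffs never adapt to the data, a sample whose scores form a single shifted 'blob' can end up almost entirely on one side of them.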

I was wondering if your group made similar observations, and if this motivated your decision to apply partitioning via medoids clustering to infer robust thresholds? Moreover, did you find that this partitioning approach was an improvement over the hard threshold approach?

I would think so, based on code from your inferCNV scripts:

# distinguish samples with 1 and > 1 clusters:
mono_samples <- c("CID3586", "CID3921", "CID3941", "CID3948", "CID4067", 
  "CID4290A", "CID4461", "CID4495", "CID4515", "CID4523", "CID4535", 
  "CID44041", "CID44991", "CID45171", "CID4513", "CID4398")
multi_samples <- c("CID3963", "CID4066", "CID4463", "CID4465", 
  "CID4471", "CID4530N", "CID44971")

dlroden commented 2 years ago

Hi Matt, this is correct. Our breast cancer datasets tended to contain a gradient of genomic instability vs correlation scores, with some more closely resembling the 'blob' you describe in your data. The range of values also varied between datasets, so any hard threshold set based on some of these datasets did not work for others.

The medoid partitioning method we chose, however, worked quite well for all datasets, and the cells designated as neoplastic consistently showed gene expression of a few key markers indicating malignancy (or normal epithelial cells), which is why we used it for the paper. Hope that helps and answers your questions. Cheers

RegnerM2015 commented 2 years ago

Thank you for your explanation! I appreciate the help.

Swarbricklab-code / BrCa_cell_atlas

Identifying thresholds for cancer cell detection with medoids clustering #13