COMETSC running but not giving output

achamess commented 4 years ago

I'm excited to use this tool, but it's been a struggle to get it to work for me. I think the issue is my input files because I can run the example dataset.

I am working with a Seurat object. I've exported the markers, UMAP dimensions, and cluster calls as tab-separated text. I call Comet from the command line and it looks like it's running (it certainly is according to top). It goes for about 3 hours and an output folder is created, but the folder is empty. It doesn't throw any specific errors, but it does show the following at run time:

Creating discrete expression matrix...
Insufficient floating point precision for calculating or reporting the exact XL-mHG test statistic; the true value is too small. Using "0" instead.(The XL-mHG p-value will also be reported as "0".)
Insufficient floating point precision for calculating or reporting the exact XL-mHG test statistic; the true value is too small. Using "0" instead.(The XL-mHG p-value will also be reported as "0".)

I am not sure what the problem is. Is there a stderr or log file to see what's going on?

Also, relatedly, the docs would benefit greatly from a tutorial showing how to get the input files out from a Seurat object, since that is such a common procedure.

Here is my code to get the input files out from Seurat and to the command line.

library(Seurat)
library(here)

# Expression matrix (genes x cells); so is the Seurat object
matrix_cometsc <- GetAssayData(so)
write.table(as.matrix(matrix_cometsc), file=here("data", "COMETSC", "markers.txt"), row.names=TRUE, col.names=TRUE, sep = "\t", quote = FALSE)

# UMAP embeddings (one row per cell, no header)
umap_cometsc <- Embeddings(so, reduction = "umap")
write.table(umap_cometsc, file=here("data", "COMETSC", "vis.txt"), row.names=TRUE, col.names=FALSE, sep = "\t", quote = FALSE)

# Cluster IDs (one row per cell, no header)
cluster_cometsc <- noquote(as.matrix(Idents(so)))
write.table(cluster_cometsc, file=here("data", "COMETSC", "cluster.txt"), row.names=TRUE, col.names=FALSE, sep = "\t", quote = FALSE)

Part of the issue is with the markers (matrix) file: write.table leaves out the leading tab in the header row above the row names, so the header has one field fewer than the data rows. I had to add the tab manually like this:

sed '1s/.*/\t&/' markers.txt > markers2.txt

Also, my command to Comet is the following:

#! /bin/bash
source ~/comet/bin/activate
Comet markers2.txt vis.txt cluster.txt -C 16 -K 4 -Count true output/

And for some reference, here is a sample of markers2.txt with the tabs indicated by ^I

^ID1_TTCAGGATCAAGCCAT^ID1_GTGGAGATCTGCTTAT^ID1_GCACGGTCACTCAGAT^ID1_TATACCTGTCTTACTT
MIR1302-2HG^I0^I0.0766241526725224^I0^I0
FAM138A^I0^I0^I0^I0
OR4F5^I0^I0^I0^I0
AL627309.1^I0.103146952196364^I0.0766241526725224^I0.0823802232731239^I0.0918193591402592
AL627309.3^I0^I0^I0^I0
AL627309.2^I0^I0^I0^I0
AL627309.4^I0^I0^I0^I0
AL732372.1^I0^I0^I0^I0

Cnrdelaney commented 4 years ago

Hi, based on what I see here, the problem is likely that the cells in your markers file aren't being lined up with the cluster file, especially if the tool runs all the way through with no errors other than the one you posted here (which is totally normal to see). Would you mind posting a small excerpt from your cluster file? I personally don't use Seurat (I use Scanpy instead), but it looks like your formatting is correct, so we will have to debug. Thanks for being patient!
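
One quick way to test the "cells not lining up" idea from the shell is to compare the cell names in the markers header against the first column of the cluster file. This is only a minimal sketch: it assumes both files are tab-separated, that the markers header starts with an empty field (as in markers2.txt), and that the cluster file has cell names in column 1. The function name `cells_mismatch` is a made-up helper, not part of Comet, and the check looks at the set of names only, not their order.

```shell
# Hypothetical helper (not part of Comet): list cell names that occur in
# only one of the two files. Empty output means the two sets match.
cells_mismatch() {
    markers="$1"
    clusters="$2"
    tmpdir=$(mktemp -d)
    # Cell names from the markers header row, one per line (drop the empty first field).
    head -n 1 "$markers" | tr '\t' '\n' | sed '/^$/d' | LC_ALL=C sort > "$tmpdir/markers_cells"
    # Cell names from column 1 of the cluster file.
    cut -f 1 "$clusters" | LC_ALL=C sort > "$tmpdir/cluster_cells"
    # Column 1: only in markers; column 2 (tab-indented): only in the cluster file.
    LC_ALL=C comm -3 "$tmpdir/markers_cells" "$tmpdir/cluster_cells"
    rm -rf "$tmpdir"
}
```

Usage would be something like `cells_mismatch markers2.txt cluster.txt`; any line printed is a cell name present in one file but missing from the other.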

achamess commented 4 years ago

Hi. Thanks for the quick reply and your help. Here are the files I'm using. I put the full count matrix file (markers2) there and also made a smaller version (marker_small), subsetted to the first thousand rows and columns.

https://www.dropbox.com/sh/1q5e7paoeypqwmg/AACwsbtnvc0Kc8nKIJWGQjGba?dl=0

Cnrdelaney commented 4 years ago

Hi,

Thanks for sending the files along. I only used the marker_small file as the expression matrix for my testing, but I was able to get results just fine with those files, so I am not exactly sure what the problem is. The first thing I would try is removing all of the extra options from your command (-K, -Count, -C) to see if one of those is causing the issue. Another thing I forgot to ask: which version of Python are you using? Comet is not compatible with 3.7, in case you are using that. Also, if you could let me know which version of Comet you have installed, that would be perfect.

achamess commented 4 years ago

Thanks! Glad it worked. I'm using Python 3.6.8 on an Ubuntu 18.04 system. The Comet version is 0.1.12. I'll try it again without the extra options. I added -C because it was taking forever to run without it. Thanks for your help. Will keep you posted.

Cnrdelaney commented 4 years ago

Yes, with 16 clusters I imagine things are taking a while. Maybe for now just getting the small marker file to work will make that quicker. We have a pull request currently under review by another member of the lab that should speed things up, so hopefully those changes will be implemented once this is working. @oshahid

achamess commented 4 years ago

Hi. I ran the marker_small file without any extra settings and it produced the expected output. So we know it can work. That only had 1000 cells. I'll try the full matrix now and see what happens.

achamess commented 4 years ago

Hi again. So I moved on to my actual data, which has about 17,000 cells. If I run it in standard mode (no extra cores), it barely moves for hours. So I downsampled to 5000. It's cranking along, but at this rate it might take 24 hours or more. I also tried -C 16 with downsampling, and it looks like it's doing stuff. It ran to the last cluster, but only gave output from the final clusters. Strange. Do you all run with just a single core or do you use the -C option?

Cnrdelaney commented 4 years ago

OK, this is a bit odd. It does sound like a potential bug in the cores option. However, I would note that the number of cells is not really the speed-limiting factor here; it is the number of genes you are considering. Especially if you set K to 4, it is looking at a ton of different marker panel combinations. When I run things I generally don't use the -C option; instead I change the gene list to check against. For instance, we have a list of surface-marker-only genes, pulled from a publication, that drastically reduces the number of genes in the combinatorics. I know sometimes we want to take a look at the whole landscape, but there are certainly a lot of genes that can be taken out without losing any interesting information. For starters, I would recommend switching to -K 2 and either using our default gene list or making your own gene list; I often try to keep it to around 1k genes. Do you know how many genes are in your large markers file by default?
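
For anyone wanting to check the gene count before launching a long run, a rough sketch from the shell is below. It assumes the markers2.txt layout shown earlier in the thread: tab-separated, a header row that starts with an empty field followed by cell names, then one gene per row. `matrix_dims` is a hypothetical helper name, not a Comet command.

```shell
# Hypothetical helper (not part of Comet): print "<genes> <cells>" for a
# tab-separated matrix whose header row starts with an empty field.
matrix_dims() {
    awk -F '\t' '
        NR == 1 { cells = NF - 1 }   # header fields minus the leading empty one
        END     { print NR - 1, cells }  # every non-header row is one gene
    ' "$1"
}
```

Running `matrix_dims markers2.txt` prints the gene count first and the cell count second, which gives a feel for how large the combinatorics will be.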

achamess commented 4 years ago

Thanks for the tips. My current commands are just the following:

Comet markers.txt vis.txt clusters.txt output/ -D 5000

I am running all genes though, which it sounds like is the problem.

I'll try with your default gene list.

If I were to make my own, what do you recommend? Using the top DEGs (maybe say top 100 from each cluster?)

Cnrdelaney commented 4 years ago

Top DE genes are definitely a reasonable choice, although it would be best to tailor the list to the questions you are trying to ask. Hopefully the tool will be able to pick up new gene combinations that would not have been easily found by just looking at the DE genes :) If you are looking for surface markers, for instance, you'd want to use the surface marker list. The specifier is -g file_name.txt, and if you use it you do not need the downsampling.

Here is our default mouse list: Default_genes.txt. www.cometsc.com also has the human version if needed!

achamess commented 4 years ago

OK. I'll do that. I'm working with neural single cell data, so surface markers are less important to me. My hope for COMET was that it would help me systematically find combinations of genes that would more specifically identify cell types. Single genes rarely are enough. But with 2 or 3 genes, one can really narrow down on a set that uniquely identifies a cluster. That is my hope.

I will try the top 100 DEGs for each cluster, which should get the gene list down to 1150, much reduced compared to the 25000+ it's trying now.
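
In case it helps anyone following along, here is a rough sketch of how the per-cluster DEG lists could be merged into one file for -g. It assumes each cluster's top genes were already exported to their own text file with one gene symbol per line. The file names and the `merge_gene_lists` function are made up for illustration, and the exact format -g expects isn't spelled out in this thread, so check it against the Comet docs or Default_genes.txt.

```shell
# Hypothetical helper (not part of Comet): merge several one-gene-per-line
# files into a single de-duplicated, sorted list on stdout.
merge_gene_lists() {
    # Concatenate all inputs, drop blank lines, then sort and remove duplicates.
    cat "$@" | sed '/^$/d' | LC_ALL=C sort -u
}
```

For example, `merge_gene_lists top100_cluster*.txt > my_genes.txt`, then pass `-g my_genes.txt` to Comet. Genes that are top DEGs in more than one cluster appear only once, which is why the merged list ends up shorter than 100 times the number of clusters.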

Thanks for your help. Will let you know how it goes.

Cnrdelaney commented 4 years ago

Great! This is exactly the use that we intended. Please feel free to continue replying to this thread with questions or concerns, happy to help.

MSingerLab / COMETSC
