KevinMenden / scaden

Deep Learning based cell composition analysis with Scaden.
https://scaden.readthedocs.io
MIT License

Using a prediction file with fewer genes than that in training data results in error #103

Closed: nagendraKU closed this issue 3 years ago

nagendraKU commented 3 years ago

I ran the scaden process step on a bulk RNA-seq dataset (named NewData) that has about 18,000 genes, and then ran predict on an older dataset that shares only about 15,000 genes with NewData. I got the following error:

KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. The following labels were missing: Index(['PANO1', 'BTNL3', 'SSSCA1', 'PNOC', 'USMG5',\n ...\n 'TMEM56-RWDD3', 'SIGLEC6', 'CCR6', 'VARS', 'CTAGE5'],\n dtype='object', length=703)

I fixed this by adding the missing genes to the old dataset, and setting zero counts for these genes across all samples. Now, I can get predict to run without errors, but I don't know if I should trust the results.

Would the proper way be to run the process-train-predict steps again for each dataset that needs to be predicted?
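
For anyone hitting the same error, here is a minimal sketch of the check and the zero-fill workaround described above, written in plain Python so it runs without extra dependencies. The gene names, the counts, and the helper names `find_missing_genes` and `zero_fill` are hypothetical and are not part of scaden. It only reproduces the workaround; it says nothing about whether the resulting predictions can be trusted.

```python
def find_missing_genes(expected_genes, bulk):
    """Return the expected genes that are absent from the bulk table."""
    return [gene for gene in expected_genes if gene not in bulk]


def zero_fill(expected_genes, bulk, n_samples):
    """Return a copy of bulk limited to expected_genes, in that order.

    Genes absent from bulk get zero counts for every sample.
    """
    return {
        gene: list(bulk.get(gene, [0.0] * n_samples))
        for gene in expected_genes
    }


# Hypothetical toy data: gene -> counts per sample.
expected_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
old_bulk = {
    "GENE_A": [5.0, 3.0],
    "GENE_B": [0.0, 7.0],
    "GENE_D": [12.0, 1.0],
}

missing = find_missing_genes(expected_genes, old_bulk)
padded = zero_fill(expected_genes, old_bulk, n_samples=2)

print(missing)           # ['GENE_C']
print(padded["GENE_C"])  # [0.0, 0.0]
```

Listing the missing genes first at least shows how much of the expected gene set is being replaced by zeros before predict is run.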

KevinMenden commented 3 years ago

Hi @nagendraKU,

Yes, you need to run process with each dataset, because it subsets the data to the same set of genes and then performs normalization. The command doesn't take long, though, so it shouldn't be a problem.

Best, Kevin
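
To illustrate why the gene subset has to be fixed before normalization, here is a rough sketch in plain Python. It assumes a log2(x + 1) transform followed by per-sample min-max scaling; that normalization, the gene names, and the helper names `shared_genes` and `normalize_sample` are assumptions for illustration, and scaden's actual implementation may differ.

```python
import math


def shared_genes(training_genes, bulk_genes):
    """Genes present in both lists, kept in training order."""
    bulk_set = set(bulk_genes)
    return [gene for gene in training_genes if gene in bulk_set]


def normalize_sample(counts):
    """Assumed normalization: log2(x + 1), then min-max scale to [0, 1]."""
    logged = [math.log2(c + 1.0) for c in counts]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        # Constant sample: avoid division by zero.
        return [0.0 for _ in logged]
    return [(v - lo) / (hi - lo) for v in logged]


# Hypothetical gene lists: the bulk file lacks GENE_C.
training_genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
bulk_genes = ["GENE_D", "GENE_A", "GENE_B"]
genes = shared_genes(training_genes, bulk_genes)

# One hypothetical sample with counts for every training gene.
sample = {"GENE_A": 3.0, "GENE_B": 0.0, "GENE_C": 255.0, "GENE_D": 15.0}

scaled_all = normalize_sample([sample[g] for g in training_genes])
scaled_shared = normalize_sample([sample[g] for g in genes])

print(genes)          # ['GENE_A', 'GENE_B', 'GENE_D']
print(scaled_all)     # [0.25, 0.0, 1.0, 0.5]
print(scaled_shared)  # [0.5, 0.0, 1.0]
```

In this toy case, dropping GENE_C changes the scaled value of GENE_A from 0.25 to 0.5, so under these assumptions a model trained on one gene set sees differently scaled inputs when predicting on another.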