Missing Data Files? - Githubissues

stevedough123 commented 4 years ago

I'm currently attempting to run RCTD on the cerebellum data given in the Robust decomposition of cell type mixtures in spatial transcriptomics. However, I can't find the dge.csv, cell_type_dict.csv, and meta_data.csv files that are needed for the single-cell reference section. In the data on the Single Cell portal, there an .RDS file which contains the information of the .csv files. How do I input this into RCTD?

If I try to use the terminal to run the pipeline, I get this error: homes-iMac:RCTD_base home$ Rscript R_scripts/callDoublets.R 1 Error: unexpected symbol in "Script started" Execution halted

dmcable commented 4 years ago

Hi Steve, all the data used to run RCTD is here https://singlecell.broadinstitute.org/single_cell/study/SCP948/robust-decomposition-of-cell-type-mixtures-in-spatial-transcriptomics. Actually, the file 'scRefSubsampled1000_cerebellum_singlenucleus.RDS' contains the preprocessed cerebellum reference as a Seurat object. This allows you to skip the step of creating the .csv files. If you are following the instructions here 'https://github.com/dmcable/RCTD,' then you can proceed to step 2 and make sure to set 'reffolder', and 'reffile' to reference this RDS file.

For the error that you describe, it is because you need to get steps 1-2 working before running the pipeline. I recommend following along with the ‘spatial-transcriptomics’ vignette to test if it is working correctly.

stevedough123 commented 4 years ago

Thanks. I also get this error when I'm running the code in R studio (image below). I'm assuming it's because 'reference' is suppose to contain the Seurat object from the preprocessing step. But since I'm skipping that step like you mentioned, the next part wouldn't recognize any Seurat object. Do you have any idea how to overcome this issue?

I'm also getting the following issues when I run the subsequent chunks of code in the Vignette:

Thanks for your help.

dmcable commented 4 years ago

I see, so the first thing I would recommend is to make sure that you can successfully run the spatial-transcriptomics Vignette, by following the instructions here github.com/dmcable/RCTD. This should run without errors in 5-10 minutes and should be instructive for how the pipeline works.

To fix your error of 'reference not found', you can load in the Seurat reference as: reference <- readRDS('location of reference file.RDS')

Next, if you also haven't yet created the slide-seq puck .RDS file, you can follow the instructions in the 'Slide-seq data' section of the spatial-transcriptomics Vignette.

Your goal should be to make sure the following line of code initializes RCTD successfully: iv <- init_RCTD(gene_list_reg = F, get_proportions = F, load_info=F) In order for this to work, you need to have the following files on your computer: 1) The single nucleus reference .RDS file 2) The slide-seq puck .RDS file. 3) The config.yml file, which contains SpatialRNAfolder, puckrds, reffolder, and reffile fields are updated to contain the file locations of (1) and (2).

After you are able to successfully run the init_RCTD function, you have the option of following the 'Running RCTD from Command Line' (github.com/dmcable/RCTD) to run the pipeline in a streamlined fashion. The first step would be platform effect normalization: Rscript R_scripts/fitBulk.R. My recommendation for the spatial-transcriptomics vignette dataset would be to run each line in the vignette to see how it works, but then for the cerebellum dataset, you could just run from command line after getting init_RCTD working.

The error you received for iv <- init_RCTD(load_info_renorm = T) is because you first need to successfully run the previous parts of the pipeline. In particular, the 'Platform Effect Normalization' step will create the renormalized cell type profiles. Likewise with the subsequent lines -- they depend on successful execution of the previous lines. So, your first step is to get the iv <- init_RCTD(gene_list_reg = F, get_proportions = F, load_info=F) line working.

stevedough123 commented 4 years ago

Thanks again for your help. I think I've almost got it running.

After successfully running ### Running Platform Effect Normalization, the code runs into an error at the following step:

For a more detailed description, here is what it says in the R studio console:

Also, what is the difference between the 1000cellsSubsampled_cerebellum_singlecell.RDS.zip data file and the scRefSubsampled1000_cerebellum_singlenucleus.RDS.zip data file?

dmcable commented 4 years ago

Ok, this is an issue with the parallelization, and unfortunately these kinds of errors can be tricky and opaque, but I am definitely interested in figuring out why this is occurring. Is this on the Vignette dataset or the Cerebellum dataset? I'm not exactly sure, but here are a few ideas:

1) Make sure the required packages are installed:

library(doParallel)
library(parallel)
library(foreach)

This should hopefully not return any errors.

2) Next, can you execute the following code without error:

cl <- parallel::makeCluster(4,outfile="")
doParallel::registerDoParallel(cl)
parallel::stopCluster(cl)

If so, then it is a problem with the foreach loop.

3) It could be a memory issue. How many GB of RAM does your computer have? You could try closing other memory intensive applications. You might be using a lot of memory in your R session. Can you please show the output of gc()?

4) Could you please provide the output of sessionInfo()?

To answer the second question, there were two different Cerebellum annotated references used in our study, a single-nucleus RNA-seq dataset (scRefSubsampled1000_cerebellum_singlenucleus.RDS.zip) and a single-cell RNA-seq dataset (1000cellsSubsampled_cerebellum_singlecell.RDS.zip). The single-nucleus dataset was used as training data for the Slide-seq Cerebellum, so I believe you should use that one. The single-cell dataset was used to generate a "simulated slide-seq" dataset in order to benchmark the performance of RCTD.

stevedough123 commented 4 years ago

Okay thanks, I'll see if those work.

I tried loading in the packages you sent, but it's still not working. After running the code you sent, I got this error:

The output of gc() and sessionInfo() are shown below:

Also, I have 8 gb of RAM on my iMac. Does using sc RNA-seq vs the single nucleus RNA-seq change the ability of RCTD to run properly or not?

On another note, I also tried running RCTD in my terminal, but I still got this error

dmcable commented 4 years ago

Great, for scRNA-seq vs snRNA-seq, RCTD was designed to work for either platform. In fact, for the hippocampus Slide-seq dataset, we tested it on single cell. For the cerebellum Slide-seq, we chose the single-nucleus dataset because we thought it was higher in quality, but there is no inherent reason that we did not use the single-cell for cerebellum.

I can't see the command that you used in terminal, but I believe it was Rscript R_scripts/chooseSigma.R. If so, then it seems like you successfully ran the part of RCTD that you were previously experiencing errors. I believe that you may be able to fix the issue by clearing your R environment and restarting Rstudio. Otherwise, since you were able to run it as a terminal command, that is perfectly fine.

Now that we have identified the error, after you restart RStudio, you can test for the error as:

library(parallel)
cl <- parallel::makeCluster(4,outfile="")

As for your error you are experiencing with Rscript R_scripts/callDoublets.R 1, I am not sure exactly the issue, but it should be relatively easy to fix. Do you have any more information? You could try opening up Rstudio and running the first few lines of callDoublets.R to see if it works.

stevedough123 commented 4 years ago

So just to clarify, if I run the code chunks in R Studio in the spatial-transcriptomic.Rmd file, it won't work, but running it in my terminal will work.

Also, you are correct, I did run the chooseSigma.R file before the callDoublets.R file. Strangely, this is what shows up when I open the file:

dmcable commented 4 years ago

Ok, sorry its not working in RStudio, but it's completely fine going forward to just run it in terminal, since the pipeline is fully encapsulated in the scripts.

Interesting, it looks like the callDoublets.R file somehow got overwritten. You can download it again from the RCTD github, or for convenience I have attached the file here (to be unzipped first): callDoublets.R.zip.

stevedough123 commented 4 years ago

Thanks. When I'm running the callDoublets file, do I run it 20 times, and increase the number from 1-20 each time I run it on a new line?

stevedough123 commented 4 years ago

Just to make sure, is there any issue with the way the script is running in my terminal right now?

dmcable commented 4 years ago

Hey Steve,

Yes that looks good, though I should point out that it says "number of cell types used: 11". Which scRNA-seq reference are you using? I believe the cerebellum snRNA-seq dataset has 19 cell types. You are correct about how to run the callDoublets.R script 20 times. It would be tedious to enter in each of the 20 jobs by hand. In the next version of RCTD, we have made it approximately 30 times faster, and as a result, we will remove this annoying step of splitting the data into multiple chunks. For now, you have the following options:

The splitting into multiple chunks is really only useful (and designed) for people who are using a computer cluster where they can run multiple jobs at once. Since you are running on a standard computer, you could avoid this entirely by setting the n_puck_folds variable to 1 (instead of 20 by default) in the config.yml file. However, if you change n_puck_folds, be warned that you do need to go back and run the scripts from the start.
You could write a (bash) script that will loop over 1 to 20 and run the callDoublets.R script for each value.

stevedough123 commented 4 years ago

Hi,

I have 11 cell types because I'm using the 1000cellsSubsampled_cerebellum_singlecell.RDS.zip file, not the single nucleus one.

dmcable commented 4 years ago

Got it, in that case everything checks out!

stevedough123 commented 4 years ago

Thanks. So after I'm done running callDoublets.R 20 times, I just run the following line and the output is downloaded onto my Mac?

Rscript R_scripts/gather_results.R

Also, what exactly is RCTD outputting? Is there any further analysis or scripts I have to run in order to get figures and tables afterward?

Thanks for your help!

stevedough123 commented 4 years ago

I just tried running on the larger single nucleus dataset, but I'm getting this issue:

I'm assuming this has to do with my RAM? Or am I incorrect?

dmcable commented 4 years ago

Yes, your R session ran out of memory, so you should clear all the objects in your R environment and reset your R session.

You are correct about running callDoublets.R at the end. It will aggregate all of the data into the results object, and then it will generate several plots (saved in the 'results' folder). If you would like to make different plots/tables than what it makes, you can select the results object and process and/or export it.

stevedough123 commented 4 years ago

Thanks.

Following up on one of my earlier messages, I wanted to see if there were any numerical data values associated with the results? Mainly, I wanted to see if there was a way to compare your results to the "ground truth" or test for certain factors such as accuracy and precision.

dmcable commented 4 years ago

Hello, that's a good question, and unfortunately ground truth cell type information cannot be collected for Slide-seq datasets. As a result, we develop several strategies in our paper to validate RCTD:

1) Validation of RCTD on "ground truth" simulated Slide-seq datasets of computational mixtures of known single cells. This is done cross platform (single-nucleus to single-cell) to incorporate the challenging existence of platform effects. Given a ground truth dataset, we can calculate any performance statistic we desire, and we demonstrate several in the paper (single cell type identification accuracy, doublet detection accuracy, doublet cell type identification accuracy, cell type proportion estimation).

2) Of course it is still important to validate Slide-seq data directly. We employ the following strategies in the paper: a. validation of spatial localization of cell types using prior biological knowledge of spatial organization in the cerebellum and hippocampus. b. comparing RCTD cell type predictions to localization of cell type marker genes or metagenes (i.e. summed expression of multiple marker genes) c. Making sure that "singlets" have only marker genes from a single cell type, whereas "doublets" have marker genes from two cell types. (We demonstrated this in the case of Bergmann / Purkinje calls). d. Making sure that doublets only occur between pairs of cell types that are known to co-localize spatially.

Of course, we are always open for better ideas on how to validate RCTD. Hope this helps!

stevedough123 commented 4 years ago

Sure, thanks for the information. For new datasets that I want to input, is there a way to obtain these validation results?

dmcable commented 4 years ago

Sure, I have written instructions for reproducing all of the validation / figures in the paper here https://github.com/dmcable/RCTD/tree/dev/AnalysisPaper. I recommend clicking the links under 'Generating Main Figures' to see the R code that was used to generate each figure. Hope this helps!

dmcable commented 4 years ago

Several validation plots are automatically output by the gather_results.R script. If you want to compare to specific marker genes, you can plot those individual genes and compare. If you have a more specific example, I'd be happy to help.

Future datasets you may have will work fine if they are put into the file format that RCTD accepts (counts matrix and coordinates matrix).

stevedough123 commented 4 years ago

Thanks. I'm trying to run a different dataset through your pipeline, but we created a psuedo-Visium dataset that we will use as the "ground truth." Also, where are the validation plots generated by gather_results.R ? All I see are the following figures:

dmcable commented 4 years ago

Cool - could you please explain what you mean by pseudo-Visium dataset?

Yes, those are all of the plots that are generated by default. Do you have a particular plot you are looking for?

stevedough123 commented 4 years ago

I'm mainly looking to represent the detection of spatially variable genes within specific cell types (imputation of genetic information). For example, I want to recreate Figure 6 (particularly Figure 6D) and possibly with different genes from my own dataset. In other words, I'm mostly looking to better understand and recreate the sections in your paper "RCTD enables detection of spatially variable genes within cell type" and "Expected cell type-specific gene expression", since these data are what I'm attempting to look at.

In this dataset that I'm using, however, I noticed that the files are different than the input required for this pipeline. Is it possible to transform the data, and if so, how do I go about inputing it into the pipeline?

Here are the files from my dataset with the different file names (these are the only 4 files):

For the above .csv file, do I need to get rid of the histology column?

I don't think the annotations are important, they are just an extra file.

Also, I'm wondering about the ## Single-Cell Reference section on the Vingenette. It asks for meta_data.csv, cell_type_dict.csv, and dge.csv or an RDS Seurat object with this data. So given the dataset I'm currently using, how would I run and get past these initial steps? Also, which of my files is for the MappedDGEForR.csvpart of the input?

Thanks.

stevedough123 commented 4 years ago

So I ran RCTD on a new dataset of mine, and I'm getting the following errors:

Is there something wrong with my data, or should I change parts of the code?

dmcable commented 4 years ago

Hi Steve, I encourage you to check out the new version of RCTD, which was just released yesterday! It is 30 times faster and significantly easier to use. The input procedure, however, has not changed.

I think your issues are because the data needs to be put in different format. You can find some guidelines about data format here: https://raw.githack.com/dmcable/RCTD/dev/vignettes/spatial-transcriptomics.html.

For the spatial transcriptomics data: Next, put the SpatialRNA data (in this case Slide-seq) in your ‘data/SpatialRNA’ directory (here ‘data/SpatialRNA/Vignette’). This needs to contain:

BeadLocationsForR.csv: a CSV file (with 3 columns, with headers “barcodes”, “xcoord”, and “ycoord”) containing the spatial locations of the pixels. --For this file, you are close (ST-lymphnodes-coordinates.txt), you just need to get rid of the histology column (in excel right click then delete column), and rename the column names.
MappedDGEForR.csv: a DGE (gene counts by barcodes) CSV file. Represents raw counts at each pixel. --You are also very close (ST_lympnodes_geneExpression.tsv). You should just be able to do file save as csv. All you need to do is change it from a tsv to csv file.

For the scRNA reference, here are the instructions. If you have a particular question on how to create one of the files, I can provide more specific help. A good approach for you would be to open the files in the ‘data/Reference/Vignette’ sample folder so you can see how they should be formatted.

Create a folder in ‘data/Reference’ e.g. ‘data/Reference/Vignette’ containing the following three files:

meta_data.csv: a CSV file (with 3 columns, with headers “barcode”, “cluster”, and “nUMI”) containing the numeric cluster assignment for each cell.
cell_type_dict.csv: a CSV file (with 2 columns, with headers “Cluster” and “Name”) containing the mapping between numeric cluster ID and cluster name. If you want a cluster to be filtered out of the single cell reference, you can leave the cluster name blank. The cell types must not contain the character ‘/’ or ‘-’.
dge.csv: a Digital Gene Expression (DGE) (barcodes by gene counts) CSV file in the standard 10x format. We use the dgeToSeurat function:

stevedough123 commented 4 years ago

Thanks for your help. I reformatted the files to match the 5 .csv files that you used for input for RCTD, as shown below.

MappedDGEForR:

BeadLocationsForR:

cell_type_dict

meta_data:

dge:

However, when I try to run RCTD on this pipeline, I still get the following errors:

Also, for the errors I stated in my previous message, that was after I correctly reformatted the data as you mentioned.

stevedough123 commented 4 years ago

When trying to read in the dge.csv, meta_data.csv, and cell_type_dict.csv files, I also get the following error:

Also, when I try to install the new version of RCTD, the R console doesn't give me a chance to give a number response to the following prompt:

And, I end up with the error:

stevedough123 commented 4 years ago

I noticed that your instructions to install the code on GitHub is not working properly (the devtools section of README). Could you help me look into this?

On Thu, Jul 16, 2020 at 11:29 AM Dylan Cable notifications@github.com wrote:

Hi Steve, I encourage you to check out the new version of RCTD, which was just released yesterday! It is 30 times faster and significantly easier to use. The input procedure, however, has not changed.

I think your issues are because the data needs to be put in different format. You can find some guidelines about data format here: https://raw.githack.com/dmcable/RCTD/dev/vignettes/spatial-transcriptomics.html .

For the spatial transcriptomics data: Next, put the SpatialRNA data (in this case Slide-seq) in your ‘data/SpatialRNA’ directory (here ‘data/SpatialRNA/Vignette’). This needs to contain:

BeadLocationsForR.csv: a CSV file (with 3 columns, with headers “barcodes”, “xcoord”, and “ycoord”) containing the spatial locations of the pixels. --For this file, you are close (ST-lymphnodes-coordinates.txt), you just need to get rid of the histology column (in excel right click then delete column), and rename the column names.

MappedDGEForR.csv: a DGE (gene counts by barcodes) CSV file. Represents raw counts at each pixel. --You are also very close (ST_lympnodes_geneExpression.tsv). You should just be able to do file save as csv. All you need to do is change it from a tsv to csv file.

For the scRNA reference, here are the instructions. If you have a particular question on how to create one of the files, I can provide more specific help. A good approach for you would be to open the files in the ‘data/Reference/Vignette’ sample folder so you can see how they should be formatted.

Create a folder in ‘data/Reference’ e.g. ‘data/Reference/Vignette’ containing the following three files:

meta_data.csv: a CSV file (with 3 columns, with headers “barcode”, “cluster”, and “nUMI”) containing the numeric cluster assignment for each cell.

cell_type_dict.csv: a CSV file (with 2 columns, with headers “Cluster” and “Name”) containing the mapping between numeric cluster ID and cluster name. If you want a cluster to be filtered out of the single cell reference, you can leave the cluster name blank. The cell types must not contain the character ‘/’ or ‘-’.

dge.csv: a Digital Gene Expression (DGE) (barcodes by gene counts) CSV file in the standard 10x format. We use the dgeToSeurat function:

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dmcable/RCTD/issues/2#issuecomment-659591744, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMUGA53KFGS2TMDTLQRFWADR35BJRANCNFSM4OCSFUBA .

dmcable commented 4 years ago

Sure, I see the error you are getting. This isn't directly related to RCTD since 'deldir' is not a required package for RCTD. When it says enter an empty line to skip updates, you may enter an empty line. If that does not work, I would research why your RStudio is not allowing you to skip updates. A separate option is to remove deldir or to first figure out how to properly install deldir by contacting the package maintainer or CRAN.

In MappedDGEForR.csv or BeadLocationsForR.csv you have duplicate rownames '1-Mar' and '2-Mar' (i.e. the first column) please check that you do not have any duplicates. In dge.csv, you have an incomplete last row, so I would examine the last row to make sure it is the same length. Also, once again it suggests that you have duplicates in your rownames. Also, many of these errors you can find more information by googling the error for the 'read_csv' function in R.

stevedough123 commented 4 years ago

Thanks for your quick response.

Here's what happens when I try to run the installation command line in terminal: [image: image.png]

And then when I try to install the new version of RCTD in Rstudio, I still get an error (also, if this is relevant, the R console doesn't give me a chance to give a number response to the following prompt): [image: image.png] [image: image.png]

Also, I'm not sure if you saw my previous message on GitHub, but it seems that I'm getting a similar error message in the preprocessing step when I attempt to run the older version of RCTD on my dataset. When trying to read in the dge.csv, meta_data.csv, and cell_type_dict.csv files, I get the following error: [image: image.png]

I would greatly appreciate your help in these issues.

Lastly, as an alternative, maybe I could send you the dataset and you could take a look at it or attempt to run RCTD on it? Sorry if this seems unexpected, but I've been working on this for quite some time and am trying to move along. However, I'm sure you're busy and may not have the time to do this, but if it's possible, we could communicate over email (you can reach me at sdou1@stanford.edu).

On Wed, Jul 22, 2020 at 3:25 AM Dylan Cable notifications@github.com wrote:

Sure, can you please post what happens when you try to execute the installation command?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dmcable/RCTD/issues/2#issuecomment-662374082, or unsubscribe https://github.com/notifications/unsubscribe-auth/AMUGA52HF4GXHKJVBBTYCRTR425CRANCNFSM4OCSFUBA .

stevedough123 commented 4 years ago

Thanks for your help, I didn't see your previous message, apologies for sending the same message.

stevedough123 commented 4 years ago

So I was able to get the RCTD_plots produced. Is it possible to get the assignment scores from the cell types? Specifically, is it possible to export the locations of each cell type in a csv or RDS file (I'm assuming these should be the input for the plots that are output from RCTD)? That way I can compare the results with other methods.

stevedough123 commented 4 years ago

Also, is there a way to increase the size of the beads/pixels and make it so that there is less white space in the plots? Could I change the coloring of the beads as well? Lastly, is there a way to rotate or flip the plots, which I want to do in order to compare RCTD to results from different methods?

lulizou commented 4 years ago

Hi Steve, The assignment scores can be found in the results object in results$results_df. The current plotting capabilities aren't very flexible, but all of the information needed to create the plots is in the myRCTD object, so the easiest way would be to look at building/customizing the plots you want using the ggplot2 package, which is what RCTD uses. Best, Luli

dmcable commented 4 years ago

Hi Steve,

Regarding your previous question, I unfortunately wouldn't have time to personally run RCTD on your dataset, but I am glad to see you got everything working!

Dylan

dmcable / spacexr

Missing Data Files? #2