Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.

BUStools / BUS_notebooks_R

R vignettes for processing BUS format single-cell RNA-seq files

https://bustools.github.io/BUS_notebooks_R/

19 stars 9 forks

Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger. #3

Closed morganee261 closed 4 years ago

morganee261 commented 4 years ago

Hello,

Thanks for great tutorials, I am currently following your slingshot tutorial and I am running into an error when running the rand_forest line. Here is what I get:

model <- rand_forest(mtry = 200, trees = 1400, min_n = 15, mode = "regression") %>%
  set_engine("ranger", importance = "impurity", num.threads = 3) %>%
  fit(pseudotime ~ ., data = dat_train)

Error in parse.formula(formula, data, env = parent.frame()) :
  Error: Illegal column names in formula interface. Fix column names or use alternative interface in ranger.
Timing stopped at: 0.009 0 0.01

Could you please help with that?

thank you Morgane

lambdamoses commented 4 years ago

With an old version of SingleR, the notebook still runs. It needs to be updated for the Bioconductor version of SingleR. Can you post the column names of your dat_train?

morganee261 commented 4 years ago

thanks for your quick response.

here is my colnames for dat_train:

head(dat_train[1:10, 1:10])
            pseudotime       PAN3    SPATA22      TAOK1      MEG3       MGP       MMP1     CARTPT      CDHR4     S100A8
ctrl1_ips.1 0.08646903 -0.3691676 -0.4746009 -0.6379263 -1.597147 -1.014465 -0.1612485 -0.2450320 -0.2065135 -0.1057044
ctrl1_ips.3 0.00000000 -0.3075292 -0.4785825 -0.6960796 -1.608831 -1.022623 -0.1641750 -0.2478715 -0.2087436 -0.1077782
ctrl1_ips.4 0.94640118 -0.3788888 -0.4762593 -0.7251252 -1.602009 -1.017860 -0.1624653 -0.2462147 -0.2074416 -0.1065675
ctrl1_ips.6 0.00000000 -0.3638948 -0.4734457 -0.6897712 -1.593765 -1.012103 -0.1604027 -0.2442082 -0.2058676 -0.1051038
ctrl1_ips.7 0.00000000 -0.3663764 -0.4756155 -0.7253021 -1.600121 -1.016541 -0.1619926 -0.2457556 -0.2070812 -0.1062324
ctrl1_ips.8 0.00000000 -0.3596309 -0.4731918 -0.7175870 -1.593022 -1.011584 -0.1602170 -0.2440271 -0.2057257 -0.1049718

do you suggest I downgrade my SingleR?

lambdamoses commented 4 years ago

No, this is not related to SingleR, which was used to annotate cell types based on a reference. The first 10 entries do look fine. This error was probably caused by gene symbols that start with a number or contain "-", either of which makes them illegal variable names in R. Converting gene symbols to Ensembl gene IDs should solve this problem, since Ensembl gene IDs are legal R variable names.
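One way to find the offending columns is to compare each name against what base R's make.names would turn it into; any name that changes is not a legal variable name. This is only a sketch: the vector below is a made-up stand-in for colnames(dat_train), and it does not reproduce ranger's own internal check.

```r
# Toy stand-in for colnames(dat_train); these symbols are illustrative only.
col_names <- c("pseudotime", "PAN3", "MT-CO1", "7SK", "HLA-A", "S100A8")

# make.names() returns a syntactically valid version of each name,
# so any name it changes is not a legal R variable name.
is_legal <- make.names(col_names) == col_names
illegal_names <- col_names[!is_legal]

print(illegal_names)
```

On the real data, replacing col_names with colnames(dat_train) should list the gene symbols that are likely to trip the formula interface.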

morganee261 commented 4 years ago

Could you please let me know how I should do that ?

thanks again for your help!

lambdamoses commented 4 years ago

You can get Ensembl gene IDs and their corresponding gene symbols with biomaRt: https://bioconductor.org/packages/release/bioc/html/biomaRt.html If you have not used biomaRt before, it can be a bit intimidating. You can also use one of the tr2g functions in BUSpaRse to get Ensembl gene IDs and their corresponding gene symbols as well, though you will also get the transcript IDs. You'll see the code chunk calling tr2g_ensembl earlier in this slingshot tutorial.

Once you have a data frame with a column for Ensembl gene IDs and another column for gene symbols, say the data frame is called df, then you can convert gene symbols to Ensembl gene IDs with

colnames(mat) <- df$gene_id[match(colnames(mat), df$gene_symbol)]
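One caveat with the match approach: any gene symbol that is missing from df gets NA as its new column name. Here is a minimal sketch of the same renaming with a fallback for unmatched symbols. The matrix, the symbols, and the IDs are hypothetical placeholders (not real Ensembl IDs), and falling back to make.names is an assumption on my part, not something the tutorial does.

```r
# Hypothetical toy inputs: `mat` has gene symbols as column names, and `df`
# maps gene symbols to gene IDs (same column names as in the line above).
mat <- matrix(0, nrow = 2, ncol = 3,
              dimnames = list(c("cell1", "cell2"),
                              c("PAN3", "MT-CO1", "NOT_A_GENE")))
df <- data.frame(
  gene_id = c("ENSG_A", "ENSG_B"),  # placeholders, not real Ensembl IDs
  gene_symbol = c("PAN3", "MT-CO1"),
  stringsAsFactors = FALSE
)

# match() gives, for each column of mat, the row of df holding that symbol
# (NA if the symbol is absent from df).
idx <- match(colnames(mat), df$gene_symbol)
new_names <- df$gene_id[idx]

# Fall back to a legalized version of the original symbol where no ID was
# found, so that no column ends up named NA.
unmatched <- is.na(new_names)
new_names[unmatched] <- make.names(colnames(mat)[unmatched])

colnames(mat) <- new_names
print(colnames(mat))
```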

lambdamoses commented 4 years ago

Also note that since Ensembl is moving their servers, the archives will not be available until April 16, though the current version (99) is available. However, you can still access the older versions of Ensembl from Bioconductor, via AnnotationHub. See the RNA velocity notebook in this repo for an example.

lambdamoses commented 4 years ago

If you don't want to convert the gene symbols into Ensembl IDs, there's another workaround: use make.names (in base R) to make all the column names legal.
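A minimal sketch of the make.names route, using a toy data frame in place of dat_train. Keeping a lookup from the new names back to the original gene symbols is my own addition, in case you want to report results (e.g. variable importance) with the original symbols.

```r
# Toy data frame standing in for dat_train; check.names = FALSE keeps the
# illegal names so the fix can be demonstrated.
dat_train <- data.frame(
  pseudotime = c(0.1, 0.5),
  `MT-CO1` = c(1.2, -0.3),
  `7SK` = c(0.4, 0.9),
  check.names = FALSE
)

original_names <- colnames(dat_train)

# unique = TRUE appends suffixes if two symbols collapse to the same legal name.
colnames(dat_train) <- make.names(original_names, unique = TRUE)

# Lookup table to translate the legal names back to gene symbols later.
name_map <- setNames(original_names, colnames(dat_train))

print(colnames(dat_train))
```

Note that make.names replaces "-" with "." and prefixes names that start with a digit with "X", so "MT-CO1" becomes "MT.CO1" and "7SK" becomes "X7SK".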

morganee261 commented 4 years ago

I tried using make.names and it seems to be working now. do you know how long this step usually takes ?

thanks

lambdamoses commented 4 years ago

It depends on how many genes you are using, how many cells there are in your dataset, how many cores you use, and the other parameters for ranger. It took about a minute or so (didn't formally time it with system.time but it didn't take too long) in the tutorial, with 3 cores.
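If you want an actual number for your own data, you can wrap the fitting call in system.time. The sketch below uses a small linear model on random data as a stand-in workload, so that it runs without ranger installed; in the tutorial you would put the rand_forest ... fit pipeline inside system.time instead.

```r
set.seed(1)

# Stand-in workload: a small linear model on random data (an assumption for
# illustration; substitute the model fitting call you actually want to time).
toy <- data.frame(y = rnorm(1000), x1 = rnorm(1000), x2 = rnorm(1000))

timing <- system.time(
  model <- lm(y ~ ., data = toy)
)

# "elapsed" is the wall-clock time in seconds.
elapsed_sec <- timing[["elapsed"]]
print(elapsed_sec)
```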