Hi Gregor,
Sorry for the late response, I missed this earlier. It's strange that the problem goes away after raising the minimum cells threshold to 2000. We normally set the gene filter to 3 cells, which is quite low, but has never caused an issue before. Are you starting the nextflow pipeline from data you've already filtered and processed, or are you starting from raw data? The pipeline will perform filtering automatically prior to starting the SCENIC steps so I wonder if this is what's causing the problem.
The data is already filtered and processed the following way:
From what I understand from #20, all this shouldn't make a difference. Do you think I should run the pipeline from the raw data (i.e. unfiltered & untransformed)?
The filtering steps you've listed sound fine, but if you're using the full Nextflow pipeline in https://github.com/aertslab/SCENICprotocol, these filtering steps are already built in (except for the normalization), so I'm wondering if applying them twice is causing the issue. Could you try starting the pipeline from the raw data and letting the pipeline apply the filters? You would add these at the end of your run command as extra params:
nextflow run aertslab/SCENICprotocol [...] \
--thr_min_genes 700 \
--thr_min_cells 2000 \
--thr_n_genes 10000 \
--thr_pct_mito 0.11
Alternatively, you could try just running the grnboost step on the filtered loom that you created, and see if it completes successfully. The easiest way to do that is to cd into that work directory (/home/sturm/projects/2019/SCENICprotocol/work/10/7b67ff30f0b31284f32c5597d68dea) and run the command nextflow generated, substituting your loom file (looks like scenic/expr_mat.loom; you may have to copy it into that working directory or mount the proper path, though). So something like:
singularity run shub://aertslab/pySCENIC:0.9.18 pyscenic grn \
--num_workers 1 -o adj.tsv --method grnboost2 \
--cell_id_attribute CellID --gene_attribute Gene \
expr_mat.loom allTFs_hg38.txt
Thanks, @cflerin! I'll try and report back.
I now started with the grn step and found that the problem seems to be related to the loom file. If I use a csv file with the same data, it works perfectly. Maybe I am just not constructing it the right way?
I generate the loom file using scanpy from another anndata object, to get rid of the unnecessary metadata:
import scanpy as sc
import pandas as pd

adata_loom = sc.AnnData(
    ## X is in sparse format
    X=adata_expr.X,
    var=pd.DataFrame().assign(Gene=adata_expr.var.index.values).set_index("Gene", drop=False),
    obs=pd.DataFrame().assign(
        CellID=adata_expr.obs.index.values,
        nGene=adata_expr.obs["n_genes"].values,
        nUMI=adata_expr.obs["n_counts"].values,
    ).set_index("CellID", drop=False),
)
## doesn't work
adata_loom.write_loom("/home/sturm/projects/2019/SCENICprotocol/scenic/expr_mat.loom")
## works
adata_loom.to_df().to_csv("/home/sturm/projects/2019/SCENICprotocol/scenic/expr_mat.csv")
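As a sanity check that the two exports really contain the same matrix, something like this sketch (note that write_loom stores the matrix as genes x cells, while the csv is cells x genes):
import loompy as lp
import numpy as np
import pandas as pd

df = pd.read_csv("/home/sturm/projects/2019/SCENICprotocol/scenic/expr_mat.csv", index_col=0)
with lp.connect("/home/sturm/projects/2019/SCENICprotocol/scenic/expr_mat.loom") as ds:
    print(np.allclose(ds[:, :], df.values.T))  ## True if both files hold the same data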
This is pretty strange. I haven't used that loom export function directly, but it doesn't look like it's doing anything that would break the import step in pySCENIC. But typically, I would run:
import numpy as np
import loompy as lp

row_attrs = {
    "Gene": np.array(adata.var_names),
}
col_attrs = {
    "CellID": np.array(adata.obs.index),
}
lp.create("filtered.loom", adata.X.transpose(), row_attrs, col_attrs)
Another possibility is a version conflict in loompy. If you're using v3+ to create the loom, it might cause problems in pySCENIC 0.9.18, which uses loompy 2.0.17 (although I would have thought this would work). To test this, you could use pySCENIC 0.9.19, which has loompy 3. With Singularity, you'll have to pull from Docker Hub:
singularity build aertslab-pyscenic-0.9.19.sif docker://aertslab/pyscenic:0.9.19
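To check which spec version a given loom file was written with, a rough sketch (assuming loom 3 files keep their global attributes in an /attrs group, while older files use plain HDF5 attributes):
import h5py

with h5py.File("expr_mat.loom", "r") as f:
    if "attrs" in f:
        print(f["attrs/LOOM_SPEC_VERSION"][()])  ## written by loompy 3.x
    else:
        print(f.attrs.get("LOOM_SPEC_VERSION", "pre-3 loom file"))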
Upgrading did not do the trick, so it must be due to the way scanpy exports the loom file. I didn't have a chance to test creating the file with loompy directly; I think I'll just stick with the csv file for now.
I finally got a chance to test the AnnData.write_loom function. While it doesn't produce exactly the same loom file (by md5sum), everything else is equal, and running grnboost2 on each loom with the same seed produces the same results. So I'll close this for now, but feel free to reopen if there are further issues.
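(For anyone who wants to repeat the comparison, a minimal sketch; the two file names are placeholders:)
import loompy as lp
import numpy as np

with lp.connect("scanpy_export.loom") as a, lp.connect("loompy_direct.loom") as b:
    print(np.array_equal(a[:, :], b[:, :]))            ## same matrix?
    print(sorted(a.ra.keys()) == sorted(b.ra.keys()))  ## same row attributes?
    print(sorted(a.ca.keys()) == sorted(b.ca.keys()))  ## same column attributes?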
Hi @cflerin and @grst ,
I am having the exact same issue currently. I am using the image aertslab-pySCENIC-0.9.18.img on an HPC.
I am creating my loom object in R using Seurat's as.loom() function, with the current version of Seurat (Seurat_3.1.2).
Curiously, this only happens when my dataset is large (~64k cells): a 3k-cell subset of the same data, exported the same way, works fine with the same setup.
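(Roughly, the subset test in scanpy terms, assuming an AnnData object of the full data:)
import numpy as np

## take a random 3k-cell subset and export it the same way as the full data
idx = np.random.choice(adata.n_obs, 3000, replace=False)
adata[idx].copy().write_loom("subset_3k.loom")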
Any suggestions on how to work around this issue? Based on @grst's comments, it seems using a .csv file instead of a loom should work? It might be worth adding alternative input types to the tutorial, so that people who run into issues like this have a documented way to work around them.
I will try saving the loom object using Scanpy instead of Seurat and report back with the outcome. Please let me know if you have any other suggestions in the meantime.
Thanks!
Hello @cflerin,
I am also getting this read-only error. I've made the loom file following the tutorial, but I had a .mtx file rather than a 10x .mtx.gz, so I had to add the obs.index and var.index manually. Could this cause the read-only error? If yes, is there any other way to make a loom file from a sparse matrix? as.loom() doesn't work on sparse matrices…
Thanks!
Hi @acjmmartin ,
If you're using Scanpy, there is a way to import mtx files directly. See scanpy.read_mtx.
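For example (a minimal sketch; the path is a placeholder, and mtx files are often genes x cells, hence the transpose):
import scanpy as sc

adata = sc.read_mtx("matrix.mtx").T  ## transpose so cells are rows, genes are columns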
Then, you can create a barebones loom file using:
import numpy as np
import loompy as lp

row_attrs = {
    "Gene": np.array(adata.var.index),
}
col_attrs = {
    "CellID": np.array(adata.obs.index),
    "nGene": np.array(np.sum(adata.X.transpose() > 0, axis=0)).flatten(),
    "nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten(),
}
lp.create(f_loom_path_unfilt, adata.X.transpose(), row_attrs, col_attrs)
This should create a sparse loom if adata.X is also sparse.
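(A quick way to confirm:)
import scipy.sparse as sp

print(sp.issparse(adata.X))  ## True if the expression matrix is stored sparsely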
@cflerin Thank you for your rapid reply! That is exactly how I did it, except that scanpy.read_mtx() doesn't include the indices, so I had to add those manually, which is maybe where the read-only error came in, as the indices were automatically converted to tuples? Any way around this?
import scanpy as sc
import pandas as pd
import numpy as np
import loompy as lp

adata = sc.read_mtx(gene_expression_matrix)
adata = adata.T

genes_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/genes.tsv", sep="\t", header=None)
cells_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/cells.tsv", sep="\t", header=None)
cell_type_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/cell_types.tsv", sep="\t", header=None)

## .values gives 2-D arrays, so the assigned indices come out as tuples
genes_read = genes_read.values
cells_read = cells_read.values
cell_type_read = cell_type_read.values
adata.var.index = genes_read
adata.obs.index = cells_read

## flatten the tuple cell index back into plain values
bad_cell_index = adata.obs.index
out1 = ()
for a in bad_cell_index:
    out1 += a
adata.obs.index = out1

## same for the gene index
bad_gene_index = adata.var.index
out2 = ()
for a in bad_gene_index:
    out2 += a
adata.var.index = out2

adata.obs["Celltype"] = cell_type_read

row_attrs = {
    "Gene": np.array(adata.var.index),
}
col_attrs = {
    "CellID": np.array(adata.obs.index),
    "Cell_type": np.array(adata.obs["Celltype"].values),
    "nGene": np.array(np.sum(adata.X.transpose() > 0, axis=0)).flatten(),
    "nUMI": np.array(np.sum(adata.X.transpose(), axis=0)).flatten(),
}
lp.create(unfiltered_loom_matrix, adata.X.transpose(), row_attrs, col_attrs)
This is the code I used to make sure the indices weren't double tuples. After filtering following the steps in the tutorial, inputting the filtered loom into the GRN step gives me the read-only error.
Any way around this would be great! Thanks!!
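For reference, a sketch that would sidestep the tuple indices entirely, by taking the first column of each tsv as a plain string index (the column position is an assumption):
import pandas as pd

genes_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/genes.tsv", sep="\t", header=None)
cells_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/cells.tsv", sep="\t", header=None)
adata.var.index = genes_read[0].astype(str).values  ## 1-D array of strings, no tuples
adata.obs.index = cells_read[0].astype(str).values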
I have now made the loom file directly from my Seurat object using as.loom() and that worked :)
Hi @cflerin, I've kept getting this error even when using chmod 777 on my input loom, so I looked into it further. It turns out the problem doesn't lie with the input loom but with using dask.array with a distributed scheduler. According to the dask issue "Tackle ValueError: buffer source array is read-only" (#1978), the error only shows up when the arrays transit across nodes or to the disk cache, which the scheduler avoids if enough RAM and CPU are available (which isn't the case for large matrices). I have no idea how to fix this, so any advice would still be welcome, but I thought I'd share this realisation.
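For anyone stuck on the same thing, one workaround to try (a sketch, not a confirmed fix; the file names match the earlier examples) is to call GRNBoost2 from Python with a threads-only dask client, so the arrays never get serialized across process boundaries:
import pandas as pd
from distributed import Client, LocalCluster
from arboreto.algo import grnboost2

ex_matrix = pd.read_csv("expr_mat.csv", index_col=0)  ## cells x genes
tf_names = [line.strip() for line in open("allTFs_hg38.txt")]

## threads instead of processes: everything stays in one address space
local_cluster = LocalCluster(n_workers=1, threads_per_worker=4, processes=False)
client = Client(local_cluster)

adjacencies = grnboost2(ex_matrix, tf_names=tf_names, client_or_address=client)
adjacencies.to_csv("adj.tsv", sep="\t", index=False)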
This is a re-post of the issue originally posted at https://github.com/aertslab/scenic-nf/issues/3. I now switched to the newer version of the pipeline (SCENICprotocol) and the problem persists.
Unlike #45, this happens in the GRN step. Also, as I'm using the singularity image, the dask version is fixed to 1.0.0.
When I filter genes by a minimum number of cells with scanpy before running pySCENIC, the problem appears to go away:
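Something like scanpy's built-in gene filter (the exact call here is an assumption; 2000 matches the threshold mentioned earlier in the thread):
import scanpy as sc

sc.pp.filter_genes(adata, min_cells=2000)  ## keep genes detected in at least 2000 cells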
However, from what I understood from the documentation, this step shouldn't be necessary. Also, 2000 cells seems like quite a lot (there are smaller clusters than that for sure!). It still fails with < 1000 cells.