aertslab / pySCENIC

pySCENIC is a lightning-fast python implementation of the SCENIC pipeline (Single-Cell rEgulatory Network Inference and Clustering) which enables biologists to infer transcription factors, gene regulatory networks and cell types from single-cell RNA-seq data.
http://scenic.aertslab.org
GNU General Public License v3.0
441 stars 182 forks source link

infer_data failed for target XXX #108

Closed grst closed 4 years ago

grst commented 4 years ago

This is a re-post of the issue originally posted at https://github.com/aertslab/scenic-nf/issues/3. I now switched to the newer version of the pipeline (SCENICprotocol) and the problem persists.

Unlike #45, this happens in the GRN step. Also, as I'm using the singularity image, the dask version is fixed to 1.0.0.

When I filter the genes by the minimal number of cells with scanpy before running pySCENIC, the problem appears to go away:

sc.pp.filter_genes(adata_loom, min_cells=2000)

However, what I understood from the documentation, this step shouldn't be necessary. Also 2000 seems quite a lot (there are smaller clusters than that for sure!). It still fails with < 1000 cells.

Dataset

Output

N E X T F L O W  ~  version 19.10.0
Launching `./main.nf` [romantic_goldwasser] - revision: 7332b38e91

***
Parameters in use:
loom_input=scenic/expr_mat.loom
loom_filtered=filtered.loom
loom_output=xxx_scenic_integrated.loom
thr_min_genes=200
thr_min_cells=3
thr_n_genes=5000
thr_pct_mito=0.25
outdir=results
threads=1
TFs=scenic/allTFs_hg38.txt
motifs=scenic/motifs-v9-nr.hgnc-m0.001-o0.0.tbl
db=scenic/*.feather
pyscenic_output=xxx_scenic.loom
grn=grnboost2
cell_id_attribute=CellID
gene_attribute=Gene
pyscenic_tag=0.9.18
pyscenic_container=shub://aertslab/pySCENIC:0.9.18
***

***
Using 2 feather databases:
  /home/sturm/projects/2019/SCENICprotocol/scenic/hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather
  /home/sturm/projects/2019/SCENICprotocol/scenic/hg38__refseq-r80__500bp_up_and_100bp_down_tss.mc9nr.feather
***
Error executing process > 'GRNinference'                                                                                                                                                                             

Caused by:                                                                                                                                                                                                           
  Process `GRNinference` terminated with an error exit status (1)                                                                                                                                                    

Command executed:                                                                                                                                                                                                    

  pyscenic grn         --num_workers 1         -o adj.tsv         --method grnboost2         --cell_id_attribute CellID         --gene_attribute Gene         filtered.loom         allTFs_hg38.txt                  

Command exit status:                                                                                                                                                                                                 
  1                                                                                                                                                                                                                  

Command output:                                                                                                                                                                                                      
  preparing dask client                                                                                                                                                                                              
  parsing input                                                                                                                                                                                                      
  creating dask graph                                                                                                                                                                                                
  1 partitions                                                                                                                                                                                                       
  computing dask graph                                                                                                                                                                                               
  not shutting down client, client was created externally                                                                                                                                                            
  finished                                                                                                                                                                      

Command error:                                                                                                                                                                                                       
          0.        ],                                                                                                                                                                                               
         [0.        , 0.        , 0.30562347, ..., 0.        , 0.        ,                                                                                                                                           
          0.        ]], dtype=float32), ['NOC2L', 'HES4', 'CPTP', 'VAMP3', 'ENO1', 'AGMAT', 'HP1BP3', 'ZBTB40', 'LUZP1', 'RUNX3', 'ZNF683', 'GMEB1', 'ZCCHC17', 'HDAC1', 'ZNF362', 'SFPQ', 'STK40', 'MTF1', 'RLF', 'S
MAP2', 'NFYC', 'SCM                                                                                                                                                                                                  
  kwargs:    {}                                                                                                                                                                                                      
  Exception: ValueError('Metadata mismatch found in `from_delayed`.\n\nExpected partition of type `DataFrame` but got `NoneType`')                                                                                   

  'infer_data failed for target SPPL3' Retry (1/10). Failure caused by ValueError("Regression for target gene SPPL3 failed. Cause ValueError('buffer source array is read-only').").                                 
  'infer_data failed for target SPPL3' Retry (2/10). Failure caused by ValueError("Regression for target gene SPPL3 failed. Cause ValueError('buffer source array is read-only').").                                 
  /usr/local/lib/python3.7/site-packages/arboreto/algo.py:214: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.                                                            
    expression_matrix = expression_data.as_matrix()                                                                                                                                                                  
  Traceback (most recent call last):                                                                                                                                                                                 
    File "/usr/local/bin/pyscenic", line 10, in <module>                                                                                                                                                             
      sys.exit(main())                                                                                                                                                                                               
    File "/usr/local/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 408, in main                                                                                                                        
      args.func(args)                                                                                                                                                                                                
    File "/usr/local/lib/python3.7/site-packages/pyscenic/cli/pyscenic.py", line 67, in find_adjacencies_command                                                                                                     
      network = method(expression_data=ex_mtx, tf_names=tf_names, verbose=True, client_or_address=client, seed=args.seed)                                                                                            
    File "/usr/local/lib/python3.7/site-packages/arboreto/algo.py", line 41, in grnboost2                                                                                                                            
      early_stop_window_length=early_stop_window_length, limit=limit, seed=seed, verbose=verbose)                                                                                                                    
    File "/usr/local/lib/python3.7/site-packages/arboreto/algo.py", line 135, in diy                                                                                                                                 
      .compute(graph, sync=True) \                                                                                                                                                                                   
    File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 2506, in compute                                                                                                                       
      result = self.gather(futures)                                                                                                                                                                                  
    File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1657, in gather                                                                                                                        
      asynchronous=asynchronous)                                                                                                                                                                                     
    File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 677, in sync                                                                                                                           
      return sync(self.loop, func, *args, **kwargs)                                                                                                                                                                  
    File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 321, in sync                                                                                                                            
      six.reraise(*error[0])                                                                                                                                                                                         
    File "/usr/local/lib/python3.7/site-packages/six.py", line 693, in reraise                                                                                                                                       
      raise value                                    
    File "/usr/local/lib/python3.7/site-packages/distributed/utils.py", line 306, in f                    
      result[0] = yield future                       
    File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 729, in run                        
      value = future.result()                        
    File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 736, in run                        
      yielded = self.gen.throw(*exc_info)  # type: ignore                                                 
    File "/usr/local/lib/python3.7/site-packages/distributed/client.py", line 1498, in _gather            
      traceback)                                     
    File "/usr/local/lib/python3.7/site-packages/six.py", line 692, in reraise                            
      raise value.with_traceback(tb)                 
    File "/usr/local/lib/python3.7/site-packages/dask/dataframe/utils.py", line 521, in check_meta        
      errmsg))                                       
  ValueError: Metadata mismatch found in `from_delayed`.                                                  

  Expected partition of type `DataFrame` but got `NoneType`                                               
  distributed.process - WARNING - reaping stray process <ForkServerProcess(ForkServerProcess-1, started daemon)>
  distributed.nanny - WARNING - Worker process 18227 was killed by unknown signal                         
  /usr/local/lib/python3.7/site-packages/dask/config.py:161: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full
 details.                                            
    data = yaml.load(f.read()) or {}                 

Work dir:                                            
  /home/sturm/projects/2019/SCENICprotocol/work/10/7b67ff30f0b31284f32c5597d68dea                         

Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`
cflerin commented 4 years ago

Hi Gregor,

Sorry for the late response, I missed this earlier. It's strange that the problem goes away after raising the minimum cells threshold to 2000. We normally set the gene filter to 3 cells, which is quite low, but has never caused an issue before. Are you starting the nextflow pipeline from data you've already filtered and processed, or are you starting from raw data? The pipeline will perform filtering automatically prior to starting the SCENIC steps so I wonder if this is what's causing the problem.

grst commented 4 years ago

The data is already filtered and processed the folloowing way

From what I understand from #20, all this shouldn't make a difference. Do you think I should run the pipeline from the raw data (i.e. unfiltered & untransformed)?

cflerin commented 4 years ago

The filtering steps you've listed sound fine, but if you're using the full Nextflow pipeline in https://github.com/aertslab/SCENICprotocol these filtering steps are already built-in (except for the normalization), so I'm wondering if applying them twice is causing the issue. Could you try starting the pipeline from the raw data, letting the pipeline apply the filters? You would add these the end of your run command as extra params:

nextflow run aertslab/SCENICprotocol [...] \
--thr_min_genes 700 \
--thr_min_cells 2000 \
--thr_n_genes 10000 \
--thr_pct_mito 0.11

Alternatively, you could try just running the grnboost step on the filtered loom that you created, and see if completes successfully. The easiest way to do that is to cd into that work directory /home/sturm/projects/2019/SCENICprotocol/work/10/7b67ff30f0b31284f32c5597d68dea, and run the command nextflow generated, substituting your loom file (looks like scenic/expr_mat.loom, you may have to copy it into that working directory or mount the proper path though). So something like:

singularity run shub://aertslab/pySCENIC:0.9.18 pyscenic grn \
--num_workers 1 -o adj.tsv --method grnboost2 \
--cell_id_attribute CellID --gene_attribute Gene \
expr_mat.loom allTFs_hg38.txt
grst commented 4 years ago

Thanks, @cflerin! I'll try and report back.

grst commented 4 years ago

I now started with the grn step and found that the problem seems to be related to the loom file. If I use a csv file with the same data, it works perfectly. Maybe I am just not constructing it the right way?

I generate the loom file using scanpy from an other anndata object to get rid of the unnecessary metadata:

adata_loom = sc.AnnData(
    ## X is in sparse format 
    X=adata_expr.X,
    var=pd.DataFrame().assign(Gene=adata_expr.var.index.values).set_index("Gene", drop=False),
    obs=pd.DataFrame().assign(
        CellID=adata_expr.obs.index.values,
        nGene=adata_expr.obs["n_genes"].values,
        nUMI=adata_expr.obs["n_counts"].values,
    ).set_index("CellID", drop=False),
)

## doesn't work
adata_loom.write_loom("/home/sturm/projects/2019/SCENICprotocol/scenic/expr_mat.loom")

## works
adata_loom.to_df().to_csv("/home/sturm/projects/2019/SCENICprotocol/scenic/expr_mat.csv")
cflerin commented 4 years ago

This is pretty strange. I haven't used that loom export function directly, but it doesn't look like it's doing anything that would break the import step in pySCENIC. But typically, I would run:

import loompy as lp
row_attrs = {
    "Gene": np.array(adata.var_names) ,
}
col_attrs = {
    "CellID": np.array(adata.obs.index),
}
lp.create("filtered.loom",adata.X.transpose(), row_attrs, col_attrs)

Maybe another possibility is a conflict in the loompy version. If you're using v3+ to create the loom, it might cause problems in pySCENIC 0.9.18, which uses 2.0.17. (Although I thought that this would have worked). To test this, you could use pySCENIC 0.9.19, which has loompy 3. With Singularity, you'll have to pull from DockerHub:

singularity build aertslab-pyscenic-0.9.19.sif docker://aertslab/pyscenic:0.9.19
grst commented 4 years ago

Upgrading did not do the trick, so it must be due to the way scanpy exports the loom file. I didn't have a chance to test creating the file with loompy directly, I think I'll just stick with the csv file for now.

cflerin commented 4 years ago

I finally got a chance to test the AnnData.write_loom function. While it doesn't produce exactly the same loom file (by md5sum), everything else is equal, and running grnboost2 on each loom with the same seed produces the same results. So I'll close this for now, but feel free to reopen if there are further issues.

FloWuenne commented 4 years ago

Hi @cflerin and @grst ,

I am having the exact same issue currently. I am using the image: aertslab-pySCENIC-0.9.18.img on an HPC.

I am creating my loom object in R using Seurat and the as.loom() function with the current version of Seurat (Seurat_3.1.2 ).

Curiously, this only happens when my data frame is large (~64k cells). A subset of the same data with only 3k cells exported works fine without any issue using the same setup.

Any suggestions on how to work around this issue? Based on @grst comments it seems using a .csv file instead of loom should work? It might be worth adding alternative input types to the tutorial in case people run into issues like this so they can try to work around them.

I will try saving the loom object using Scanpy instead of Seurat and report back with the outcome. Please let me know if you have any other suggestions in the meantime.

Thanks!

acjmmartin commented 4 years ago

Hello @cflerin,

I am also getting this read-only error. I’ve made the loom file following the tutorial but had a .mtx file rather than a 10x .mtx.gz so had to add the obs.index and var.index manually could this cause the read.only error? If yes, is there any other way to make a loom file from a sparse matrix? as.loom() doesn’t work on sparse matrices…
Thanks!

cflerin commented 4 years ago

Hi @acjmmartin ,

If you're using Scanpy, there is a way to import mtx files directly. See scanpy.read_mtx.

Then, you can create a barebones loom file using:

row_attrs = { 
    "Gene": np.array(adata.var.index) ,
}
col_attrs = { 
    "CellID":  np.array(adata.obs.index) ,
    "nGene": np.array( np.sum(adata.X.transpose()>0 , axis=0)).flatten() ,
    "nUMI": np.array( np.sum(adata.X.transpose() , axis=0)).flatten() ,
}

lp.create( f_loom_path_unfilt, adata.X.transpose(), row_attrs, col_attrs )

This should create a sparse loom if adata.X is also sparse.

acjmmartin commented 4 years ago

@cflerin Thank you for your rapid reply! That is exactly how I did it except that scanpy.read_mtx() removes the indices so I had to add those manually which is maybe where the read-only came in as the indices are automatically converted to tuples? Any way around this?

adata = sc.read_mtx(gene_expression_matrix)

adata = adata.T

genes_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/genes.tsv", sep="\t", header=None)
cells_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/cells.tsv", sep="\t", header=None)
cell_type_read = pd.read_csv("/g/scb/zaugg_p/amartin/scROSMAP/cell_types.tsv", sep="\t", header=None)
genes_read = genes_read.values
cells_read = cells_read.values
cell_type_read = cell_type_read.values
adata.var.index = genes_read
adata.obs.index = cells_read

bad_cell_index = adata.obs.index
out1=()
for a in bad_cell_index:
    out1 +=a

adata.obs.index = out1

bad_gene_index = adata.var.index
out2=()
for a in bad_gene_index:
    out2 +=a 

adata.var.index = out2 

adata.obs["Celltype"] = cell_type_read

row_attrs = { 
    "Gene": np.array(adata.var.index) ,
}
col_attrs = { 
    "CellID":  np.array(adata.obs.index) ,
    "Cell_type": np.array(adata.obs["Celltype"].values),
    "nGene": np.array( np.sum(adata.X.transpose()>0 , axis=0)).flatten() ,
    "nUMI": np.array( np.sum(adata.X.transpose() , axis=0)).flatten() ,
}

lp.create( unfiltered_loom_matrix , adata.X.transpose(), row_attrs, col_attrs )

This is the code I used to make sure the indices weren’t double tuples. After filtering following the steps in the tutorial inputting the filtered loom into the GRN step gives be the read-only error.

Any way around this would be great! Thanks!!

acjmmartin commented 4 years ago

I have now made the loom file directly from my seurat file using as.loom() and that worked :)

acjmmartin commented 4 years ago

Hi @cflerin I’ve kept getting this error even when using chmod 777 on my input loom and so looked more into it. Turns out the problem doesn’t lie with the input loom but with using dask.array with a distributed scheduler. According to "Tackle "ValueError: buffer source array is read-only" #1978” the error only shows up when the arrays transit across nodes or to the disk cache which is prevented by the scheduler if enough RAM and CPU power are available (which isn’t the case for large matrices). I have no idea how to fix this problem so any advice would still be welcome but thought I’d share this realisation.