atarashansky / SAMap

SAMap: Mapping single-cell RNA sequencing datasets from evolutionarily distant organisms.
MIT License
66 stars 19 forks source link

"ValueError: zero-size array to reduction operation" and other comments #114

Closed avianalter closed 1 year ago

avianalter commented 1 year ago

ERROR:

Value Error Traceback (most recent call last) Cell In[21], line 1 ----> 1 sm.run(pairwise=True) 2 samap = sm.samap

File /usr/local/lib/python3.8/dist-packages/samap/mapping.py:298, in SAMAP.run(self, NUMITERS, NHS, crossK, N_GENE_CHUNKS, umap, ncpus, hom_edge_thr, hom_edge_mode, scale_edges_by_corr, neigh_from_keys, pairwise) 294 neigh_from_keys[sid] = False 296 start_time = time.time() --> 298 smap.run( 299 NUMITERS=NUMITERS, 300 NHS=NHS, 301 K=crossK, 302 NCLUSTERS=N_GENE_CHUNKS, 303 ncpus=ncpus, 304 THR=hom_edge_thr, 305 corr_mode=hom_edge_mode, 306 scale_edges_by_corr = scale_edges_by_corr, 307 neigh_from_keys=neigh_from_keys, 308 pairwise=pairwise 309 ) 310 samap = smap.final_sam 311 self.samap = samap

File /usr/local/lib/python3.8/dist-packages/samap/mapping.py:700, in _Samap_Iter.run(self, NUMITERS, NHS, K, corr_mode, NCLUSTERS, scale_edges_by_corr, THR, neigh_from_keys, pairwise, ncpus) 697 self.GNNMS_corr.append(gnnmu) 698 self.gnnmu = gnnmu --> 700 gnnm2 = _get_pairs(sams, gnnmu, gns_dict, NOPs1=0, NOPs2=0) 701 self.GNNMS_pruned.append(gnnm2) 703 sam4 = _mapper( 704 sams, 705 gnnm2, (...) 714 pairwise=pairwise 715 )

File /usr/local/lib/python3.8/dist-packages/samap/mapping.py:1376, in _get_pairs(sams, gnnm, gns_dict, NOPs1, NOPs2) 1374 def _get_pairs(sams, gnnm, gns_dict, NOPs1=0, NOPs2=0): 1375 # gnnm = filter_gnnm(gnnm) -> 1376 su = gnnm.max(1).A 1377 su[su == 0] = 1 1378 gnnm = gnnm.multiply(1 / su).tocsr()

File /usr/local/lib/python3.8/dist-packages/scipy/sparse/_data.py:324, in _minmax_mixin.max(self, axis, out) 294 def max(self, axis=None, out=None): 295 """ 296 Return the maximum of the matrix or maximum along an axis. 297 This takes all elements into account, not just the non-zero ones. (...) 322 323 """ --> 324 return self._min_or_max(axis, out, np.maximum)

File /usr/local/lib/python3.8/dist-packages/scipy/sparse/_data.py:214, in _minmax_mixin._min_or_max(self, axis, out, min_or_max) 211 axis += 2 213 if (axis == 0) or (axis == 1): --> 214 return self._min_or_max_axis(axis, min_or_max) 215 else: 216 raise ValueError("axis out of range")

File /usr/local/lib/python3.8/dist-packages/scipy/sparse/_data.py:166, in _minmax_mixin._min_or_max_axis(self, axis, min_or_max) 164 N = self.shape[axis] 165 if N == 0: --> 166 raise ValueError("zero-size array to reduction operation") 167 M = self.shape[1 - axis] 169 mat = self.tocsc() if axis == 0 else self.tocsr()

ValueError: zero-size array to reduction operation

SOLUTION:

Basically what is happening is a discrepancy between feature names in your transcriptome file and your annData objects. This happened for me when pulling NCBI RefSeq data and using that both as a reference genome for cellranger's counting software (RefSeq $organism_genomic.fna and $organism_genomic.gtf files) as well as in 'map_genes.sh' (RefSeq $organism_cds_from_genomic.fna files). cellranger's features will be named using the 'geneid' field, but the pairwise tblastx hits will use the first field from the FASTA file's subject header - in the case of NCBI RefSeq data, that's the 'local' field (i.e. 'lcl|$whatever').

This can be solved pretty easily by using gffread to generate a properly formatted transcriptome for input into 'map_genes.sh.' It will look something like

awk '$3 != "gene"' '$refSeqPrefix'_genomic.gtf > '$yourPrefix'_nogenes.gtf
gffread -w '$yourPrefix'_transcripts.fa -g '$refSeqPrefix'_genomic.fna --table @geneid '$YourPrefix'_nogenes.gtf
sed 's/N.*\t//' '$yourPrefix'_transcripts.fa > '$yourPrefix'_transcripts_names.fa
sed 's/\_/\-/g' '$yourPrefix'_transcripts_names.fa > '$yourPrefix'_transcripts_dash.fa

Note: the last sed command is used to convert underscore '_' characters to dashes '-' because I used seurat and sceasy R packages to generate annData '$whatever.h5ad' files, and that process automatically converts underscores to dashes.