bioinfo-biols / SEVtras

sEV-containing droplet identification in scRNA-seq data (SEVtras)
GNU Affero General Public License v3.0
16 stars 4 forks source link

issues about inport data in SEVtras #20

Open TATABOX99 opened 4 weeks ago

TATABOX99 commented 4 weeks ago

Dear Author,

I am using this script for analysis: import SEVtras SEVtras.sEV_recognizer(input_path='/home/yeziyang/Sc', sample_file='/home/yeziyang/Sc/Sc1_LN', out_path='/home/yeziyang/Sc/outputs', species='Homo')

My 10x_mtx formatted data is stored in this directory: /home/yeziyang/Sc/Sc1_LN/outs/raw_feature_bc_matrix/matrix.mtx.gz. I have ensured it is extracted from the raw_feature_bc_matrix. However, I encountered the following error: File "run_SEVtras.py", line 2, in SEVtras.sEV_recognizer(input_path='/home/yeziyang/Sc', sample_file='/home/yeziyang/Sc/Sc1_LN', out_path='/home/yeziyang/Sc/outputs', species='Homo') File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/main.py", line 79, in sEV_recognizer sample_log = get_sample(sample_file) File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/utils.py", line 18, in get_sample with open(sample_log, 'r') as f: IsADirectoryError: [Errno 21] Is a directory: '/home/yeziyang/Sc/Sc1_LN'

I am not sure why this error occurs. Could you please give me some advice? Thank you very much!

Additionally, when I used echo "Sc1_LN" > sample_file and ran the above code again, setting alpha to 0.08, it showed max() arg is an empty sequence. I am not sure if this is due to an error in my import process.

RuiqiaoHe commented 3 weeks ago

The parameter sample_file should be a file containing all samples, formatted as follows: sample1 sample2 ...... sampleN

In the case of your dataset, it would be: Sc1_LN Sc2_LN Sc3_LN ...... ScN_LN

For the "max() arg" error, please first check if your dataset was generated by scRNA-seq? SEVtras doesn't support single nucleus RNA-seq data. Then, you can try to lower the parameter alpha.

TATABOX99 commented 3 weeks ago

Thank you very much for your prompt response. The correct import command allowed the program to run successfully. Additionally, I saw in the guide that you suggest using multiple sample data to identify SEVs. Can I merge all my samples into one Seurat object, convert it to h5ad format, and then analyze it? If so, would clustering the cells in my object before using SEVtras affect the determination of the SEV sources?

TATABOX99 commented 3 weeks ago

By the way, I found that analyzing a single sample theoretically takes an entire day. Is this normal for this software? My RAM is 256G, with CPU core is 20.

RuiqiaoHe commented 3 weeks ago

Please refer to Question 9 in Troubleshooting. You can obtain the cell type information based on your own processing procedure, such as filtering or regressing. SEVtras only needs the cell type information, instead of the processed expression matrix. In my environment (naive Linux), SEVtras takes only a few minutes to finish a 10X single cell sample. You can try to speed it up by increasing the number of CPUs if the memory load is sufficient (predefine_threads).

TATABOX99 commented 3 weeks ago

Here I have an additional question. When perform data quality control, I use this code to remove poor-quality cells: test.seu <- subset(test.seu, subset = nFeature_RNA > 200 & nFeature_RNA < 6000 & percent.mt < 5). Will this step affect the identification of SEV (Single Extracellular Vesicles)?

RuiqiaoHe commented 3 weeks ago

The identification of sEVs is a separate step without the information of the cell matrix (SEVtras.sEV_recognizer). Cell matrix pre-processing only affects the calculation of ESAI (SEVtras.ESAI_calculator).

TATABOX99 commented 3 weeks ago

I have encountered some new difficulties. After completing the calculations in part 1, I obtained 15 "itera_gene.txt" and "raw_file.h5ad" files. Before running part 2, I noticed that you mentioned, "The first two parameters represent the path to sEV- and cell- anndata objects" and "With the output of SEVtras.sEV_recognizer in Part I sEVs recognizing and cell matrix with cell type, SEVtras can track each sEV to the original cell type and calculate the sEV secretion activity index (ESAI)."

However, the data I included in the first step is single-cell sequencing data directly from patients' original samples, without any clustering, meaning I do not have a cell matrix with cell type, and there is no "celltype" metadata. I would like to ask if this means I should first merge the raw_file.h5ad obtained from the first step into a Seurat object, process and cluster it, and then convert it back to an h5ad object to get the "test_cell.h5ad" file mentioned in "SEVtras.ESAI_calculator(adata_ev_path='./tests/sEV_SEVtras.h5ad', adata_cell_path='./tests/test_cell.h5ad', out_path='./outputs', Xraw=False, OBSsample='batch', OBScelltype='celltype')". But where does the "sEV_SEVtras.h5ad" file come from?

RuiqiaoHe commented 3 weeks ago

"sEV_SEVtras.h5ad" generated by SEVtras.sEV_recognizer, located in outputs directory.

TATABOX99 commented 2 weeks ago

Dear esteemed author,

In your guide, regarding the second part, you mentioned, "With the output of SEVtras.sEV_recognizer in Part I sEVs recognizing and cell matrix with cell type, SEVtras can track each sEV to original cell type and calculate sEV secretion activity index (ESAI)." Could you clarify whether the "cell matrix with cell type" referred to here should be constructed using the filtered feature bc matrix from single-cell data output or if it would be more appropriate to use the raw feature bc matrix to construct an object containing cell clustering information?

Thank you for your guidance.

RuiqiaoHe commented 2 weeks ago

Please refer to question 9 in Troubleshooting. A raw feature bc matrix with cell clustering information is preferred, regardless of how you generate the cell clustering information.

TATABOX99 commented 2 weeks ago

I am encountering an issue while running the SEVtras.ESAI_calculator function. I used a processed Seurat object which I converted to H5AD format as the tutorial in Troubleshooting. However, when I run the SEVtras.ESAI_calculator function, I encounter the following error:

/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behaviour. [AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas], /home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/scanpy/preprocessing/_pca.py:229: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behaviour. adata.obsm['X_pca'] = X_pca No matched cell sample for sEV deconvolution ... (repeated multiple times) ... Traceback (most recent call last): File "ESAI_calculator.py", line 2, in SEVtras.ESAI_calculator(adata_ev_path='/home/yeziyang/Sc/outputs/sEV_SEVtras.h5ad', adata_cell_path='/home/yeziyang/Sc/PTCALL_RAW.h5ad', out_path='/home/yeziyang/Sc/outputs/combined', Xraw=False, OBSsample='batch', OBScelltype='celltype') File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/main.py", line 188, in ESAI_calculator celltype_e_number, adata_evS, adata_com = deconvolver(adata_ev, adata_cell, species, OBSsample, OBScelltype, OBSev, OBSMpca, cellN, Xraw, normalW) File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/functional.py", line 131, in deconvolver near_neighbor_dat.index = adata_ev.obs.index File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/generic.py", line 5500, in setattr return object.setattr(self, name, value) File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.set File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/generic.py", line 766, in _set_axis self._mgr.set_axis(axis, labels) File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 216, in set_axis self._validate_set_axis(axis, new_labels) File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/internals/base.py", line 58, in _validate_set_axis f"Length mismatch: Expected axis has {old_len} elements, new " ValueError: Length mismatch: Expected axis has 0 elements, new values have 31098 elements

Since I am a beginner in Python, I cannot understand what went wrong with the converted data. Additionally, it doesn't seem to be a problem with the samples, as I have confirmed that both adata_ev and adata_cell use the same samples. Below is the code I used in Python to convert a Seurat object to H5AD format. I greatly need your help, thank you!

import scanpy as sc import anndata from scipy import io from scipy.sparse import csr_matrix import pandas as pd

directory = '/home/yeziyang/Sc/'

X = io.mmread(directory + 'matrix.mtx') adata = anndata.AnnData(X=X.transpose().tocsr())

metadata = pd.read_csv(directory + 'metadata.csv')

with open(directory + 'gene_names.csv', 'r') as f: gene_names = f.read().splitlines()

adata.obs = metadata adata.obs.index = adata.obs["barcode"] adata.var.index = gene_names

adata.obs['celltype'] = pd.Categorical(adata.obs['celltype'].astype(str))

adata.write_h5ad(directory + 'PTCALL_RAW.h5ad')

TATABOX99 commented 2 weeks ago

adata_ev 'obs' columns: Index(['n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'score', 'batch', 'sEV', 'score1'], dtype='object')

adata_cell 'obs' columns: Index(['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'barcode', 'percent.mt', 'patient', 'celltype', 'tissue_type', 'RNA_snn_res.1', 'seurat_clusters'], dtype='object') It looks like my adata_ev does not have a celltype obs. Is this normal? Did I overlook any important steps that caused the inconsistency between the two datasets?

RuiqiaoHe commented 2 weeks ago

Sorry for my fault. Please use following code: import scanpy as sc adata_cell = sc.read_h5ad(directory + 'PTCALL_RAW.h5ad') adata_cell.obs['batch'] = adata_cell.obs['orig.ident'] Since SEVtras gets sample information based on the OBSsample parameter, the default is batch.

TATABOX99 commented 2 weeks ago

Thank you very much for your suggestion. That part of the program is now running normally. However, an error occurred during the downstream analysis when converting to a Seurat object: Convert("SEVtras_combined.h5ad", dest = "h5seurat", overwrite = T) Warning: Unknown file type: h5ad Warning: 'assay' not set, setting to 'RNA' Creating h5Seurat file for version 3.1.5.9900 Adding X as data Adding raw/X as counts Adding meta.features from raw/var Adding X_pca as cell embeddings for pca Adding X_umap as cell embeddings for umap Adding PCs as feature loadings for pca Warning: Cannot find features for feature loadings, will not be able to load Adding miscellaneous information for pca Adding standard deviations for pca Adding miscellaneous information for umap Adding hvg to miscellaneous data Adding log1p to miscellaneous data

scdata <- LoadH5Seurat("SEVtras_combined.h5seurat", meta.data = T, misc = F) Validating h5Seurat file Initializing RNA with data Error in fixupDN.if.valid(value, x@Dim) : length of Dimnames[[1]] (21203) is not equal to Dim[1] (4542)

This seems to indicate an error in generating the data in h5ad format. I wonder if you have encountered this error before. I would greatly appreciate any further advice you could provide.

RuiqiaoHe commented 2 weeks ago

Please refer to question 13 at Troubleshooting html. The code would reliably convert Seurat object to h5ad format.

TATABOX99 commented 2 weeks ago

Dear Author,

Upon investigation, we found that the error originates from the version issue of the Seurat-disk we are using. It seems feasible to extract information from this h5ad format dataset directly in Python and then construct the Seurat object directly in R. However, we found that in the newly constructed Seurat object, the number of features and cells changed. Below is the information of the Seurat object we initially used:

testAB.integrated An object of class Seurat 31317 features across 161079 samples within 1 assay Active assay: RNA (31317 features, 5000 variable features) 3 layers present: counts, data, scale.data 4 dimensional reductions calculated: pca, harmony, umap, tsne

After processing with SEVtras, the number of features decreased to 4542, while the number of cells increased to 192177. We have two questions: first, is this a normal phenomenon that could occur during the computation process? Second, can we extract the EVs determination information in the metadata of the newly generated Seurat object and transfer it to the initial Seurat object, 'testAB.integrated', so that we can retain the EVs information while also preserving the feature information?

RuiqiaoHe commented 2 weeks ago

Since I don't know the full picture of your data, I can only answer with what I assume to be the case.

  1. The number of cells depends only on your input in parameter adata_cell of SEVtras.ESAI_calculator. So it won't increase. If you indicate the cell number in the output of SEVtras.sEV_recognizer (raw_SEVtras.h5ad), this is not the real cell in your analysis. I only control the total UMI in the file. As for the number of features, it can rise from different feature selection parameters between Seurat and Scanpy.
  2. It is ok to extract the sEV information in adata or Seurat object. You can just save them in the adata.obs and use the original data for other analysis.