It's preferred to create a new environment for scButterfly:
conda create -n scButterfly python==3.9
conda activate scButterfly
scButterfly is available on PyPI and can be installed with:
pip install scButterfly
Installation from GitHub is also supported:
git clone https://github.com/Biox-NKU/scButterfly
cd scButterfly
pip install scButterfly-0.0.9-py3-none-any.whl
This process takes approximately 5 to 10 minutes, depending on the user's hardware and internet connection.
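To confirm that the installation succeeded, a quick check along the following lines may help. This is a minimal sketch using only the Python standard library; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the name `scButterfly`, as in the pip command above.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "scButterfly" is the distribution name used in the pip command (an assumption here).
version = installed_version("scButterfly")
print(version if version else "scButterfly is not installed in this environment")
```

Run it inside the activated conda environment so that the check looks at the right site-packages.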
Using translation between scRNA-seq and scATAC-seq data as an example, scButterfly can be used in three steps: data preprocessing, model training, and prediction with evaluation. More details can be found in the scButterfly documentation.
First, create a scButterfly model:
from scButterfly.butterfly import Butterfly
butterfly = Butterfly()
Before data preprocessing, load the raw count matrices of the scRNA-seq and scATAC-seq data via `butterfly.load_data`:
butterfly.load_data(RNA_data, ATAC_data, train_id, test_id, validation_id)
Parameters | Description |
---|---|
RNA_data | AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to genes. |
ATAC_data | AnnData object of shape n_obs × n_vars. Rows correspond to cells and columns to peaks. |
train_id | A list of cell IDs for training. |
test_id | A list of cell IDs for testing. |
validation_id | An optional list of cell IDs for validation. If set to None, butterfly uses a default of 20% of the cells in train_id. |
An AnnData object is a Python container for single-cell data, provided by the Python package anndata. It integrates seamlessly with scanpy, a widely used Python library for single-cell data analysis.
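The ID lists can be produced in many ways. The sketch below is one option, using only the standard library. It assumes that cell IDs are integer row positions and that `RNA_data` and `ATAC_data` hold the same paired cells in the same order; neither assumption is confirmed here, so check the scButterfly documentation for the expected ID format. `split_cell_ids` is a hypothetical helper, not part of scButterfly.

```python
import random


def split_cell_ids(n_cells, test_fraction=0.2, seed=0):
    """Shuffle positions 0..n_cells-1 and split them into train and test lists."""
    ids = list(range(n_cells))
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    n_test = int(n_cells * test_fraction)
    test_id = sorted(ids[:n_test])
    train_id = sorted(ids[n_test:])
    return train_id, test_id


# Example with 1000 cells; in practice n_cells would be RNA_data.n_obs.
train_id, test_id = split_cell_ids(1000)

# Passing None for validation_id leaves the validation split to scButterfly:
# butterfly.load_data(RNA_data, ATAC_data, train_id, test_id, None)
```

Fixing the seed keeps the split stable across runs, which makes results easier to compare.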
For data preprocessing, use `butterfly.data_preprocessing`:
butterfly.data_preprocessing()
You can save the processed data or write the preprocessing log to a file using the following parameters.
Parameters | Description |
---|---|
save_data | optional, whether to save the processed data, default False. |
file_path | optional, the path for saving the processed data, only used if save_data is True, default None. |
logging_path | optional, the path for the preprocessing log; set to None to skip logging, default None. |
scButterfly also supports refining this process with other parameters (see the scButterfly documentation for details). However, we strongly recommend the default settings to get the best results from the model.
Before model training, you can choose whether to use a data augmentation strategy. With data augmentation, scButterfly generates synthetic samples using either cell-type labels (if `cell_type` is in `adata.obs`) or cluster labels obtained with the Leiden algorithm and MultiVI, a method for joint analysis of single-cell multi-omics data in the Python package scvi-tools.
scButterfly provides a data augmentation API:
butterfly.augmentation(aug_type)
Set the parameter `aug_type` to either `cell_type_augmentation` or `MultiVI_augmentation`. Augmentation increases training time but is expected to give better prediction results.
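If you prefer to make the choice explicit in your own script, a small helper like the one below can select the value from the columns available in `adata.obs`. This is a sketch only: `choose_aug_type` and its `use_augmentation` flag are hypothetical names, not part of scButterfly, and returning `None` is meant to correspond to running without augmentation.

```python
def choose_aug_type(obs_columns, use_augmentation=True):
    """Pick an aug_type value based on whether cell-type labels are available."""
    if not use_augmentation:
        return None  # no data augmentation
    if "cell_type" in obs_columns:
        return "cell_type_augmentation"  # labels present: augment by cell type
    return "MultiVI_augmentation"  # no labels: augment by MultiVI/Leiden clusters


# In practice obs_columns would be something like list(RNA_data.obs.columns).
aug_type = choose_aug_type(["cell_type", "batch"])
print(aug_type)
# butterfly.augmentation(aug_type)
```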
- With `cell_type_augmentation`, scButterfly-T (Type) looks for `cell_type` in `adata.obs`. If it is not found, it automatically falls back to `MultiVI_augmentation`.
- With `MultiVI_augmentation`, scButterfly-C (Cluster) trains a MultiVI model first.
- To skip data augmentation, set `aug_type = None`.

You can construct a scButterfly model as follows:
butterfly.construct_model(chrom_list)
scButterfly needs a list giving the number of peaks on each chromosome. Remember to sort the peaks by chromosome first.
Parameters | Description |
---|---|
chrom_list | a list giving the number of peaks on each chromosome; peaks must be sorted by chromosome. |
logging_path | optional, the path for the model structure log; set to None to skip logging, default None. |
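
One way to build `chrom_list` is to count consecutive peaks that share a chromosome label. The sketch below assumes peak names of the form `chr1:100-600`, which is an illustrative format rather than a requirement of scButterfly; adapt the parsing to however your ATAC data stores chromosome information. `peaks_per_chromosome` is a hypothetical helper.

```python
from itertools import groupby


def peaks_per_chromosome(chroms):
    """Count consecutive peaks per chromosome; input must be sorted by chromosome."""
    counts = []
    seen = set()
    for chrom, group in groupby(chroms):
        if chrom in seen:
            # The same chromosome appearing in two separate runs means unsorted input.
            raise ValueError(f"peaks for {chrom} are not contiguous; sort by chromosome first")
        seen.add(chrom)
        counts.append(sum(1 for _ in group))
    return counts


# Hypothetical peak names; in practice these might come from ATAC_data.var_names.
peak_names = ["chr1:100-600", "chr1:900-1400", "chr2:50-550", "chrX:10-510"]
chroms = [name.split(":")[0] for name in peak_names]
chrom_list = peaks_per_chromosome(chroms)
print(chrom_list)  # [2, 1, 1]
# butterfly.construct_model(chrom_list)
```

Raising an error on non-contiguous chromosomes catches unsorted peaks early, before they reach model construction.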
The scButterfly model can be trained as follows:
butterfly.train_model()
Parameters | Description |
---|---|
output_path | optional, path for model checkpoints; if None, './model' is used, default None. |
load_model | optional, path of a pretrained model to load; set to None to skip loading, default None. |
logging_path | optional, the path for the training log; set to None to skip logging, default None. |
scButterfly also supports refining the model structure and training process with other parameters for `butterfly.construct_model()` and `butterfly.train_model()` (see the scButterfly documentation for details).
scButterfly provides a prediction API. You can get the predicted profiles as follows:
A2R_predict, R2A_predict = butterfly.test_model()
A series of evaluation methods is also integrated into this function. You can enable them with the following parameters:
Parameters | Description |
---|---|
output_path | optional, path for model evaluation output; if None, './model' is used, default None. |
load_model | optional, whether to load a pretrained model, default False. |
model_path | optional, the path of the pretrained model, only used if load_model is True, default None. |
test_cluster | optional, whether to compute clustering evaluation metrics (AMI, ARI, HOM, NMI), default False. |
test_figure | optional, whether to draw the tSNE visualization of the predictions, default False. |
output_data | optional, whether to write the predictions to file; if True, they are written to output_path/A2R_predict.h5ad and output_path/R2A_predict.h5ad, default False. |
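
Putting the steps together, the whole workflow can be wrapped in one function. The sketch below only chains the calls shown in this README, with default parameters throughout; `run_scbutterfly` is a hypothetical wrapper, not part of scButterfly. It takes the `Butterfly` instance as an argument, so it does not import scButterfly itself.

```python
def run_scbutterfly(butterfly, RNA_data, ATAC_data, train_id, test_id, chrom_list,
                    aug_type=None, validation_id=None):
    """Call the documented Butterfly methods in order and return both predictions."""
    butterfly.load_data(RNA_data, ATAC_data, train_id, test_id, validation_id)
    butterfly.data_preprocessing()
    butterfly.augmentation(aug_type)  # aug_type=None means no augmentation
    butterfly.construct_model(chrom_list)
    butterfly.train_model()
    A2R_predict, R2A_predict = butterfly.test_model()
    return A2R_predict, R2A_predict


# Usage, assuming the inputs were prepared as described in the sections above:
# from scButterfly.butterfly import Butterfly
# A2R_predict, R2A_predict = run_scbutterfly(
#     Butterfly(), RNA_data, ATAC_data, train_id, test_id, chrom_list)
```

Keeping the steps in a single function fixes their order and makes it easy to rerun the pipeline with a different `aug_type`.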