Running Matilda with expression matrix

vanhoan310 commented 1 year ago

I would like to run Matilda (classification). However, I don't have the inputs in .h5 format. I only have a gene expression matrix (.csv), an ADT matrix (.csv), and labels. Is there an easy way to run Matilda with these inputs?

Thanks!

liuchunlei0430 commented 1 year ago

Thank you for your interest in Matilda. In R, you can read a '.csv' file and convert the matrix into a '.h5' format using the following function:

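library(rhdf5)      # h5createFile, h5createGroup, h5write, h5ls
library(HDF5Array)  # writeHDF5Array

# Write each matrix in exprs_list (rows = features, columns = cells)
# to the file at the same position in h5file_list.
write_h5 <- function(exprs_list, h5file_list) {
  # All matrices passed in one call must share the same row names (features).
  if (length(unique(lapply(exprs_list, rownames))) != 1) {
    stop("rownames of exprs_list are not identical.")
  }
  for (i in seq_along(exprs_list)) {
    if (file.exists(h5file_list[i])) {
      warning("h5file exists! will rewrite it.")
      file.remove(h5file_list[i])  # delete the old file (works on any operating system)
    }
    h5createFile(h5file_list[i])
    h5createGroup(h5file_list[i], "matrix")
    # Store the transposed matrix together with its feature and barcode names.
    writeHDF5Array(t((exprs_list[[i]])), h5file_list[i], name = "matrix/data")
    h5write(rownames(exprs_list[[i]]), h5file_list[i], name = "matrix/features")
    h5write(colnames(exprs_list[[i]]), h5file_list[i], name = "matrix/barcodes")
    print(h5ls(h5file_list[i]))
  }
}

# Example of loading a CSV first ("rna.csv" is a hypothetical file with features as rows):
# your_matrix <- as.matrix(read.csv("rna.csv", row.names = 1, check.names = FALSE))
# Call once per modality; for example, saved_path is "./rna.h5".
write_h5(exprs_list = list(data = your_matrix), h5file_list = c(saved_path))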

When saving your data into an '.h5' format, make sure to replace 'your_matrix' with your actual data and 'saved_path' with the desired file path where you want to save the data.

Hope this helps.
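Since RNA and ADT in CITE-seq are measured on the same cells, it may be worth confirming that both CSV files list the same barcodes in the same order before converting them. The following is a minimal Python sketch, not part of Matilda. It assumes each CSV has features as rows and barcodes in the header row, with the first header cell labelling the feature column (as R's write.csv produces); read_barcodes and check_same_cells are hypothetical helper names.

```python
import csv


def read_barcodes(csv_path):
    """Return the header entries after the first cell (assumed to be cell barcodes)."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))
    # Assumption: the first header cell labels the feature column, not a cell.
    return header[1:]


def check_same_cells(rna_csv, adt_csv):
    """Raise ValueError unless both files list the same barcodes in the same order."""
    rna_cells = read_barcodes(rna_csv)
    adt_cells = read_barcodes(adt_csv)
    if rna_cells != adt_cells:
        raise ValueError("RNA and ADT matrices do not list the same barcodes in the same order")
    # Number of cells shared by both matrices.
    return len(rna_cells)
```

Calling check_same_cells with the paths of the two CSV files returns the number of cells when the barcodes agree and raises an error otherwise.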

vanhoan310 commented 1 year ago

Thanks a lot! Which R library provides the function writeHDF5Array?

liuchunlei0430 commented 1 year ago

library(HDF5Array)

You can install it using the following commands:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("HDF5Array")

vanhoan310 commented 1 year ago

Dear developers,

How can I use Matilda to train on and classify cell types in CITE-seq data (RNA + ADT)? I followed the tutorial and omitted the --atac parameter, but it does not work.

Thanks!

liuchunlei0430 commented 1 year ago

To train on CITE-seq data:

python main_matilda_train.py --rna ../data/TEAseq/train_rna.h5 --adt ../data/TEAseq/train_adt.h5 --cty ../data/TEAseq/train_cty.csv  # training on CITE-seq

To classify CITE-seq data:

python main_matilda_task.py --rna ../data/TEAseq/test_rna.h5 --adt ../data/TEAseq/test_adt.h5 --cty ../data/TEAseq/test_cty.csv --classification True --query True  # classification for CITE-seq

Hope this helps.
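For scripted runs, the two CITE-seq commands can also be assembled in Python. This is only a sketch: the script names, flags, and paths are the ones quoted in this thread, while build_matilda_command is a hypothetical helper that is not part of Matilda.

```python
import shlex


def build_matilda_command(script, rna, adt, cty, extra_flags=()):
    """Assemble a Matilda command line for RNA + ADT input (no --atac flag)."""
    command = ["python", script, "--rna", rna, "--adt", adt, "--cty", cty]
    command.extend(extra_flags)
    return command


train_cmd = build_matilda_command(
    "main_matilda_train.py",
    "../data/TEAseq/train_rna.h5",
    "../data/TEAseq/train_adt.h5",
    "../data/TEAseq/train_cty.csv",
)
classify_cmd = build_matilda_command(
    "main_matilda_task.py",
    "../data/TEAseq/test_rna.h5",
    "../data/TEAseq/test_adt.h5",
    "../data/TEAseq/test_cty.csv",
    extra_flags=["--classification", "True", "--query", "True"],
)

# Print the commands in a form that can be pasted into a shell.
print(shlex.join(train_cmd))
print(shlex.join(classify_cmd))
```

Either list can be passed to subprocess.run from the directory that contains the Matilda scripts.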

vanhoan310 commented 1 year ago

Thanks for your fast reply. I followed your instructions but got the following error:

Traceback (most recent call last):
  File "main_matilda_task.py", line 136, in <module>
    atac_name = h5py.File(atac_data_path,"r")['matrix/features'][:]
NameError: name 'atac_data_path' is not defined

liuchunlei0430 commented 1 year ago

Thanks for reporting this. I have updated the code as follows:

# Always read the RNA feature names; read ADT and ATAC names only when those inputs were given.
rna_name = h5py.File(rna_data_path, "r")['matrix/features'][:]
if args.adt != "NULL":
    adt_name = h5py.File(adt_data_path, "r")['matrix/features'][:]
if args.atac != "NULL":
    atac_name = h5py.File(atac_data_path, "r")['matrix/features'][:]

Re-downloading the code should resolve this problem.
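The guard in this fix can be illustrated in isolation. The sketch below assumes that a modality flag left unset defaults to the string "NULL", as the comparisons in the updated code suggest; parse_modalities is a hypothetical helper and does not reproduce Matilda's actual argument parser.

```python
import argparse


def parse_modalities(argv):
    """Return a dict of the modalities whose paths were supplied on the command line."""
    parser = argparse.ArgumentParser()
    # Assumption: each modality flag defaults to the sentinel string "NULL".
    parser.add_argument("--rna", default="NULL")
    parser.add_argument("--adt", default="NULL")
    parser.add_argument("--atac", default="NULL")
    args = parser.parse_args(argv)
    # Keep only modalities that were actually given, so later code never
    # touches a path variable for a modality that is absent.
    return {name: path for name, path in vars(args).items() if path != "NULL"}
```

With only --rna and --adt supplied, the returned dict has no "atac" entry, so any ATAC-specific loading can be skipped.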

vanhoan310 commented 1 year ago

It works now. Thanks a lot.

hongfeiZhang-source commented 10 months ago

Hello, I downloaded the TEA-seq dataset mentioned in your article for replication purposes. However, based on the README file provided, I'm still unsure how to preprocess the data into a format suitable for input to the model. Could you please share the steps you used to prepare the data for model input? Thank you.

PYangLab / Matilda
