bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

How to choose rownames to use for cellranger .h5 file #30

Closed Dario-Rocha closed 1 year ago

Dario-Rocha commented 1 year ago

Hello again,

When loading an .h5 file with the Seurat function Read10X_h5, we can choose which annotation to use for the row names and, for example, use gene symbols instead of ENSG IDs. In contrast, when reading an .h5 file with the BPCells function open_matrix_10x_hdf5, the matrix is always loaded with ENSG IDs as row names. I am not familiar with .h5 files, and I can't find a way to get gene symbols as row names when loading the matrix with the BPCells package.

bnprks commented 1 year ago

Currently, there is no way to choose to read the gene symbols from within BPCells; it will always read the gene IDs (ENSG) for the row names. However, there are two options available to you:

  1. There's a function canonical_gene_symbol() in BPCells, which can translate most gene IDs into their corresponding canonical symbol as defined by HGNC. I think not every gene in the 10x matrices has a canonical symbol, e.g. for some of the lncRNAs, but it should cover everything with a canonical name as of late 2022.
  2. You can manually read the gene symbols from the 10x file, then use rownames(bpcells_mat) <- gene_symbols to change the row names on the BPCells object (you'll want to make sure you write the matrix to disk after setting the row names). To do this in R, I'd recommend the hdf5r package. From the 10x documentation, the path in the 10x file you'd want is matrix/features/name.

Hope one of those works for you! It might require reading a bit of documentation, but I highly recommend the hdf5r package, and hopefully it will be clear how to use it; the simple example on their GitHub page is quite good. It's very useful to be able to dig around in HDF5 files yourself (the h5ls command line tool is also handy for looking at HDF5 file structure).
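
For option 1, a minimal sketch might look like the following. It assumes a human dataset and a placeholder file path (filtered_feature_bc_matrix.h5), and it assumes that canonical_gene_symbol() returns NA for IDs it cannot map, which is worth checking against the BPCells documentation for your installed version. The helper symbols_or_ids() is hypothetical and not part of BPCells; it only keeps the original ID wherever no symbol is found.

```r
# Hypothetical helper (not part of BPCells): use the symbol where one exists,
# otherwise fall back to the original gene ID, then make the names unique.
symbols_or_ids <- function(ids, symbols) {
  symbols <- unname(as.character(symbols))
  missing <- is.na(symbols) | symbols == ""
  symbols[missing] <- ids[missing]
  make.unique(symbols)
}

# Small self-contained illustration with made-up IDs and symbols
symbols_or_ids(c("ENSG_A", "ENSG_B", "ENSG_C"), c("GENE1", NA, "GENE1"))

# Applying it to a 10x file (the path is a placeholder; assumes human data)
h5_path <- "filtered_feature_bc_matrix.h5"
if (requireNamespace("BPCells", quietly = TRUE) && file.exists(h5_path)) {
  mat <- BPCells::open_matrix_10x_hdf5(h5_path, feature_type = "Gene Expression")
  ids <- rownames(mat)
  # Unmapped IDs may produce a warning; it is suppressed here for brevity
  symbols <- suppressWarnings(BPCells::canonical_gene_symbol(ids))
  rownames(mat) <- symbols_or_ids(ids, symbols)
}
```

Falling back to the ID rather than leaving NA keeps every row addressable by name, and make.unique() guards against two IDs that map to the same symbol. As with option 2, the renamed matrix would still need to be written to disk for the new row names to persist.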

Dario-Rocha commented 1 year ago

Thank you for your help. I think I've made a lot of progress in understanding the idea behind h5 files and the BPCells structure. I've managed to get the desired gene symbols from the h5 file, but I am failing to save the modified matrix. When using the code below, the row names are the desired ones on the temp_data object in R, but after saving it with BPCells and reloading it, the row names are the original ones. I am sorry if this is something quite basic; I've gone through the hdf5r and BPCells documentation and I can't really understand how to do this right.

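  library(hdf5r)
  library(BPCells)

  # get Cell Ranger row names from the .h5 file
  # (temp_file and temp_matrix_dir are paths defined elsewhere)
  temp_h5 <- H5File$new(temp_file, mode = 'r') # open the file read-only
  temp_symbol <- temp_h5[['matrix']][['features']][['name']][] # extract the gene names
  temp_type <- temp_h5[['matrix']][['features']][['feature_type']][] # feature type of each row
  temp_h5$close_all() # release the file handle

  # load 10x data with BPCells
  temp_data <- open_matrix_10x_hdf5(temp_file, feature_type = "Gene Expression")
  # keep only the names of the rows BPCells loaded, in case the file has other feature types
  rownames(temp_data) <- temp_symbol[temp_type == "Gene Expression"]
  write_matrix_dir(mat = temp_data, dir = temp_matrix_dir, overwrite = TRUE)
  temp_data <- open_matrix_dir(temp_matrix_dir)

bnprks commented 1 year ago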

This looks like you're hitting issue #29, so I think it should be fixed if you re-install BPCells.

Dario-Rocha commented 1 year ago

Reinstalling BPCells solved this issue and another issue I was having when trying to work with h5 files generated by SoupX

bnprks commented 1 year ago

Great! I'll mark this as completed then