Open romanhaa opened 5 years ago
Now that I was able to load the data, I'm asking myself what this matrix is. Shouldn't there have been spliced, unspliced, and ambiguous? Sorry, maybe I'm completely on the wrong track.
Hi Roman,
thanks for looking into it... it seems you are going to be the alpha-tester for this feature! :) Sorry for the lack of documentation, and for the breaking of the matrix market format. Basically, I dumped the spliced, unspliced and ambiguous counts as three columns in this file i.e. col3=spliced, col4=unspliced, col5=ambiguous
Cheers Alex
Ohhhh, I see. I guess in that case I can reproduce the sparse matrix like this:
t <- readr::read_delim('matrix.mtx', delim = ' ', skip = 3, col_names = FALSE)
spliced <- Matrix::sparseMatrix(
i = t$X1,
j = t$X2,
x = t$X3,
dims = c(60609, 6794880)
)
Matrix::writeMM(spliced, 'spliced.mtx')
unspliced <- Matrix::sparseMatrix(
i = t$X1,
j = t$X2,
x = t$X4,
dims = c(60609, 6794880)
)
Matrix::writeMM(unspliced, 'unspliced.mtx')
The matrix dimensions are taken from the header of the matrix.mtx
file. It's looking good so far. No idea what the scanpy method gave me then :D
I see high correlation between the results of velocyto (input was the Cell Ranger .bam
file) and STARsolo, both for spliced and unspliced counts. Less so for kallisto. There are some interesting differences in the number of transcripts and detected genes.
Number of transcripts:
Number of detected genes:
Correlation of spliced counts (by genes; 0's are set to 0.1 for log-scale):
Correlation of unspliced counts (by genes; 0's are set to 0.1 for log-scale):
Hi Roman,
thanks! Looks similar to what I was getting for STARsolo. I could not fully reproduce the velocyto logic, unfortunately. I was communicating with Gioele La Manno about it, but I guess we both got busy and dropped the ball. I am planning to start working on it again. If you make these results public, it will put a bit more pressure on us. :)
When I tried to run kallisto with velocyto calculation, it used 68GB of RAM. I am going to ask Lior about it on tweeter. :)
Cheers Alex
Yes, I'd like to know Lior's opinion as well. kallisto took a lot longer to run this time. I have no reliable numbers (will have to repeat) but kallisto took a significant hit in performance for this analysis. Generating the index alone took 2.5 hours (and it is 42GB in size). STARsolo didn't need a new index and finished the analysis in 1 hour with 20 threads (I get 9.5 CPU hours in total).
If you make these results public, it will put a bit more pressure on us.
I'll do my best but I'm not sure which form would be appropriate.
Cheers, Roman
Hi Roman,
If without velocyto STARsolo took 35 min, and with 1 hour - it seems reasonable. It's a bit slower than my expectation (~50% increase in time) - of course, the actual results depend strongly on the hardware. The velocyto counting is not parallelized yet, I hope to improve it in the future.
I am afraid to give advice about the form or forum. Tweeter discussions tend to become overheated... some people just cannot take others criticizing their work (not me : ) - I am happy to hear any kind of critique, as it always helps to improve the results). I think you may have enough material for a benchmarking paper of your own. It would probably be best if you contacted all Authors for optimal parameters etc. We can also have an e-mail discussion with Velocyto Authors. Gioele La Manno was the 1st Author, and Peter Kharchenko and Sten Linnarson the senior Authors.
Cheers Alex
Yes the run time is absolutely reasonable. Actually, for the spliced/unspliced counts STARsolo appears to be the fastest of all three methods. But further improvements would of course be appreciated :)
I appreciate your transparency and open-minded attitude towards these comparisons. Indeed, Twitter discussions are not ideal. I don't want to side with anybody and just present my perspective as a user. I'll have to think about the possibility of putting together a paper. Thanks for your suggestions and the offer to reach out to Gioele La Manno and the other authors of Velocyto. Optimizing the parameters for each tool would probably be an important thing to do before publishing performance comparisons.
Btw, what do you think about the difference in ratio (spliced/unspliced counts) between STARsolo vs Velocyto + kallisto? I found that to be quite striking.
Cheers, Roman
Hi Roman,
sorry for the belayed reply, I was swamped by the Genome Informatics conference.
I appreciate your transparency and open-minded attitude towards these comparisons. Indeed, Twitter discussions are not ideal. I don't want to side with anybody and just present my perspective as a user. I'll have to think about the possibility of putting together a paper. Thanks for your suggestions and the offer to reach out to Gioele La Manno and the other authors of Velocyto. Optimizing the parameters for each tool would probably be an important thing to do before publishing performance comparisons.
Absolutely! And I would welcome any feedback from the user perspective.
Btw, what do you think about the difference in ratio (spliced/unspliced counts) between STARsolo vs Velocyto + kallisto? I found that to be quite striking.
Indeed, the difference seems on the large sides. I have not yet looked at kallisto/bustools results carefully. I have made my comparisons with velocyto, and I remember that the total counts: spliced+unspliced+ambiguous were very close, but there was some exchange of counts between spliced and ambiguous. Just to clarify, by the number of transcripts you mean "number of molecules" - i.e. UMIs? And you are comparing the counts for the ~10k cells that were filtered by velocyto?
Cheers Alex
Just for reference/completion, here is a python way to do that:
import numpy as np
from scipy import sparse
mtx = np.loadtxt('matrix.mtx', skiprows=3, delimiter=' ')
spliced = sparse.coo_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
unspliced = sparse.coo_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
ambiguous = sparse.coo_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
The shape (60558,737280)
is taken from the matrix header
Thanks to @msbentsen :-)
Thanks @jenzopr, I am still quite new to python, but was wondering how you merged this to the adata gene count matrix for velocity analysis using scVelo?
@Jachimdejonghe scvelo has a nice utility function for this: scvelo.utils.merge
, find the help here
@jenzopr Thanks ! To be more specific, did you manage to create a anndata file with both spliced and unspliced matrices as layers (to be later merged with the regular count matrix using the scvelo.utils.merge utility)? Or do you just merge twice (once with the spliced and once with the unspliced matrix)? Sorry if this is a stupid question but I haven't found an easy way to do this elegantly. I am trying to go through making a loom file via loompy but it doesn't work well at the moment. Any help is really appreciated, if of course you have been using scVelo :) . @alexdobin it would be really awesome if you would consider outputting the velocity results in a loom file with three layers: spliced, unspliced and ambiguous (like velocyto does). Of course not urgent but I would think this is useful for both Seurat and scVelo users.
@Jachimdejonghe using the AnnData constructor, you can initialise the object with all three layers:
adata = anndata.AnnData(layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
Afterwards you can merge with an existing anndata. Is this what you were looking for? Ref: https://anndata.readthedocs.io/en/stable/anndata.AnnData.html#anndata.AnnData
Nice thanks a lot, yes I saw you could load those in the anndata object but still go some errors that I couldn't solve (because I am no good in python haha), so ultimately I went with the following (Taken the Gene/raw and Velocyto/raw files), with 55471,6909 taken from the matrix header:
adata = sc.read_10x_mtx('./STARSolo/Gene/raw') mtx = np.loadtxt('./STARSolo/Velocyto/raw/matrix.mtx', skiprows=3, delimiter=' ') adata.layers['spliced'] = sparse.coo_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = (55471,6909)).tocsr().T adata.layers['unspliced'] = sparse.coo_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = (55471,6909)).tocsr().T adata.layers['ambiguous'] = sparse.coo_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = (55471,6909)).tocsr().T adata.write_loom("./STARSolo/STARSolo.loom",write_obsm_varm=True)
Certainly, this works as well. What I like about the scvelo.utils.merge
is, that it will get rid of the unmatched cells in spliced and unspliced. This way, you can use the Gene/filtered
matrices..
@Jachimdejonghe Thank you for your workaround. I managed to create a loom file successfully. However, when I load it into R to use it with Seurat, it gives me this error
reading loom file via hdf5r...
Error in `[[.H5File`(f, "col_attrs/CellID") :
An object with name col_attrs/CellID does not exist in this group
Any ideas on how to fix this ?
Just for reference/completion, here is a python way to do that:
import numpy as np from scipy import sparse mtx = np.loadtxt('matrix.mtx', skiprows=3, delimiter=' ') spliced = sparse.coo_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280)) unspliced = sparse.coo_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280)) ambiguous = sparse.coo_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = (60558,737280))
The shape
(60558,737280)
is taken from the matrix header Thanks to @msbentsen :-)
I found this discussion very useful, I just wanted to add something to @jenzopr code
I would recomend to use csr_matrix
in place of coo_matrix
since the latter does not properly supports indexing and this would cause some errorrs when triyng to subset the object or running scvelo.utils.merge
Also, In place of manually specifing the shape of the matrix, I would extract it like that:
shape = np.loadtxt('matrix.mtx', skiprows=2, max_rows = 1, delimiter=' ')[0:2].astype(int)
Hi All,
the 2.7.8a release now outputs separate matrices for Velocyto spliced/unspliced/ambiguous counts. Not sure if this is very helpful, especially if you already have your pipelines running, but I thought that it would be a bit less confusing. If necessary, I could add an option to output the combined matrix.mtx.
Cheers Alex
Hi Alex, that's great and really helpful. I think adding the combined matrix is also quite useful.
Thanks a lot
@bapoorva It's the label issue.
https://github.com/yyoshiaki/mishima_gassyuku/blob/master/README.md
I followed this one.
import loompy
ds = loompy.connect("out.loom") ds.row_attrs['Gene'] = ds.row_attrs['var_names'] ds.col_attrs['CellID'] = ds.col_attrs['obs_names'] ds.close()
Then it can be imported by Seurat.
Hi All,
the 2.7.8a release now outputs separate matrices for Velocyto spliced/unspliced/ambiguous counts. Not sure if this is very helpful, especially if you already have your pipelines running, but I thought that it would be a bit less confusing. If necessary, I could add an option to output the combined matrix.mtx.
Cheers Alex
Might anyone have updated code for inputting the split matrices into anndata?
I've got the adata object set up but I'm struggling to add the separate Velocyto spliced/unspliced/ambiguous matrices from 2.7.9a as layers.
Update: solved with the help of ftucos 👍
Hi All, the 2.7.8a release now outputs separate matrices for Velocyto spliced/unspliced/ambiguous counts. Not sure if this is very helpful, especially if you already have your pipelines running, but I thought that it would be a bit less confusing. If necessary, I could add an option to output the combined matrix.mtx. Cheers Alex
Might anyone have updated code for inputting the split matrices into anndata?
I've got the adata object set up but I'm struggling to add the separate Velocyto spliced/unspliced/ambiguous matrices from 2.7.9a as layers.
Update: solved with the help of ftucos 👍
I'll post here the code the code we shared by email, maybe somebody could find it useful too.
You just need to point to the STARsolo output folder adata = buildAnndataFromStar('data/sample1.Solo.out/')
# Load required libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy
from scipy import sparse
import anndata
import scanpy as sc
import scvelo as scv
scv.set_figure_params()
os.chdir('/path/to/work/dir')
The following function is for older versions of STARsolo that generate only one sparse matrix with 3 value columns for spliced, unspliced and ambiguous.
def buildAnndataFromStar(path):
"""Generate an anndata object from the STAR aligner output folder"""
path=path
# Load Read Counts
X = sc.read_mtx(path+'Gene/raw/matrix.mtx')
# Transpose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
X = X.X.transpose()
# This matrix is organized as a sparse matrix with Row, Cols and 3 values columns for
# Spliced, Unspliced and Ambigous reads
mtx = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=3, delimiter=' ')
# Extract sparse matrix shape informations from the third row
shape = np.loadtxt(path+'Velocyto/raw/matrix.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
# Read the sparse matrix with csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
# Subract -1 to rows and cols index because csr_matrix expects a 0 based index
# Traspose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
spliced = sparse.csr_matrix((mtx[:,2], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
unspliced = sparse.csr_matrix((mtx[:,3], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
ambiguous = sparse.csr_matrix((mtx[:,4], (mtx[:,0]-1, mtx[:,1]-1)), shape = shape).transpose()
# Load Genes and Cells identifiers
obs = pd.read_csv(path+'Velocyto/raw/barcodes.tsv',
header = None, index_col = 0)
# Remove index column name to make it compliant with the anndata format
obs.index.name = None
var = pd.read_csv(path+'Velocyto/raw/features.tsv', sep='\t',
names = ('gene_ids', 'feature_types'), index_col = 1)
# Build AnnData object to be used with ScanPy and ScVelo
adata = anndata.AnnData(X = X, obs = obs, var = var,
layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
adata.var_names_make_unique()
# Subset Cells based on STAR filtering
selected_barcodes = pd.read_csv(path+'Gene/filtered/barcodes.tsv', header = None)
adata = adata[selected_barcodes[0]]
return adata.copy()
After importing your andata object, you can eventually export it for a quicker load later.
adata.write('processed/sample_1.h5ad', compression = 'gzip')
adata = anndata.read_h5ad('processed/sample_1.h5ad')
Here you have the updated function from @JBreunig to load the 3 individual matrices generated in later versions of STARsolo
def buildAnndataFromStarCurr(path):
"""Generate an anndata object from the STAR aligner output folder"""
path=path
# Load Read Counts
X = sc.read_mtx(path+'Gene/raw/matrix.mtx')
# Transpose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
X = X.X.transpose()
# Load the 3 matrices containing Spliced, Unspliced and Ambigous reads
mtxU = np.loadtxt(path+'Velocyto/raw/unspliced.mtx', skiprows=3, delimiter=' ')
mtxS = np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=3, delimiter=' ')
mtxA = np.loadtxt(path+'Velocyto/raw/ambiguous.mtx', skiprows=3, delimiter=' ')
# Extract sparse matrix shape informations from the third row
shapeU = np.loadtxt(path+'Velocyto/raw/unspliced.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
shapeS = np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
shapeA = np.loadtxt(path+'Velocyto/raw/ambiguous.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
# Read the sparse matrix with csr_matrix((data, (row_ind, col_ind)), shape=(M, N))
# Subract -1 to rows and cols index because csr_matrix expects a 0 based index
# Traspose counts matrix to have Cells as rows and Genes as cols as expected by AnnData objects
spliced = sparse.csr_matrix((mtxS[:,2], (mtxS[:,0]-1, mtxS[:,1]-1)), shape = shapeS).transpose()
unspliced = sparse.csr_matrix((mtxU[:,2], (mtxU[:,0]-1, mtxU[:,1]-1)), shape = shapeU).transpose()
ambiguous = sparse.csr_matrix((mtxA[:,2], (mtxA[:,0]-1, mtxA[:,1]-1)), shape = shapeA).transpose()
# Load Genes and Cells identifiers
obs = pd.read_csv(path+'Velocyto/raw/barcodes.tsv',
header = None, index_col = 0)
# Remove index column name to make it compliant with the anndata format
obs.index.name = None
var = pd.read_csv(path+'Velocyto/raw/features.tsv', sep='\t',
names = ('gene_ids', 'feature_types'), index_col = 1)
# Build AnnData object to be used with ScanPy and ScVelo
adata = anndata.AnnData(X = X, obs = obs, var = var,
layers = {'spliced': spliced, 'unspliced': unspliced, 'ambiguous': ambiguous})
adata.var_names_make_unique()
# Subset Cells based on STAR filtering
selected_barcodes = pd.read_csv(path+'Gene/filtered/barcodes.tsv', header = None)
adata = adata[selected_barcodes[0]]
return adata.copy()
Thanks to @ftucos, I came up with this new way to implement our spliced, unspliced and ambiguous file to our data, just some very simple codes for the starters.
And many thanks for team STAR for this wonderful development so we don't have to generate a loom file for scVelo.
import scvelo as scv
import scanpy as sc
import sys
import numpy as np
from scipy import sparse
def creation_newadata(path):
'''load_adata is a repertory which contains all of our barcodes file, features file and matrix file'''
path=path
adata = sc.read_10x_mtx(path+'Gene/raw/')
spliced=np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=3, delimiter=' ')
shape = np.loadtxt(path+'Velocyto/raw/spliced.mtx', skiprows=2, max_rows = 1 ,delimiter=' ')[0:2].astype(int)
adata.layers['spliced']=sparse.csr_matrix((spliced[:,2], (spliced[:,0]-1, spliced[:,1]-1)), shape = (shape)).tocsr().T
unspliced=np.loadtxt(path+'Velocyto/raw/unspliced.mtx', skiprows=3, delimiter=' ')
adata.layers['unspliced']=sparse.csr_matrix((unspliced[:,2], (unspliced[:,0]-1, unspliced[:,1]-1)), shape = (shape)).tocsr().T
ambiguous= np.loadtxt(path+'Velocyto/raw/ambiguous.mtx', skiprows=3, delimiter=' ')
adata.layers['ambiguous']=sparse.csr_matrix((ambiguous[:,2], (ambiguous[:,0]-1, ambiguous[:,1]-1)), shape = (shape)).tocsr().T
return adata
With this data, we can use it for scanpy calculate as same as scVelo.
Cheers XIN
@Jachimdejonghe Thank you for your workaround. I managed to create a loom file successfully. However, when I load it into R to use it with Seurat, it gives me this error
reading loom file via hdf5r... Error in `[[.H5File`(f, "col_attrs/CellID") : An object with name col_attrs/CellID does not exist in this group
Any ideas on how to fix this ?
hi,@bapoorva I also have this question,have you fix it?
Hi,@ftucos , Thank you very much, buildAnndataFromStarCurr is very useful for generate anndata,but I need to combind Velocyto spliced/unspliced/ambiguous counts with Seurat data by Velocyto.R. So I need a to generate a loom file like Velocyto produced. Do you have any suggestions? Thanks a lot
Hi,@ftucos , Thank you very much, buildAnndataFromStarCurr is very useful for generate anndata,but I need to combind Velocyto spliced/unspliced/ambiguous counts with Seurat data by Velocyto.R. So I need a to generate a loom file like Velocyto produced. Do you have any suggestions? Thanks a lot
Hi, @ZJ-zuojing, I would recommend you use an HDF5 viewer to check the loom file generated and check that it's compliant with the loom file specification https://linnarssonlab.org/loompy/format/index.html From your error message, it seems your loom file is missing the "col_attrs" field.
Any faster loading than np.loadtxt
?
Hi Alex,
Following your question on Twitter I started a comparison for the spliced/unspliced counts from velocyto, STARsolo and kallisto. Unfortunately, when trying to read in the "Velocyto" matrix generated by STARsolo, I'm getting this error:
Any idea what the issue could be? The
mtx
file is ~1 GB in size.Using scanpy's
read_mtx
function works fine.For those running into the same problem, you can load the
.mtx
file generated by STARsolo and then save it again using scanpy/scipy functions as shown below. Afterwards, it will be readable byMatrix::readMM()
in R.Best, Roman