Issue in flattened upper triangle of coexpression matrix

brianhie / trajectorama

Multi-study integration of cellular trajectories

http://trajectorama.csail.mit.edu

MIT License


Issue in flattened upper triangle of coexpression matrix #13

Closed: faniafeby closed this issue 4 years ago

faniafeby commented 4 years ago

Hello, I am a student starting a scRNA-seq project that uses Trajectorama for part of the analysis. As my input for X, I used 3,000 cells (adata_red.X) subsampled with sc.pp.subsample from a dataset with 3 time points, and I used adata_red.obs['sample'] as my 'studies' variable. For the analysis I followed the provided basic API, with a slight change in this part:

X_coexpr = np.concatenate([
    Xs_coexpr_i[triu_idx].flatten() for Xs_coexpr_i in Xs_coexpr
])

However, I may have a problem with the shape of X_coexpr: concatenating the csr_matrix objects creates the object shown below:

<1x558414780 sparse matrix of type '<class 'numpy.float64'>'
    with 152874082 stored elements in Compressed Sparse Row format>

As a result, when I pass it to Scanpy, it raises an error because n_comps = 0 and the KNN graph cannot be built. Could you please help me resolve this error? Thank you!

brianhie commented 4 years ago

There might be a problem with trying to np.concatenate sparse matrices? Could you try scipy.sparse.vstack instead?

faniafeby commented 4 years ago

I tried scipy.sparse.vstack, but it raised ValueError: blocks must be 2-D. I then tried np.vstack instead, but the resulting X_coexpr matrix is
<4x139603695 sparse matrix of type '<class 'numpy.float64'>' with 152874082 stored elements in Compressed Sparse Row format>
and the plot I generated in Scanpy looks odd. My next question is: what is the expected shape of the concatenated matrix, and of the final matrix before it is put into the AnnData object? Thank you!

brianhie commented 4 years ago

The desired output is a matrix with one row per coexpression matrix and one column per coexpression dimension (number of matrices x coexpression dimension). I'll close the issue since it doesn't seem to be a bug with Trajectorama, but I'm happy to answer more questions.
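As a rough illustration of that shape, here is a minimal sketch (not Trajectorama's own code) that builds one row per coexpression matrix from its flattened upper triangle. The helper name stack_upper_triangles and the toy 5 x 5 matrices are assumptions made for illustration. The sketch densifies each sparse matrix before indexing so that the result is a plain 1-D array, which sidesteps differences in how sparse fancy indexing behaves.

```python
import numpy as np
from scipy import sparse

def stack_upper_triangles(coexpr_mats):
    """Stack the flattened upper triangle of each square matrix as one row.

    Hypothetical helper for illustration; not part of Trajectorama.
    """
    n_features = coexpr_mats[0].shape[0]
    triu_idx = np.triu_indices(n_features)  # includes the diagonal
    rows = []
    for mat in coexpr_mats:
        # Densify sparse inputs so fancy indexing returns a plain 1-D array.
        dense = mat.toarray() if sparse.issparse(mat) else np.asarray(mat)
        rows.append(dense[triu_idx])
    # Result shape: (number of matrices, n_features * (n_features + 1) / 2)
    return np.vstack(rows)

# Toy stand-in for Xs_coexpr: 4 sparse 5 x 5 matrices.
toy_mats = [sparse.csr_matrix(np.eye(5) * (k + 1)) for k in range(4)]
X_coexpr = stack_upper_triangles(toy_mats)
print(X_coexpr.shape)
```

Densifying is memory-hungry for large gene sets (a single 16,709 x 16,709 float64 matrix is roughly 2 GB), so this is only practical as written when the number of genes is modest.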

Also, perhaps you might try some other analyses before Trajectorama? If you are looking for integration methods, there are a few good tools out there like Scanorama. For trajectory methods, there's PAGA and a number of others. Scanpy (http://scanpy.readthedocs.io/) has good tutorials on basic single-cell analysis.

faniafeby commented 4 years ago

My data come from 3 different time points, with ten samples in total. My supervisor recommended using Trajectorama without prior batch correction (Scanorama, ComBat, etc.) within each time point, I think to preserve biological variance, with the aim of building a trajectory across the 3 time points, which come from 3 different studies. As my input, I used the adata.X matrix with 3,000 subsampled cells and the 'sample' annotation as my 'studies' input. My output (Xs_coexpr) is a list of 4 arrays, each of shape 16,709 x 16,709 (the same as n_vars of my AnnData object). Is this what Trajectorama is expected to produce? Thanks a lot!

brianhie commented 4 years ago

I see, makes sense. That output is expected, but I'd highly recommend two things: (1) give the algorithm all the cells and (2) restrict the analysis to the top ~1-2k highly variable genes. Also, there's currently a min_cluster_samples parameter that filters out clusters below 500 cells, which is probably removing all the clusters in your data. So maybe set that parameter to something lower like 50 or 100. The Xs_coexpr should have the number of rows equal to the number of coexpression matrices and the number of columns equal to the number of genes squared.
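To make the effect of restricting genes concrete, here is a small sketch under stated assumptions: it ranks genes by plain variance as a stand-in for a proper highly-variable-gene method, it assumes a dense cells x genes array, and the function names and toy data are hypothetical. It also counts the columns produced when only the upper triangle (diagonal included) of each coexpression matrix is flattened.

```python
import numpy as np

def top_variable_gene_idx(X, n_top=2000):
    """Return sorted column indices of the n_top highest-variance genes.

    Simple variance ranking, used here as a stand-in for a dedicated
    highly-variable-gene method. Assumes X is a dense cells x genes array.
    """
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)
    n_top = min(n_top, X.shape[1])
    # argsort is ascending, so the last n_top entries have the highest variance.
    return np.sort(np.argsort(variances)[-n_top:])

def n_upper_triangle(n_genes):
    """Number of entries in the upper triangle, diagonal included."""
    return n_genes * (n_genes + 1) // 2

# Toy data: 300 cells x 50 genes (made-up counts for illustration).
rng = np.random.default_rng(0)
X_demo = rng.poisson(1.0, size=(300, 50)).astype(float)
keep = top_variable_gene_idx(X_demo, n_top=10)
X_hvg = X_demo[:, keep]

print(X_hvg.shape)
print(n_upper_triangle(16709), n_upper_triangle(2000))
```

With all 16,709 genes the flattened upper triangle has 139,603,695 columns, which matches the 4 x 139603695 matrix reported earlier in the thread; with 2,000 genes it has 2,001,000.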

Also, if you want a trajectory with single-cell output, then there are other standard trajectory tools to try, like PAGA. But these might result in discontinuities across time points.