Roy-lab / scMTNI

Running scMTNI with prior matrix generated #6

Closed PeterZZQ closed 1 year ago

PeterZZQ commented 1 year ago

Hi,

I wish to run scMTNI with data where the prior matrix has already been generated from the data. The input files that I have include the gene expression data (gene by cell) and the prior GRN matrix (gene by gene) of each cell type. May I know how I can run the method with these two files? Thanks!

sap01 commented 1 year ago

Hi Peter,

Thank you for your interest in applying scMTNI! It is great that you have already generated the gene expression matrix (gene by cell) and the prior matrix (gene by gene). Since scMTNI learns cell cluster-specific GRNs (a cell cluster is a group of cells, for example cells of the same cell type), it requires cell cluster-specific expression matrices (one expression matrix per cell cluster) and a cell lineage (a tree that represents the relationships between the cell clusters). The prior matrices are optional inputs to scMTNI. They must also be cell cluster-specific (one prior matrix per cell cluster), but you can simply reuse the same prior matrix for all cell clusters. The same shortcut does not work for expression matrices, since each cell cluster-specific expression matrix is expected to contain a unique set of cells.
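As a rough illustration of the inputs described above, the sketch below splits one gene-by-cell matrix into per-cluster matrices and reuses a single prior for every cluster. It works on in-memory lists only; the function name `split_by_cluster`, the toy data, and the dictionary layout are assumptions made for illustration and are not part of scMTNI or its file formats.

```python
def split_by_cluster(cell_ids, matrix, labels):
    """Split a gene-by-cell matrix into one gene-by-cell matrix per cluster.

    cell_ids: column names of `matrix`, in order.
    matrix:   list of rows, one row per gene, one value per cell.
    labels:   dict mapping each cell id to its cluster name.
    """
    columns = {}
    for j, cell in enumerate(cell_ids):
        columns.setdefault(labels[cell], []).append(j)
    return {
        cluster: {
            "cells": [cell_ids[j] for j in idx],
            "matrix": [[row[j] for j in idx] for row in matrix],
        }
        for cluster, idx in columns.items()
    }


# Toy data (hypothetical): 2 genes, 4 cells, 2 clusters.
cell_ids = ["c1", "c2", "c3", "c4"]
matrix = [
    [1.0, 2.0, 3.0, 4.0],  # gene1
    [5.0, 6.0, 7.0, 8.0],  # gene2
]
labels = {"c1": "cluster0", "c2": "cluster1", "c3": "cluster0", "c4": "cluster1"}

per_cluster = split_by_cluster(cell_ids, matrix, labels)

# The same gene-by-gene prior can be reused for every cluster.
shared_prior = [[0, 1], [1, 0]]
priors = {cluster: shared_prior for cluster in per_cluster}
```

Each cluster ends up with its own disjoint set of cells, while every entry of `priors` points at the same matrix.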

If you do not have cell clusters yet, you can use any clustering algorithm to cluster your cells based on their gene expression, which splits your expression matrix into cell type-specific expression matrices. Once you have the cell clusters, you can generate a complete undirected graph where each node represents a cell cluster and each edge (u, v) is weighted by the Euclidean distance between the mean expression vectors of the two cell clusters (a mean expression vector has length = #genes, and its i-th element is the mean expression of the i-th gene across all the cells in that cell cluster). From the complete undirected graph, you can calculate a minimum spanning tree (MST). Then choose one of the nodes as the "root" and direct the edges away from it. Suppose your MST is A-B-C-D. If you choose "C" as your root, the resulting directed graph is A<-B<-C->D (sometimes we use biological insight into the underlying system to choose the root node). This directed graph is used as the cell lineage input tree to scMTNI. Please see https://github.com/Roy-lab/scMTNI/#parameter-explanations for how to input a cell lineage to scMTNI.

Alternatively, if you prefer not to do cell clustering, you can use our previously proposed algorithm, MERLIN-P (https://github.com/roy-lab/merlin-p , https://academic.oup.com/nar/article/45/4/e21/2333925). If you choose to apply MERLIN-P, the two files you currently have will be sufficient, and MERLIN-P will generate a single GRN.
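The lineage-tree construction described above (mean expression vectors, Euclidean distances, MST, then rooting) can be sketched in plain Python as follows. This is a minimal illustration, not the scMTNI authors' code: the function names, the use of Prim's algorithm, and the toy clusters A to D are all assumptions chosen to reproduce the A-B-C-D example.

```python
import math
from collections import deque


def mean_vector(cells):
    """Mean expression per gene; `cells` is a list of per-cell vectors."""
    n = len(cells)
    return [sum(values) / n for values in zip(*cells)]


def minimum_spanning_tree(means):
    """Prim's algorithm on the complete graph of clusters.

    means: dict cluster name -> mean expression vector.
    Edge weights are Euclidean distances between mean vectors.
    Returns a list of undirected edges (u, v).
    """
    names = list(means)
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        u, v = min(
            ((a, b) for a in in_tree for b in names if b not in in_tree),
            key=lambda e: math.dist(means[e[0]], means[e[1]]),
        )
        edges.append((u, v))
        in_tree.add(v)
    return edges


def orient_from_root(edges, root):
    """Direct every MST edge away from `root`; returns (parent, child) pairs."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        parent = queue.popleft()
        for child in neighbours.get(parent, []):
            if child not in seen:
                seen.add(child)
                directed.append((parent, child))
                queue.append(child)
    return directed


# Toy clusters (hypothetical): two cells each, two genes, laid out on a line.
clusters = {
    "A": [[0.0, 0.0], [0.0, 0.0]],
    "B": [[1.0, 0.0], [1.0, 0.0]],
    "C": [[2.0, 0.0], [2.0, 0.0]],
    "D": [[3.0, 0.0], [3.0, 0.0]],
}
means = {name: mean_vector(cells) for name, cells in clusters.items()}
mst = minimum_spanning_tree(means)
lineage = orient_from_root(mst, root="C")
```

With "C" as the root, `lineage` holds the parent-to-child pairs C->B, C->D, and B->A, which matches A<-B<-C->D. For many clusters, a library MST routine would likely be a better choice than this quadratic loop.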

Do you already have cell clusters?

Regards, Saptarshi

sap01 commented 1 year ago

Just wanted to add that, unlike CeSpGRN, scMTNI does not infer one GRN for each cell. scMTNI infers one GRN for each cell cluster. Hence, it requires the additional cell clustering step.

PeterZZQ commented 1 year ago

Thank you for the detailed explanation! Yes, I already have the cell clusters and can generate the lineage tree from that.

We designed a simulation tool and wish to run the method because it uses multi-modality data to infer GRNs. We now have three types of .txt files: the cluster-specific gene expression matrices (gene by cell, one .txt file per cell type), the cluster-specific prior networks (gene by gene, one .txt file per cell type), and the cell lineage tree, also in a .txt file. These are all that we have, but it seems that scMTNI also requires other files. I just wish to know how to format these files and write the command.

For example, the lineage tree has a parent cell and a children cell, which should be the two nodes of an edge. But what are the Branch-specific gain rate and Branch-specific loss rate? I'm not sure how to set them.

In step 2, there is an ExampleData/regulators.txt file. May I know what I should store here? Do I have to provide all the TFs in advance? I thought that this information was included in the prior matrix. If I need to provide regulators, can I just use the regulators in the prior matrix?

In step 2, there are also two commands depending on whether you have the prior matrix or not, but the commands seem to be basically the same except for the --motifs parameter. The command also doesn't require the prior matrix as input. It seems that step 3 includes the prior matrix information and all the inference, so I'm a little confused about the purpose of step 2. Can I just run step 3 directly?

In step 3, I also have some questions about the files ExampleData/testdata_config.txt, ExampleData/TFs_OGs.txt, ExampleData/testdata_ogids.txt, etc.

ExampleData/TFs_OGs.txt and ExampleData/AllGenes0.txt seem to be about the ortho-groups of genes. I'm confused about what I should provide here. I see a list of numbers; does it mean the index of the gene in the data matrix? Do I have to provide it? ExampleData/AllGenes0.txt and ExampleData/AllGenes.txt are both used. What is the difference between them?

I also don't know how to set up ExampleData/testdata_ogids.txt. Sorry for the large number of questions. Thank you for your help!

PeterZZQ commented 1 year ago

I managed to run the code on our data. I set the Branch-specific gain rate and Branch-specific loss rate to 0.2, though I don't know whether that is appropriate for our scenario. We have 3 clusters and the tree is linear: 0->1->2. We provide the file celltype_tree_ancestor.txt as

cluster1    cluster0    0.2    0.2
cluster2    cluster1    0.2    0.2

When I ran the method, the terminal printed:

MetaLearner::start
precomputeEmptyGraphPrior
inputRegulatorOGs.size() = 6 inputOGList.size() = 110
ITERATION 0 newScore=-62303.9 diffscore=0 priorChange=0 successMove=0
MetaLearner::dumpAllGraphs
Final Score -62303.9
MetaLearner::start runtime: 11123 ms

When I check the result folder, it contains fold0, but the var_mb_pw_k50.txt files in it are all empty. It seems that the graph was not successfully inferred. May I know what I can do here? Thanks!

I tried another dataset, and it gave me:

Split genes into ogs 0 -> 3
Segmentation fault (core dumped)

I guess it has something to do with the gene names. Since we use simulated data, the genes are named gene1, gene2, etc.

sap01 commented 1 year ago

Hi Peter,

Your celltype_tree_ancestor.txt is correct for the 0->1->2 tree. The gain and loss rates of 0.2 should work fine unless the underlying biological system has gone through a rapid change.
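For reference, the rows of the confirmed file can be generated from parent-to-child edges with a small helper like the one below. The helper name `lineage_rows` is hypothetical, and the tab separator is an assumption (the thread shows whitespace-separated columns without naming the delimiter); the column order of child, parent, gain rate, loss rate follows the file shown earlier in this thread.

```python
def lineage_rows(parent_child_edges, gain=0.2, loss=0.2, sep="\t"):
    """Build one row per edge: child, parent, gain rate, loss rate.

    parent_child_edges: list of (parent, child) pairs.
    sep: column separator (tab is an assumption, not confirmed in the thread).
    """
    return [
        sep.join([child, parent, str(gain), str(loss)])
        for parent, child in parent_child_edges
    ]


# The linear tree 0->1->2 from this thread.
edges = [("cluster0", "cluster1"), ("cluster1", "cluster2")]
rows = lineage_rows(edges)
print("\n".join(rows))
```

Writing `"\n".join(rows)` to celltype_tree_ancestor.txt would reproduce the two-line file discussed here.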

Can you please provide the complete command that you used, along with the values of the {-l, -m, -n} parameters?

Regards, Saptarshi

PeterZZQ commented 1 year ago

Hi Saptarshi,

It works now. I used default values for all parameters. It was just a formatting issue in filelist.txt that caused all the problems. Thank you!

sap01 commented 1 year ago

Hi Peter,

I'm glad that you made it work. Please feel free to let us know if any issues arise.

Regards, Saptarshi