WPZgithub / CEFCON

Deciphering driver regulators of cell fate decisions from single-cell RNA-seq data
MIT License

Lineage information input in the command line usage of CEFCON #6

Open LiuCanidk opened 2 months ago

LiuCanidk commented 2 months ago

Hi @WPZgithub, I have a question about how to provide lineage information when using CEFCON from the command line. As I'm not familiar with Python, I prefer to use the command-line tool. However, when I supply the expression matrix as a CSV file via input_expData, there is no place to provide lineage information, and I could not find any other argument for it.

I guess this information is normally stored in a single-cell object such as a SCANPY AnnData object. But since CEFCON also accepts a CSV file, how can I provide lineage information alongside the CSV file to construct a lineage-specific GRN?

Many thanks in advance for an early reply!

LiuCanidk commented 2 months ago

Besides, I would also like to ask about the differential gene expression input. Could you specify the input format, i.e., what the rows and columns represent? Are there any matching rules between the lineage file, the differential gene expression file, and the expression matrix file?

LiuCanidk commented 2 months ago

What's more, what is the expected format of the expression matrix: raw counts, or data normalized by Seurat or SCANPY? Can I input TPM or log(TP10K+1) data?

WPZgithub commented 2 months ago

I'm sorry for any inconvenience when using CEFCON. I've recently been taking the time to update some of the code, as well as the previous README instructions, to improve user-friendliness.

To briefly address your questions: if you use only the command-line version, the input expression matrix is assumed to belong to a single lineage, so you need to run CEFCON separately for each lineage. For the formats of the expression matrix and the differential expression data, please refer to the example files in the 'example_data' folder. The gene order in the differential expression file does not need to match the order in the expression matrix. As for the data format of the expression matrix: as a deep learning-based method, the model can adapt to the data, so the method does not impose restrictions on the normalization technique used. However, I recommend using normalized data. In the CEFCON paper, we used log(TPM+1) for all experiments.
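As a rough illustration of the "one run per lineage" workflow described above, the following pandas sketch splits a combined matrix into one log(TPM+1) matrix per lineage. It is not part of CEFCON: the function name `split_by_lineage`, the cells-by-genes orientation, and the cell and lineage labels are all assumptions made for the example, so check the files in 'example_data' for the layout CEFCON actually expects.

```python
import numpy as np
import pandas as pd


def split_by_lineage(tpm, lineage):
    """Split a TPM matrix into one log(TPM+1) matrix per lineage.

    tpm     : DataFrame, cells as rows and genes as columns (assumed layout).
    lineage : Series indexed by cell name, holding one lineage label per cell.
    Returns a dict mapping each lineage label to its log-transformed sub-matrix.
    """
    # Align the labels to the rows of the matrix; unlabeled cells become NaN.
    lineage = lineage.reindex(tpm.index)
    per_lineage = {}
    for name in lineage.dropna().unique():
        cells = lineage.index[lineage == name]
        # log1p computes log(x + 1), i.e. the log(TPM+1) transform.
        per_lineage[name] = np.log1p(tpm.loc[cells])
    return per_lineage


# Toy data with hypothetical cell, gene, and lineage names.
tpm = pd.DataFrame(
    [[0.0, 9.0], [1.0, 0.0], [3.0, 99.0]],
    index=["cell1", "cell2", "cell3"],
    columns=["GeneA", "GeneB"],
)
lineage = pd.Series({"cell1": "L1", "cell2": "L2", "cell3": "L1"})

per_lineage = split_by_lineage(tpm, lineage)
for name, mat in per_lineage.items():
    print(name, mat.shape)
    # Each matrix could then be written out and used as a separate input,
    # e.g. mat.to_csv(f"expr_{name}.csv")
```

Each resulting matrix would then be passed to its own command-line run.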

Please let me know if any part requires further clarification or if you have additional queries.

LiuCanidk commented 2 months ago

Thanks for your detailed reply, @WPZgithub. I have another question about the differential gene expression file. How can I obtain it? If I use Seurat, is it simply the result of the FindAllMarkers function, from which I extract the genes for specific clusters (i.e., lineages)?

WPZgithub commented 2 months ago

I have provided a script for obtaining differential expression information. Please refer to MAST_script.R, which uses the MAST method. Any other method for obtaining differentially expressed genes is acceptable, as long as you provide scores for the biologically significant genes (I used abs(logFoldChange) in the CEFCON paper). Please note that separate differential expression scores must be provided for each lineage. CEFCON can also be run without differential expression information, although I do not recommend it.
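For readers who do not use R, here is a minimal Python sketch of what an abs(logFoldChange) score per gene could look like for one lineage. It is not the MAST method and not the script shipped with CEFCON. The two-group split of cells (called `is_group_a` here), the pseudocount, and the function name `abs_logfc_scores` are assumptions chosen for illustration; which cells to compare is a separate decision that this thread does not settle.

```python
import numpy as np
import pandas as pd


def abs_logfc_scores(log_expr, is_group_a, pseudocount=1.0):
    """Absolute log2 fold change per gene between two groups of cells.

    log_expr   : DataFrame of log(TPM+1), cells as rows, genes as columns.
    is_group_a : boolean Series indexed by cell name (True = group A).
    Returns a Series of non-negative scores indexed by gene.
    """
    # Align the grouping to the matrix rows; raises KeyError if a cell is missing.
    is_group_a = is_group_a.loc[log_expr.index].astype(bool)
    # Undo the log(TPM+1) transform so that means are taken on the TPM scale.
    tpm = np.expm1(log_expr)
    mean_a = tpm[is_group_a].mean(axis=0)
    mean_b = tpm[~is_group_a].mean(axis=0)
    logfc = np.log2((mean_a + pseudocount) / (mean_b + pseudocount))
    # Taking the absolute value discards the direction of change.
    return logfc.abs().rename("abs_logFC")


# Toy lineage with hypothetical cells and genes.
tpm = pd.DataFrame(
    [[1.0, 3.0, 5.0], [1.0, 3.0, 5.0], [3.0, 1.0, 5.0], [3.0, 1.0, 5.0]],
    index=["cell1", "cell2", "cell3", "cell4"],
    columns=["GeneA", "GeneB", "GeneC"],
)
log_expr = np.log1p(tpm)
is_group_a = pd.Series([True, True, False, False], index=log_expr.index)

scores = abs_logfc_scores(log_expr, is_group_a)
print(scores)
```

In the toy data, GeneA goes up and GeneB goes down by the same factor, so both receive the same score, while the unchanged GeneC scores zero. The file layout CEFCON expects for these scores should be taken from the files in 'example_data'.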

LiuCanidk commented 2 months ago

Sorry, I still do not understand the differential gene expression input. If I input only one lineage as a CSV file, which groups should be compared to calculate the fold change of each gene? And if you use abs(logFoldChange), why does the direction (sign) of the fold change not matter? Should I calculate pseudotime as shown in the MAST script?

LiuCanidk commented 2 months ago

And do I need to provide a score (logFC) for all genes, or only for significant genes? Or do the genes in the differential expression file need to be the same as those in the expression matrix (that seems to be the case in the example data)?