aertslab / create_cisTarget_databases

Create cisTarget databases

how to create a motif2TF tbl file #28

Closed shangguandong1996 closed 1 year ago

shangguandong1996 commented 1 year ago

Hi, dear developer.

My species is a plant, so it makes no sense to substitute orthologous genes into the human or mouse tbl files. I also have motif files from several different sources that I want to merge. So I would like to create a motif tbl file like the ones at https://resources.aertslab.org/cistarget/motif2tf/ using the motif2TF procedure, but I cannot find a tool or script for this in the original motif2TF paper, "iRegulon: From a Gene List to a Gene Regulatory Network Using Large Motif and Track Collections".

Could you give me some advice on this?

Thanks for your reply.

Guandong Shang

ghuls commented 1 year ago

If you have motifs directly annotated to a TF, you can use lines like this:

#motif_id   motif_name  motif_description   source_name source_version  gene_name   motif_similarity_qvalue similar_motif_id    similar_motif_description   orthologous_identity    orthologous_gene_name   orthologous_species description
jaspar__MA0002.2    MA0002.2    RUNX1   jaspar  2016    RUNX1   0.000000    None    None    1.000000    None    None    gene is directly annotated

If you have motifs that are annotated to a TF in another species, they can be annotated via orthology.

#motif_id   motif_name  motif_description   source_name source_version  gene_name   motif_similarity_qvalue similar_motif_id    similar_motif_description   orthologous_identity    orthologous_gene_name   orthologous_species description
bergman__Su_H_  Su_H_   Su(H)   bergman 1.1     RBPJ    0.000000        None    None    0.722000        FBgn0004837     D. melanogaster gene is orthologous to FBgn0004837 in D. melanogaster (identity = 72%) which is directly annotated for motif

If you have unannotated motifs, you could run TomTom to see how similar they are to known motifs and annotate them that way:

#motif_id   motif_name  motif_description   source_name source_version  gene_name   motif_similarity_qvalue similar_motif_id    similar_motif_description   orthologous_identity    orthologous_gene_name   orthologous_species description
jaspar__MA0001.2    MA0001.2    AGL3    jaspar  2016    MEF2D   0.000211    taipale_cyt_meth__MEF2D_CCWWATWWRG_eDBD_meth    MEF2D [MADS, CpG-meth]  1.000000    None    None    motif similar to taipale_cyt_meth__MEF2D_CCWWATWWRG_eDBD_meth ('MEF2D [MADS, CpG-meth]'; q-value = 0.000211) which is directly annotated
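
The three kinds of rows above can be generated with a small script. The following is only a minimal sketch, not an official tool: the column order and the wording of the description field are copied from the example rows in this comment, tab-separated output is assumed, and the helper names (direct_row, orthology_row, similarity_row, format_tbl) are made up for illustration.

```python
COLUMNS = [
    "#motif_id", "motif_name", "motif_description", "source_name",
    "source_version", "gene_name", "motif_similarity_qvalue",
    "similar_motif_id", "similar_motif_description",
    "orthologous_identity", "orthologous_gene_name",
    "orthologous_species", "description",
]


def direct_row(source, version, motif_name, motif_description, gene):
    """Row for a motif that is directly annotated to a TF."""
    return [
        f"{source}__{motif_name}", motif_name, motif_description, source,
        version, gene, "0.000000", "None", "None", "1.000000", "None", "None",
        "gene is directly annotated",
    ]


def orthology_row(source, version, motif_name, motif_description, gene,
                  identity, ortholog_gene, ortholog_species):
    """Row for a motif annotated to an orthologous gene in another species."""
    description = (
        f"gene is orthologous to {ortholog_gene} in {ortholog_species} "
        f"(identity = {identity:.0%}) which is directly annotated for motif"
    )
    return [
        f"{source}__{motif_name}", motif_name, motif_description, source,
        version, gene, "0.000000", "None", "None", f"{identity:.6f}",
        ortholog_gene, ortholog_species, description,
    ]


def similarity_row(source, version, motif_name, motif_description, gene,
                   qvalue, similar_motif_id, similar_motif_description):
    """Row for a motif annotated through similarity to an annotated motif."""
    description = (
        f"motif similar to {similar_motif_id} "
        f"('{similar_motif_description}'; q-value = {qvalue:.6f}) "
        "which is directly annotated"
    )
    return [
        f"{source}__{motif_name}", motif_name, motif_description, source,
        version, gene, f"{qvalue:.6f}", similar_motif_id,
        similar_motif_description, "1.000000", "None", "None", description,
    ]


def format_tbl(rows):
    """Join the header and all rows into tab-separated text."""
    lines = ["\t".join(COLUMNS)]
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"
```

For example, format_tbl([direct_row("jaspar", "2016", "MA0002.2", "RUNX1", "RUNX1")]) reproduces the header plus the first example row, and the result can be written to a .tbl file. The q-values for similarity rows would have to come from your own TomTom run.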
shangguandong1996 commented 1 year ago

Thanks for your reply. I also have some questions:

  1. Which prefix should the Cluster-Buster motif files use so that they match the motif annotation: motif_name or gene_name? That is, MA0002.2.cb or RUNX1.cb?
  2. What if two motif_name entries are linked to the same gene in the annotation? How does SCENIC deal with that?
  3. Suppose my species has 1500 TFs, but only 500 of them are linked to a motif in the database. Should my TF list contain 500 or 1500 TFs?
ghuls commented 1 year ago
  1. You can choose the names yourself. In our case: motif_id = motif_source__motif_name. So our Cluster-Buster motif files are called motif_source__motif_name.cb (jaspar__MA0002.2.cb with a header >jaspar__MA0002.2).
  2. This is not a problem. In general it is better to have more motifs per TF (assuming your motifs are at least slightly different), as one PWM in general does not capture the full TF binding capacity. E.g. for TP53 a combination of 6 motifs works better (in terms of correspondence with ChIP-seq peaks) than just 1 motif.
  3. Your TF list can still be the full list of 1500 TFs, but for 1000 of them no enriched annotated motifs will be found after the grn step is done.
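
The naming convention from point 1 can be sketched as follows. This is an illustration under assumptions, not our actual pipeline: it assumes the usual Cluster-Buster matrix layout (a ">" header line followed by one row of four A, C, G, T values per motif position), and the function names are hypothetical.

```python
from pathlib import Path


def make_motif_id(source, motif_name):
    """Build a motif_id as motif_source__motif_name."""
    return f"{source}__{motif_name}"


def write_cb_motif(directory, motif_id, counts):
    """Write <motif_id>.cb with the header >motif_id.

    counts is a list of positions, each a sequence of four values
    (assumed order: A, C, G, T).
    """
    lines = [f">{motif_id}"]
    for position in counts:
        if len(position) != 4:
            raise ValueError("each position needs exactly four values")
        lines.append("\t".join(str(value) for value in position))
    path = Path(directory) / f"{motif_id}.cb"
    path.write_text("\n".join(lines) + "\n")
    return path


def header_matches_filename(path):
    """Check that the first line of a .cb file is '>' plus the filename stem."""
    path = Path(path)
    if path.suffix != ".cb":
        return False
    with path.open() as handle:
        first_line = handle.readline().strip()
    return first_line == f">{path.stem}"
```

Running header_matches_filename over every file in a motif directory is a cheap way to catch files such as RUNX1.cb whose header still carries a different motif_id.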
shangguandong1996 commented 1 year ago
  1. Suppose I choose the names myself, for example jaspar__MA0002.2.cb or MA0002.2.cb. Is the motif name in the feature ranking then jaspar__MA0002.2 or MA0002.2? My full TF list contains RUNX1, and the gene name in my scRNA-seq expression matrix is also RUNX1. So how does pySCENIC know that jaspar__MA0002.2 is linked to RUNX1? Does it look this up in the motif2TF tbl file?
  2. You mean that it is better to have more motifs per TF. So can I make a tbl like the one below?

    #motif_id   motif_name  motif_description   source_name source_version  gene_name   motif_similarity_qvalue similar_motif_id    similar_motif_description   orthologous_identity    orthologous_gene_name   orthologous_species description
    jaspar__MA0002.2    MA0002.2    RUNX1   jaspar  2016    RUNX1   0.000000    None    None    1.000000    None    None    gene is directly annotated
    cisbp__MA0002.2 MA0002.2    RUNX1   jaspar  2016    RUNX1   0.000000    None    None    1.000000    None    None    gene is directly annotated

     If these two cb files (jaspar__MA0002.2.cb and cisbp__MA0002.2.cb) have the same gene_name, will the rankings of both motifs be linked to the same TF when pySCENIC runs?
  3. I am just curious about the advantages and disadvantages of using only the 500-TF list. After all, using 500 TFs should save computing resources during the co-expression step.

Thanks again for your detailed reply :)
ghuls commented 1 year ago
  1. Feature_rank motif name is the motif_id (Cluster-Buster filename without .cb).

  2. Having more motifs per TF only makes sense when the motifs available for a certain TF are different: slightly different PWMs (e.g. different binding specificity in different cell types, or a monomer PWM and a dimer PWM). If you don't have different motifs for the same TF, adding exactly the same motif again does not make sense. P.S.: JASPAR is a well-curated collection and is in general included in a lot of other collections (CIS-BP, TRANSFAC, ...), so in that case those PWMs will be the same.

  3. An advantage of using 500 TFs is that GRN inference will likely be faster. A disadvantage might be that it reports that a certain TF regulates e.g. 200 genes, whereas using 1500 TFs might reduce that list to maybe 150 genes, because some of those genes are then assigned to other TFs for which you don't have motifs. So the full list might give you cleaner results. It also depends on the importance of your missing TFs and on your dataset. Try both and see what works best.
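
To see how the motif_id in the rankings relates to gene_name, and which TFs in your list have no motif at all, you could parse the tbl yourself. The sketch below only illustrates the lookup and is not how pySCENIC implements it internally; it assumes a tab-separated tbl with the #motif_id and gene_name columns shown earlier, and the function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def parse_motif2tf(tbl_text):
    """Return (motif_id -> set of TFs, TF -> set of motif_ids) from tbl text."""
    reader = csv.DictReader(
        io.StringIO(tbl_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    motif_to_tfs = defaultdict(set)
    tf_to_motifs = defaultdict(set)
    for record in reader:
        motif_id = record["#motif_id"]
        gene_name = record["gene_name"]
        motif_to_tfs[motif_id].add(gene_name)
        tf_to_motifs[gene_name].add(motif_id)
    # Convert to plain dicts so that lookups of missing keys do not add entries.
    return dict(motif_to_tfs), dict(tf_to_motifs)


def split_tf_list(tf_names, tf_to_motifs):
    """Split a TF list into TFs with and without an annotated motif."""
    with_motifs = [tf for tf in tf_names if tf in tf_to_motifs]
    without_motifs = [tf for tf in tf_names if tf not in tf_to_motifs]
    return with_motifs, without_motifs
```

With the two-row tbl from the question, tf_to_motifs["RUNX1"] would contain both jaspar__MA0002.2 and cisbp__MA0002.2, so both rankings point at the same TF. split_tf_list then gives you the short and the full TF list to compare, as suggested in point 3.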

shangguandong1996 commented 1 year ago

Thanks, I get it :)