Open tropfenameimer opened 3 years ago
hi @macsalvin,
for the two mm10 databases, you can find bed files with the regions here: https://resources.aertslab.org/cistarget/regions/ (e.g. mm10-limited-upstream10000-tss-downstream10000-full-transcript.bed contains the regions used for mm10__refseq-r80__10kb_up_and_down_tss.mc9nr). you can make a fasta file from the bed file using 'bedtools getfasta'.
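A minimal sketch of that 'bedtools getfasta' step, wrapped in Python. The genome and output file names are hypothetical, and it assumes bedtools 2.27 or newer (for `-nameOnly`) and that the BED name column holds the region IDs you want as FASTA headers:

```python
import subprocess


def make_region_fasta(genome_fasta, regions_bed, output_fasta, dry_run=True):
    """Build (and optionally run) a 'bedtools getfasta' call for a regions BED file."""
    cmd = [
        "bedtools", "getfasta",
        "-fi", genome_fasta,   # genome sequence, e.g. an mm10 FASTA (hypothetical path)
        "-bed", regions_bed,   # regions BED downloaded from the cistarget resources
        "-nameOnly",           # assumption: use the BED name column as FASTA header
        "-fo", output_fasta,   # output FASTA with one sequence per region
    ]
    if not dry_run:
        # Only executed when bedtools is installed and dry_run=False.
        subprocess.run(cmd, check=True)
    return cmd


cmd = make_region_fasta(
    "mm10.fa",
    "mm10-limited-upstream10000-tss-downstream10000-full-transcript.bed",
    "mm10-regions.fa",
)
print(" ".join(cmd))
```

With `dry_run=True` the function only prints the command, so it can be checked before running it on the full genome.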
add the motif to the motif2tf table with its name in the 'motif_id' column, the associated TF in the 'gene_name' column, and set 'description' to "gene is directly annotated".
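As a rough sketch, the new row could be appended like this. Only 'motif_id', 'gene_name' and 'description' come from the advice above; the code reads all other column names from the table's own header and fills them with a placeholder ("None" unless overridden), which is an assumption you should check against existing rows. The motif and gene names are hypothetical:

```python
def append_direct_annotation(table_text, motif_id, gene_name, defaults=None):
    """Return the tab-separated motif2tf table with one extra directly annotated row."""
    defaults = defaults or {}
    lines = table_text.rstrip("\n").split("\n")
    header = lines[0].split("\t")
    row = []
    for column in header:
        name = column.lstrip("#")  # the header line may start with '#'
        if name == "motif_id":
            row.append(motif_id)
        elif name == "gene_name":
            row.append(gene_name)
        elif name == "description":
            row.append("gene is directly annotated")
        else:
            # Assumption: placeholder value for every column not named above.
            row.append(defaults.get(name, "None"))
    return "\n".join(lines + ["\t".join(row)]) + "\n"


# Tiny stand-in table; the real .tbl file has more columns.
table = (
    "#motif_id\tgene_name\tmotif_similarity_qvalue\tdescription\n"
    "m1\tPax6\t0.0\tgene is directly annotated\n"
)
updated = append_direct_annotation(
    table, "my_new_motif", "Sox2", {"motif_similarity_qvalue": "0.0"}
)
print(updated)
```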
Thank you very much @tropfenameimer. I built the FASTA following your instructions and it worked, but there are still some issues integrating the original database with the one I created.
Fig. 1: This is part of the original database (mm10__refseq-r80__10kb_up_and_down_tss.mc9nr.feather):
Total genes: 24130
Fig. 2: This is part of the result of my single-motif run:
Total genes after aggregation: 24103
Thank you so much for your help.
hi @macsalvin,
the different regions referring to the same gene are upstream regions, UTRs and introns.
the script create_cistarget_motif_databases.py scores all regions, and then retains only the top scoring region per gene. but you have to specify the option --genes "#[0-9]+$" to get a db with gene names (i.e. without the "#[0-9]+" appended).
so naturally, the ranks in the db you created are higher, because it is much larger.
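To illustrate the idea (this is a sketch, not the script's actual code): the regex is removed from each region ID to get the gene name, and the highest score per gene is kept. The region IDs and scores below are made up:

```python
import re


def top_score_per_gene(region_scores, gene_regex=r"#[0-9]+$"):
    """Collapse region-level scores to gene level by keeping the maximum score."""
    pattern = re.compile(gene_regex)
    gene_scores = {}
    for region_id, score in region_scores.items():
        gene = pattern.sub("", region_id)  # "Sox2#2" -> "Sox2"
        if gene not in gene_scores or score > gene_scores[gene]:
            gene_scores[gene] = score
    return gene_scores


# Hypothetical scores for one motif: three regions for Sox2, one for Pax6.
region_scores = {"Sox2#1": 0.8, "Sox2#2": 2.4, "Sox2#3": 1.1, "Pax6#1": 0.3}
gene_scores = top_score_per_gene(region_scores)
print(gene_scores)  # {'Sox2': 2.4, 'Pax6': 0.3}
```

Four regions become two genes here, which is why a region-level db has many more rows (and therefore larger rank values) than a gene-level one.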
if you are worried that you might get different scores, you could make a database with a few test motifs that are also in our db (e.g. a few jaspar motifs), and compare the rankings. they might not be 100% identical though, because genes with the same or no scores could be switched.
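One possible way to do such a comparison for a single test motif, sketched with made-up gene names and values. The two measures (top-N overlap and largest rank shift) are just one choice for illustration, not something the cisTarget tools provide:

```python
def ranks_from_scores(scores):
    """Rank genes from highest to lowest score (0 = best); ties broken by gene name."""
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered)}


def compare_rankings(ranks_a, ranks_b, top_n):
    """Return (Jaccard overlap of the top_n genes, largest rank shift) on shared genes."""
    shared = set(ranks_a) & set(ranks_b)
    top_a = {g for g in shared if ranks_a[g] < top_n}
    top_b = {g for g in shared if ranks_b[g] < top_n}
    overlap = len(top_a & top_b) / max(len(top_a | top_b), 1)
    max_shift = max((abs(ranks_a[g] - ranks_b[g]) for g in shared), default=0)
    return overlap, max_shift


# Hypothetical data: B and C have the same score, so their order is arbitrary.
own_ranks = ranks_from_scores({"A": 3.0, "B": 2.0, "C": 2.0, "D": 0.0})
reference_ranks = {"A": 0, "C": 1, "B": 2, "D": 3}
overlap, max_shift = compare_rankings(own_ranks, reference_ranks, top_n=2)
print(overlap, max_shift)
```

In this toy case the only difference is the swapped pair of tied genes, which is the kind of harmless mismatch described above.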
unfortunately, we can't publish the full motif collection because it contains non-public motifs like the transfac pro collection. you can find an overview of the collections we use here: http://iregulon.aertslab.org/collections.html#motifcolldownload
Hi @tropfenameimer, I was able to merge the original mouse motif database (feather file) with the single motif that I created. Then, as you suggested, I added the motif to the motif2tf table with its name in the 'motif_id' column, the associated TF in the 'gene_name' column, and set 'description' to "gene is directly annotated". After running "pyscenic ctx" I was expecting to get more target genes for the specific TF, but the results are the same as when I run it with the regular mouse database. I also tried lowering the NES threshold (during the "pyscenic ctx" step) and yes, I get more target genes, but the result is still the same as with the regular database. What step am I missing? Do you have any suggestions?
Thank you for your help
hi @macsalvin,
[...] (--output).
[...] (--rank_threshold). this threshold defaults to 1500, i.e. the top 1500 genes or regions in the motif ranking are considered to create the output target gene set, which is not much when you have a database with many genes / regions. please try to increase it to e.g. 5000.
Hi @tropfenameimer, yes, I checked the --output .csv file (the output of the ctx step) and the motif is not in the file (this means that it is not enriched). What are the main criteria for a motif to be enriched? I was expecting a better result because the gene is directly annotated and the ranks of all the genes related to this motif are higher (in general). Any other suggestions would be great.
Thank you in advance
Hi @tropfenameimer, I need to create the mouse database with an additional motif. To run create_cisTarget_databases, the procedure says that you need the motif collection and the FASTA file. I already have the motif collection including the new motif (in cb format), but I don't understand where to get the right FASTA file. Which FASTA file should I use to run create_cisTarget_databases? On the GENCODE website (https://www.gencodegenes.org/mouse/release_M25.html) there are several files, and the "Genome sequence (GRCm38.p6)" file doesn't have gene annotation (it contains just the chromosome annotation). Could you provide the FASTA file you used, so that I can run it against the same reference?
Another question is about how to add this "direct motif" into the TF annotation file (in my case motifs-v9-nr.mgi-m0.001-o0.0.tbl).
Thank you in advance.
_Originally posted by @macsalvin in https://github.com/aertslab/create_cisTarget_databases/issues/3#issuecomment-815273690_