create the mouse database with an additional motif

tropfenameimer commented 3 years ago

Hi @tropfenameimer, I need to create the mouse database with an additional motif. According to the procedure, running create_cisTarget_databases requires the motif collection and a FASTA file. I already have the motif collection including the new motif (in cb format), but I don't understand where to get the right FASTA file. Which FASTA file should I use to run create_cisTarget_databases? On the GENCODE website (https://www.gencodegenes.org/mouse/release_M25.html) there are several files, and the "Genome sequence (GRCm38.p6)" file doesn't have gene annotation (it contains just the chromosome annotation). Could you provide the FASTA file you used, so that I can run it against the same reference?

My other question is how to add this "direct motif" to the TF annotation file (in my case motifs-v9-nr.mgi-m0.001-o0.0.tbl).

Thank you in advance.

_Originally posted by @macsalvin in https://github.com/aertslab/create_cisTarget_databases/issues/3#issuecomment-815273690_

tropfenameimer commented 3 years ago

hi @macsalvin,

for the two mm10 databases, you can find bed files with the regions here: https://resources.aertslab.org/cistarget/regions/ (e.g. mm10-limited-upstream10000-tss-downstream10000-full-transcript.bed contains the regions used for mm10__refseq-r80__10kb_up_and_down_tss.mc9nr). you can make a fasta file from the bed file using 'bedtools getfasta'.
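A minimal sketch of the 'bedtools getfasta' step, wrapped in Python so the full command is explicit. The file names are placeholders, and the use of `-nameOnly` (present in recent bedtools releases) is an assumption about how the region IDs from the BED name column should end up in the FASTA headers; check `bedtools getfasta -h` for your installed version.

```python
import shlex
import subprocess


def build_getfasta_cmd(genome_fasta, regions_bed, out_fasta, name_only=True):
    """Build a `bedtools getfasta` command that extracts BED regions from a genome FASTA."""
    cmd = [
        "bedtools", "getfasta",
        "-fi", genome_fasta,   # genome sequence (e.g. an mm10 FASTA; placeholder name)
        "-bed", regions_bed,   # regions BED file downloaded from the link above
        "-fo", out_fasta,      # output FASTA with one record per region
    ]
    if name_only:
        # Assumption: use the BED name column as the FASTA header so region IDs are kept.
        cmd.append("-nameOnly")
    return cmd


def run_getfasta(cmd):
    """Run the command; requires bedtools to be installed and on PATH."""
    subprocess.run(cmd, check=True)


# Print the command instead of running it, so it can be inspected first.
cmd = build_getfasta_cmd("mm10.fa", "regions.bed", "regions.fa")
print(shlex.join(cmd))
```

Calling `run_getfasta(cmd)` would execute it; printing first makes it easy to confirm the paths before a long run.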

add the motif to the motif2tf table with its name in the 'motif_id' column, the associated TF in the 'gene_name' column, and set 'description' to "gene is directly annotated".
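A minimal sketch of appending such a row, assuming the motif2tf table is a tab-separated file with a header line. Only the three columns named above come from this thread; the reduced header, the motif ID `custom__my_motif`, and the gene `Myod1` are hypothetical, and the header is read from the table itself rather than hard-coded.

```python
import csv
import io


def add_direct_annotation(tbl_text, motif_id, gene_name, extra=None):
    """Return the table text with one extra row for a directly annotated motif."""
    header = tbl_text.splitlines()[0].split("\t")
    # Assumption: the motif ID column may be written as '#motif_id' in the header.
    id_col = "#motif_id" if "#motif_id" in header else "motif_id"
    row = {col: "" for col in header}   # columns not supplied stay empty
    row.update(extra or {})             # optional values for the other columns
    row[id_col] = motif_id
    row["gene_name"] = gene_name
    row["description"] = "gene is directly annotated"
    out = io.StringIO()
    out.write(tbl_text if tbl_text.endswith("\n") else tbl_text + "\n")
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow([row[col] for col in header])
    return out.getvalue()


# Hypothetical three-column table; the real file has more columns.
tbl = "#motif_id\tgene_name\tdescription\nexisting__motif\tSox2\tgene is directly annotated\n"
updated = add_direct_annotation(tbl, "custom__my_motif", "Myod1")
print(updated.splitlines()[-1])
```

For the real table it is probably safest to pass `extra` with the same values that an existing directly annotated row uses in the remaining columns, rather than leaving them empty.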

macsalvin commented 3 years ago

Thank you very much @tropfenameimer. I built the FASTA file following your instructions and it worked, but there are still some issues with integrating the original database and the one I created.

In the BED file, several regions refer to the same gene (are these isoforms?). As a result, the FASTA file and the final database contain repeated genes; see Fig.2 for an example with a single gene.
The ranks in the database I generated are much higher than in the original one. What could be the problem?
I tried to aggregate the repeated entries by calculating the average, but the final database contains fewer genes. Is this the right approach anyway? Could it be that some genes are excluded because I am running it for a single motif?
It would be better to have the whole collection of position weight matrices, add mine, and run create_cisTarget_databases. Do you have it available? Here is what I found, but it is not complete: https://resources.aertslab.org/papers/iregulon/motifColl-10k-all-public.tar.gz

Fig.1: This is part of the original database (mm10__refseq-r80__10kb_up_and_down_tss.mc9nr.feather):

Total genes: 24130

Fig.2 This is part of the result for my single motif run:

Total genes after aggregation: 24103

Thank you so much for your help.

tropfenameimer commented 3 years ago

hi @macsalvin,

the different regions referring to the same gene are upstream regions, UTRs and introns. the script create_cistarget_motif_databases.py scores all regions, and then retains only the top scoring region per gene. but you have to specify the option --genes "#[0-9]+$" to get a db with gene names (i.e. without the "#[0-9]+" appended). so naturally, the ranks in the db you created are higher, because it has one entry per region instead of one per gene and is therefore much larger.
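To illustrate the idea (a sketch only, not the script's actual code): strip the "#[0-9]+$" suffix from each region ID and keep the highest score per gene, rather than averaging. The region IDs and scores below are made up.

```python
import re

# Same pattern as the --genes option mentioned above: a trailing "#<number>".
GENE_SUFFIX = re.compile(r"#[0-9]+$")


def top_score_per_gene(region_scores):
    """Collapse region-level scores to one score per gene by keeping the maximum."""
    best = {}
    for region_id, score in region_scores.items():
        gene = GENE_SUFFIX.sub("", region_id)   # "Sox2#2" -> "Sox2"
        if gene not in best or score > best[gene]:
            best[gene] = score
    return best


# Hypothetical scores for one motif across several regions of two genes.
scores = {"Sox2#1": 2.5, "Sox2#2": 7.1, "Sox2#3": 0.4, "Gata1#1": 3.3}
print(top_score_per_gene(scores))
```

Ranking the collapsed scores then gives one rank per gene, which keeps the rank range comparable to a gene-based database.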

if you are worried that you might get different scores, you could make a database with a few test motifs that are also in our db (e.g. a few jaspar motifs), and compare the rankings. they might not be 100% identical though, because genes with the same or no scores could be switched.
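One simple way to do such a comparison is the overlap of the top-n genes for a shared motif. This is a sketch under assumptions: each ranking is a plain gene-to-rank mapping already loaded from the two databases, lower rank values mean better, and the gene names and ranks are invented.

```python
def top_n_overlap(ranks_a, ranks_b, n):
    """Fraction of the top-n genes in ranks_a that are also in the top n of ranks_b.

    Only genes present in both rankings are considered; lower rank = better.
    """
    shared = set(ranks_a) & set(ranks_b)
    # Break rank ties by gene name so the result is deterministic.
    top_a = set(sorted(shared, key=lambda g: (ranks_a[g], g))[:n])
    top_b = set(sorted(shared, key=lambda g: (ranks_b[g], g))[:n])
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


# Hypothetical rankings of the same motif in the reference and the rebuilt database.
reference = {"A": 0, "B": 1, "C": 2, "D": 3}
rebuilt = {"A": 1, "B": 0, "C": 3, "D": 2}
print(top_n_overlap(reference, rebuilt, 2))
```

A high overlap with small local swaps would be consistent with the tie effect described above, while a low overlap would point to a real difference in regions or scoring.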

unfortunately, we can't publish the full motif collection because it contains non-public motifs like the transfac pro collection. you can find an overview of the collections we use here: http://iregulon.aertslab.org/collections.html#motifcolldownload

macsalvin commented 3 years ago

Hi @tropfenameimer, I was able to merge the original mouse motif database (feather file) with the single-motif database that I created. Then, as you suggested, I added the motif to the motif2tf table with its name in the 'motif_id' column, the associated TF in the 'gene_name' column, and 'description' set to "gene is directly annotated". After running "pyscenic ctx" I expected to get more target genes for the specific TF, but the results are the same as when I run it with the regular mouse database. I also tried lowering the NES threshold (in the "pyscenic ctx" step), and I do get more target genes, but the result is still the same as with the regular database. What step am I missing? Do you have any suggestions?

Thank you for your help

tropfenameimer commented 3 years ago

hi @macsalvin,

have you checked whether your motif is enriched? check in the output .csv table (option --output).
playing with the NES threshold is a good idea, but this will mainly influence the number of motifs in the output. if your motif is among the enriched ones and you want more predicted target genes, you need to increase the rank threshold (option --rank_threshold). this threshold defaults to 1500, i.e. the top 1500 genes or regions in the motif ranking are considered when creating the output target gene set, which is not much when you have a database with many genes / regions. please try increasing it to e.g. 5000.
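A simplified sketch of why the rank threshold matters; this is not pySCENIC's actual target selection, which is more involved. It only shows that a gene can become a candidate target only if its rank for the motif falls within the threshold. The gene names and ranks are hypothetical, and ranks are assumed to start at 0.

```python
def candidate_targets(motif_ranking, module_genes, rank_threshold):
    """Module genes whose rank for the motif is within the threshold, best first."""
    return sorted(
        (g for g in module_genes
         if g in motif_ranking and motif_ranking[g] < rank_threshold),
        key=lambda g: motif_ranking[g],
    )


# Hypothetical ranks of four module genes for one motif.
ranking = {"Gata1": 10, "Klf1": 1200, "Tal1": 3400, "Hbb-b1": 9000}
module = ["Gata1", "Klf1", "Tal1", "Hbb-b1"]

print(candidate_targets(ranking, module, 1500))
print(candidate_targets(ranking, module, 5000))
```

With the lower threshold only the two best-ranked genes qualify; raising it to 5000 admits a third, which is the effect to expect from a larger --rank_threshold.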

macsalvin commented 3 years ago

Hi @tropfenameimer, yes, I checked the .csv file given by --output (the output of the ctx step) and the motif is not in it, which means it is not enriched. What are the main criteria for a motif to be enriched? I expected a better result because the gene is directly annotated and the ranks for all the genes related to this motif are generally higher. Any other suggestions would be great.

Thank you in advance

aertslab / create_cisTarget_databases

create the mouse database with an additional motif #5