aertslab / create_cisTarget_databases

Create cisTarget databases
37 stars 8 forks source link

build the Gallus gallus cisTarget database issue #4

Open honghh2018 opened 3 years ago

honghh2018 commented 3 years ago

Hi @ALL, i wish to build the Gallus gallus on my own, but i can really understand the input file, like below picture: image

where was the Gallus gallus motif file can be download ?
 I had check the http://jaspar.genereg.net/downloads/ web, but it was just the Vertebrates and so on but the  Gallus gallus.
 Besides, i can not understand where the lambert2018.txt come from ?
  Finally, the  fasta_filename="genome.fa" mean the genome.fasta file ?
  # i had try the command like below, but  it trigger error:
  create_cistarget_databases_dir="/share/nas1/Data/SoftWarePackage/tools/create_cisTarget_databases/create_cisTarget_databases"

fasta_filename="genome.fa" motifs_dir="motif" motifs_list_filename="lambert2018.txt" db_prefix="test"

nbr_threads=10 "${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \ -f "${fasta_filename}" \ -M "${motifs_dir}" \ -m "${motifs_list_filename}" \ -o "${db_prefix}" \ -t "${nbr_threads}" the error showing below:

Error: Cluster-Buster motif filename "/share/nas1/Data/SoftWarePackage/tools/create_cisTarget_databases/create_cisTarget_databases/motif/TFAP2A.cb" does not exist for motif TFAP2A. the motif directory including the motif.txt format like below: image

the motifs_list_filename file format like below:

image

the fasta file format like below: image

for this condition,

if i want to build the chicken cisTarget database, which file i should prepare ? and how can i re-run this script and make it successfully Any advice would be appreciated hanhuihong

honghh2018 commented 3 years ago

can anyone help?

tropfenameimer commented 3 years ago

hi @honghh2018, i see that we didn't explain clearly the idea behind the ranking databases.

say you wish to create a gene-based database. this is a motif x gene matrix, where for each motif, all genes are ranked according to a score that represents the presence of this motif in the regions associated with the gene.

first, you need for each gene one or several genomic regions to score. the easiest approach, especially for species with few genomic resources, is to take regions that are associated with a gene by proximity. you can for example go for the space upstream of the TSS, with or without introns, exons, UTRs or a downstream region. which delineation works best depends on your species of interest and will have to be tested. do you have a species with a rather compact genome and small intergenic space, set the upstream seach space to a few hundred to a few thousand kb. for other species several thousand kb might work. e.g. for drosophila, the default in i-cisTarget is 5kb upstream plus the full transcript, for human and mouse 10kb up- and 10kb downstream of the TSS. for chicken, you could download genomic regions 5kb upstream of all genes from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/ (file upstream5000.fa.gz) or custom regions from the eukaryotic promoter database: https://epd.epfl.ch/G_gallus/G_gallus_database.php?db=G_gallus -> EPD selection tool -> choose G. gallus -> select -> set 'Sequence Extraction Tool' e.g. to -10000 and +10000 -> download by clicking on 'BED file' (make fasta file from bed file with 'bedtools getfasta')

note that the gene names in the fasta file will end up in the ranking db, and you'll have to use the same names in the input to cistarget or SCENIC. there can also be several regions per gene (e.g. when you have upstream regions + introns). in that case, the fasta has to contain numbered entries like this:

>HCLS1#1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaa
>HCLS1#2
GCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTT
>HCLS1#3
TCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA#1
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAA
>ARSA#2
ATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAA
ACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC

'create_cistarget_motif_databases.py' will then keep the maximum score per gene.

second, you need a set of motifs. they do not necessarily need to be chicken-specific. TF binding specificities have been shown to be conserved across species. so using the e.g. the jaspar vertebrate collection is a good start for chicken. you can check the collections we use here: http://iregulon.aertslab.org/collections.html#motifcolldownload

the motifs need to be in clusterbuster format and saved as individual files in the ${motifs_dir} (cbust format: see requirements here https://orca.bu.edu/page/ClusterBuster_download). you can convert between different motif formats with the Bio.motifs module from BioPython. here is an example for jaspar motifs:

wget http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt 

and then in python:

>>> from Bio import motifs
>>> import os
>>> from pathlib import Path

>>> with open("JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt", 'rt') as fh:
        jms = motifs.parse(fh, "JASPAR")
>>> print(jms[0].matrix_id)
MA0006.1
>>> print(jms[0].name)
Ahr::Arnt
>>> print(jms[0].counts)
        0      1      2      3      4      5
A:   3.00   0.00   0.00   0.00   0.00   0.00
C:   8.00   0.00  23.00   0.00   0.00   0.00
G:   2.00  23.00   0.00  23.00   0.00  24.00
T:  11.00   1.00   1.00   1.00  24.00   0.00

>>> print(format(jms[0], "clusterbuster"))
>Ahr::Arnt
3       8       2       11
0       0       23      1
0       23      0       1
0       0       23      1
0       0       0       24
0       0       24      0

>>> outdir = Path('motifs_cb_format')
>>> os.mkdir(outdir)

>>> for jm in jms:
        motiffile = outdir.joinpath(str(jm.matrix_id + '.cb'))
        with open(motiffile, 'wt') as fh:
                fh.write(format(jm, "clusterbuster"))

-> this will save all motifs individually in a folder 'motifs_cb_format' with the name '[matrix_id].cb' you'll also need a file with motifs you want to score (so you can take a subset of the motifs you have):

ls motifs_cb_format/ | sed 's/\.cb//g' >motifs.lst 

if you want to use the motif2tf database, that links motifs to TFs, you can create a custom db for your species. download the motif2TF file for human, mouse or fly here: https://resources.aertslab.org/cistarget/motif2tf/ and replace the human, mouse or fly gene symbols by homologous genes from your species. note that the motif names in this file also have to match the names of the motifs in your ranking db.

hope that helps.

honghh2018 commented 3 years ago

Thanks the good advices, i had follow your step to try to build the Gallus gallus cisTarget Database, until now, we had not found the error as well, and then i would continue to go further. if without a hitch, the gallus database would be create.

zealotma commented 3 years ago

hi @honghh2018, i see that we didn't explain clearly the idea behind the ranking databases.

say you wish to create a gene-based database. this is a motif x gene matrix, where for each motif, all genes are ranked according to a score that represents the presence of this motif in the regions associated with the gene.

first, you need for each gene one or several genomic regions to score. the easiest approach, especially for species with few genomic resources, is to take regions that are associated with a gene by proximity. you can for example go for the space upstream of the TSS, with or without introns, exons, UTRs or a downstream region. which delineation works best depends on your species of interest and will have to be tested. do you have a species with a rather compact genome and small intergenic space, set the upstream seach space to a few hundred to a few thousand kb. for other species several thousand kb might work. e.g. for drosophila, the default in i-cisTarget is 5kb upstream plus the full transcript, for human and mouse 10kb up- and 10kb downstream of the TSS. for chicken, you could download genomic regions 5kb upstream of all genes from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/ (file upstream5000.fa.gz) or custom regions from the eukaryotic promoter database: https://epd.epfl.ch/G_gallus/G_gallus_database.php?db=G_gallus -> EPD selection tool -> choose G. gallus -> select -> set 'Sequence Extraction Tool' e.g. to -10000 and +10000 -> download by clicking on 'BED file' (make fasta file from bed file with 'bedtools getfasta')

note that the gene names in the fasta file will end up in the ranking db, and you'll have to use the same names in the input to cistarget or SCENIC. there can also be several regions per gene (e.g. when you have upstream regions + introns). in that case, the fasta has to contain numbered entries like this:

>HCLS1#1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaa
>HCLS1#2
GCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTT
>HCLS1#3
TCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA#1
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAA
>ARSA#2
ATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAA
ACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC

'create_cistarget_motif_databases.py' will then keep the maximum score per gene.

second, you need a set of motifs. they do not necessarily need to be chicken-specific. TF binding specificities have been shown to be conserved across species. so using the e.g. the jaspar vertebrate collection is a good start for chicken. you can check the collections we use here: http://iregulon.aertslab.org/collections.html#motifcolldownload

the motifs need to be in clusterbuster format and saved as individual files in the ${motifs_dir} (cbust format: see requirements here https://orca.bu.edu/page/ClusterBuster_download). you can convert between different motif formats with the Bio.motifs module from BioPython. here is an example for jaspar motifs:

wget http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt 

and then in python:

>>> from Bio import motifs
>>> import os
>>> from pathlib import Path

>>> with open("JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt", 'rt') as fh:
        jms = motifs.parse(fh, "JASPAR")
>>> print(jms[0].matrix_id)
MA0006.1
>>> print(jms[0].name)
Ahr::Arnt
>>> print(jms[0].counts)
        0      1      2      3      4      5
A:   3.00   0.00   0.00   0.00   0.00   0.00
C:   8.00   0.00  23.00   0.00   0.00   0.00
G:   2.00  23.00   0.00  23.00   0.00  24.00
T:  11.00   1.00   1.00   1.00  24.00   0.00

>>> print(format(jms[0], "clusterbuster"))
>Ahr::Arnt
3       8       2       11
0       0       23      1
0       23      0       1
0       0       23      1
0       0       0       24
0       0       24      0

>>> outdir = Path('motifs_cb_format')
>>> os.mkdir(outdir)

>>> for jm in jms:
        motiffile = outdir.joinpath(str(jm.matrix_id + '.cb'))
        with open(motiffile, 'wt') as fh:
                fh.write(format(jm, "clusterbuster"))

-> this will save all motifs individually in a folder 'motifs_cb_format' with the name '[matrix_id].cb' you'll also need a file with motifs you want to score (so you can take a subset of the motifs you have):

ls motifs_cb_format/ | sed 's/\.cb//g' >motifs.lst 

if you want to use the motif2tf database, that links motifs to TFs, you can create a custom db for your species. download the motif2TF file for human, mouse or fly here: https://resources.aertslab.org/cistarget/motif2tf/ and replace the human, mouse or fly gene symbols by homologous genes from your species. note that the motif names in this file also have to match the names of the motifs in your ranking db.

hope that helps.

Thanks for your detailed explanation! It is very helpful. Could you also pls guid me if I want to create a track-based database? I have the region fasta file and 100+ chip-seq data in bigwig format. However, I've no idea how to organize and where should I input these data under create_cistarget_motif_databases.py. Many thanks!

BioAIEvolu commented 3 years ago

Hi, @tropfenameimer According to your very helpful suggestion above, I have tried success partially to build the Gallus gallus cisTarget Database( .cross_species.motifs_vs_regions.rankings.feather & .cross_species.regions_vs_motifs.rankings.feather ). I use upstream5000.fa.gz data in running create_cistarget_motif_databases.py

09:50 luohb@Master:/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet3/fasta 
$grep ">" gene.fa |wc -l
14344
09:50 luohb@Master:/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet3/fasta 
$grep ">" gene.fa |head
>GOLGB1#1
>HCLS1#1
>OR9Q1#1
>NME6#1
>RABL2B#1
>ACR#1
>ARSA#1
>SHANK3#1
>gga-mir-6708#1
>MIR1604#1

Then, running follow steps:

create_cistarget_databases_dir='/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet/create_cisTarget_databases'
fasta_filename="/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet3/fasta/gene.fa"
motifs_dir="/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet/create_cisTarget_databases/motifs_cb_format/"
motifs_list_filename="/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet/create_cisTarget_databases/motifs.lst"
db_prefix="test"
original_species_fasta_filename='/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet3/fasta/gene.fa'
nbr_threads=60

python ${create_cistarget_databases_dir}/create_cistarget_motif_databases.py \
    -f "${fasta_filename}" \
    -F "${original_species_fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}"

create_cistarget_databases_dir="/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet3/create_cisTarget_databases"

cp test.motifs_vs_regions.rankings.feather test.test.motifs_vs_regions.rankings.feather

# cisTarget database prefix which matches the common part of all cisTarget rankings databases (or just the directory
# that contains them).
db_prefix="test"

# Output directory.
output_dir="/share/disk1/Data/Users/luohb/temp/honghh/test/NicheNet3/databases/"

python ${create_cistarget_databases_dir}/create_cross_species_motifs_rankings_db.py \
    -i "${db_prefix}" \
    -o "${output_dir}"

And finally get the result feather file:

-rw-rw-r-- 1 luohb luohb  21M Jun  1 15:46 test.cross_species.motifs_vs_regions.rankings.feather
-rw-rw-r-- 1 luohb luohb  22M Jun  1 15:44 test.cross_species.regions_vs_motifs.rankings.feather
-rw-rw-r-- 1 luohb luohb  21M Jun  1 15:48 test.motifs_vs_regions.rankings.feather
-rw-rw-r-- 1 luohb luohb  42M Jun  1 15:48 test.motifs_vs_regions.scores.feather
-rw-rw-r-- 1 luohb luohb  22M Jun  1 15:48 test.regions_vs_motifs.rankings.feather
-rw-rw-r-- 1 luohb luohb  42M Jun  1 15:48 test.regions_vs_motifs.scores.feather
-rw-r--r-- 1 luohb luohb  21M Jun  1 15:46 test.test.motifs_vs_regions.rankings.feath

Are these feather files are too small ?Maybe I miss something, but I can't figure it out. Hope for your reply^_^

ghuls commented 3 years ago

The size of the feather files depends on the number of motifs used multiplied by the number of genes/regions used multiplied by 2 or 4 bytes (depending if number of genes/regions < 2^15 or not).

2100000 bytes / 14344 genes / 2 bytes per ranking = 732 motifs
honghh2018 commented 3 years ago

My issue resemble @BioAIEvolu, the size of .feather file was so not big as what i expect like human or mouse's . So whether were right ? the output files of chicken cisTarget file merely with 22 Mb was so samller than human file in 50 times or so. If it were uncorrection, i would be appreciated hear the kindly advices from you. Best, hanhuihong

ghuls commented 3 years ago

In human and mouse we have 27k and 24k genes respectively and 24k motifs:

hg38__refseq-r80__10kb_up_and_down_tss.mc9nr.feather: 1327758248 bytes

27165 genes * 24324 motifs * 2 bytes = 1321522920 bytes (for the rankings) + some for gene and motif names.
ctlshcxy commented 2 years ago

hi @honghh2018, i see that we didn't explain clearly the idea behind the ranking databases.

say you wish to create a gene-based database. this is a motif x gene matrix, where for each motif, all genes are ranked according to a score that represents the presence of this motif in the regions associated with the gene.

first, you need for each gene one or several genomic regions to score. the easiest approach, especially for species with few genomic resources, is to take regions that are associated with a gene by proximity. you can for example go for the space upstream of the TSS, with or without introns, exons, UTRs or a downstream region. which delineation works best depends on your species of interest and will have to be tested. do you have a species with a rather compact genome and small intergenic space, set the upstream seach space to a few hundred to a few thousand kb. for other species several thousand kb might work. e.g. for drosophila, the default in i-cisTarget is 5kb upstream plus the full transcript, for human and mouse 10kb up- and 10kb downstream of the TSS. for chicken, you could download genomic regions 5kb upstream of all genes from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/ (file upstream5000.fa.gz) or custom regions from the eukaryotic promoter database: https://epd.epfl.ch/G_gallus/G_gallus_database.php?db=G_gallus -> EPD selection tool -> choose G. gallus -> select -> set 'Sequence Extraction Tool' e.g. to -10000 and +10000 -> download by clicking on 'BED file' (make fasta file from bed file with 'bedtools getfasta')

note that the gene names in the fasta file will end up in the ranking db, and you'll have to use the same names in the input to cistarget or SCENIC. there can also be several regions per gene (e.g. when you have upstream regions + introns). in that case, the fasta has to contain numbered entries like this:

>HCLS1#1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaa
>HCLS1#2
GCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTT
>HCLS1#3
TCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA#1
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAA
>ARSA#2
ATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAA
ACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC

'create_cistarget_motif_databases.py' will then keep the maximum score per gene.

second, you need a set of motifs. they do not necessarily need to be chicken-specific. TF binding specificities have been shown to be conserved across species. so using the e.g. the jaspar vertebrate collection is a good start for chicken. you can check the collections we use here: http://iregulon.aertslab.org/collections.html#motifcolldownload

the motifs need to be in clusterbuster format and saved as individual files in the ${motifs_dir} (cbust format: see requirements here https://orca.bu.edu/page/ClusterBuster_download). you can convert between different motif formats with the Bio.motifs module from BioPython. here is an example for jaspar motifs:

wget http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt 

and then in python:

>>> from Bio import motifs
>>> import os
>>> from pathlib import Path

>>> with open("JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt", 'rt') as fh:
        jms = motifs.parse(fh, "JASPAR")
>>> print(jms[0].matrix_id)
MA0006.1
>>> print(jms[0].name)
Ahr::Arnt
>>> print(jms[0].counts)
        0      1      2      3      4      5
A:   3.00   0.00   0.00   0.00   0.00   0.00
C:   8.00   0.00  23.00   0.00   0.00   0.00
G:   2.00  23.00   0.00  23.00   0.00  24.00
T:  11.00   1.00   1.00   1.00  24.00   0.00

>>> print(format(jms[0], "clusterbuster"))
>Ahr::Arnt
3       8       2       11
0       0       23      1
0       23      0       1
0       0       23      1
0       0       0       24
0       0       24      0

>>> outdir = Path('motifs_cb_format')
>>> os.mkdir(outdir)

>>> for jm in jms:
        motiffile = outdir.joinpath(str(jm.matrix_id + '.cb'))
        with open(motiffile, 'wt') as fh:
                fh.write(format(jm, "clusterbuster"))

-> this will save all motifs individually in a folder 'motifs_cb_format' with the name '[matrix_id].cb' you'll also need a file with motifs you want to score (so you can take a subset of the motifs you have):

ls motifs_cb_format/ | sed 's/\.cb//g' >motifs.lst 

if you want to use the motif2tf database, that links motifs to TFs, you can create a custom db for your species. download the motif2TF file for human, mouse or fly here: https://resources.aertslab.org/cistarget/motif2tf/ and replace the human, mouse or fly gene symbols by homologous genes from your species. note that the motif names in this file also have to match the names of the motifs in your ranking db.

hope that helps.

hi @honghh2018, i see that we didn't explain clearly the idea behind the ranking databases.

say you wish to create a gene-based database. this is a motif x gene matrix, where for each motif, all genes are ranked according to a score that represents the presence of this motif in the regions associated with the gene.

first, you need for each gene one or several genomic regions to score. the easiest approach, especially for species with few genomic resources, is to take regions that are associated with a gene by proximity. you can for example go for the space upstream of the TSS, with or without introns, exons, UTRs or a downstream region. which delineation works best depends on your species of interest and will have to be tested. do you have a species with a rather compact genome and small intergenic space, set the upstream seach space to a few hundred to a few thousand kb. for other species several thousand kb might work. e.g. for drosophila, the default in i-cisTarget is 5kb upstream plus the full transcript, for human and mouse 10kb up- and 10kb downstream of the TSS. for chicken, you could download genomic regions 5kb upstream of all genes from UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/galGal6/bigZips/ (file upstream5000.fa.gz) or custom regions from the eukaryotic promoter database: https://epd.epfl.ch/G_gallus/G_gallus_database.php?db=G_gallus -> EPD selection tool -> choose G. gallus -> select -> set 'Sequence Extraction Tool' e.g. to -10000 and +10000 -> download by clicking on 'BED file' (make fasta file from bed file with 'bedtools getfasta')

note that the gene names in the fasta file will end up in the ranking db, and you'll have to use the same names in the input to cistarget or SCENIC. there can also be several regions per gene (e.g. when you have upstream regions + introns). in that case, the fasta has to contain numbered entries like this:

>HCLS1#1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaa
>HCLS1#2
GCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTT
>HCLS1#3
TCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA#1
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAA
>ARSA#2
ATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAA
ACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC

'create_cistarget_motif_databases.py' will then keep the maximum score per gene.

second, you need a set of motifs. they do not necessarily need to be chicken-specific. TF binding specificities have been shown to be conserved across species. so using the e.g. the jaspar vertebrate collection is a good start for chicken. you can check the collections we use here: http://iregulon.aertslab.org/collections.html#motifcolldownload

the motifs need to be in clusterbuster format and saved as individual files in the ${motifs_dir} (cbust format: see requirements here https://orca.bu.edu/page/ClusterBuster_download). you can convert between different motif formats with the Bio.motifs module from BioPython. here is an example for jaspar motifs:

wget http://jaspar.genereg.net/download/CORE/JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt 

and then in python:

>>> from Bio import motifs
>>> import os
>>> from pathlib import Path

>>> with open("JASPAR2020_CORE_vertebrates_non-redundant_pfms_jaspar.txt", 'rt') as fh:
        jms = motifs.parse(fh, "JASPAR")
>>> print(jms[0].matrix_id)
MA0006.1
>>> print(jms[0].name)
Ahr::Arnt
>>> print(jms[0].counts)
        0      1      2      3      4      5
A:   3.00   0.00   0.00   0.00   0.00   0.00
C:   8.00   0.00  23.00   0.00   0.00   0.00
G:   2.00  23.00   0.00  23.00   0.00  24.00
T:  11.00   1.00   1.00   1.00  24.00   0.00

>>> print(format(jms[0], "clusterbuster"))
>Ahr::Arnt
3       8       2       11
0       0       23      1
0       23      0       1
0       0       23      1
0       0       0       24
0       0       24      0

>>> outdir = Path('motifs_cb_format')
>>> os.mkdir(outdir)

>>> for jm in jms:
        motiffile = outdir.joinpath(str(jm.matrix_id + '.cb'))
        with open(motiffile, 'wt') as fh:
                fh.write(format(jm, "clusterbuster"))

-> this will save all motifs individually in a folder 'motifs_cb_format' with the name '[matrix_id].cb' you'll also need a file with motifs you want to score (so you can take a subset of the motifs you have):

ls motifs_cb_format/ | sed 's/\.cb//g' >motifs.lst 

if you want to use the motif2tf database, that links motifs to TFs, you can create a custom db for your species. download the motif2TF file for human, mouse or fly here: https://resources.aertslab.org/cistarget/motif2tf/ and replace the human, mouse or fly gene symbols by homologous genes from your species. note that the motif names in this file also have to match the names of the motifs in your ranking db.

hope that helps.

Hi, I'm new in this field, may I ask a simple question? could you please tell me how to replace the human, mouse or fly gene symbols in downloaded motif2TF file by homologous genes from my own species? it will be very appreciated if you can make an example for that, or any links or tutorials! looking forward to your reply!

austinkunch commented 9 months ago

@BioAIEvolu @zealotma @honghh2018 Hi, Could any of you provide me with the Chicken cisTargetDB that you have created with this program? I am working on a publication and can include you as an author if you can provide it to me Thank you, akunch@rockets.utoledo.edu