Open stanaka6 opened 2 years ago
A number of years ago we tried to make a database for zebrafish (assembly: GRCz10). At the time, the test ChIP-seq datasets we had (top 500 regions per ChIP-seq experiment) didn't work very well: those regions did not end up near the top of the rankings for the motifs of the TF for which the ChIP-seq was done. The new assembly might give better results (assuming that the old assembly had, e.g., wrongly assembled regions, so that the upstream regions of genes were not actually the correct upstream regions).
You might be able to get a FASTA file via Ensembl Biomart:
Do we need upstream sequences of ALL genes for scoring? If the single-cell data set that I want to analyze contains about 5000 genes, would it be okay if I use the genomic regions only for those 5000 genes? Does it affect the scoring?
Yes, you kind of need all genes for scoring (or it is at least better). If you only include 5000 genes, you will need to adjust the rank_threshold parameter in pySCENIC to be much less than 5000 (as you will need your datasets to be enriched in the top of the ranking).
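To illustrate why rank_threshold has to sit well below the number of genes in the database, here is a toy sketch. It is not pySCENIC's actual implementation; `recovered_fraction` and the gene names are made up for illustration. It measures which fraction of a gene set falls within the top `rank_threshold` positions of one motif's ranking.

```python
def recovered_fraction(ranking, gene_set, rank_threshold):
    """Fraction of gene_set found in the first rank_threshold genes of ranking."""
    top = set(ranking[:rank_threshold])
    return len(top & set(gene_set)) / len(gene_set)


# Toy ranking of 10 genes for one motif (best-scoring gene first).
ranking = [f"gene{i}" for i in range(10)]
enriched_set = ["gene0", "gene1", "gene2"]  # targets near the top of the ranking
random_set = ["gene2", "gene6", "gene9"]    # genes scattered over the ranking

for threshold in (3, 10):
    print(
        threshold,
        recovered_fraction(ranking, enriched_set, threshold),
        recovered_fraction(ranking, random_set, threshold),
    )
```

With a threshold of 3 the two gene sets are clearly separated. With a threshold equal to the number of ranked genes, both are recovered completely, so enrichment near the top can no longer be detected.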
Hi,
Thank you so much for your quick reply. I was able to get the sequences as a FASTA file. Then I ran create_cistarget_motif_databases.py. However, I got the following error:
Traceback (most recent call last):
File "/data/users/stanaka/create_cisTarget_databases/create_cistarget_motif_databases.py", line 504, in <module>
main()
File "/data/users/stanaka/create_cisTarget_databases/create_cistarget_motif_databases.py", line 289, in main
region_or_gene_ids = RegionOrGeneIDs.get_region_or_gene_ids_from_fasta(
File "/data/users/stanaka/create_cisTarget_databases/cistarget_db.py", line 150, in get_region_or_gene_ids_from_fasta
region_id = line[1:].split(maxsplit=1)[0]
IndexError: list index out of range
I guess headers in the fasta file or variable setting (genes="#[0-9]+$") cause this error, but I was not able to find a solution. I would be grateful if you could help me.
The code I executed is here:
fasta_filename="Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
motifs_dir="motifs_cb_format"
motifs_list_filename="motifs.lst"
db_prefix="zf1"
nbr_threads=10
genes="#[0-9]+$"
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
-f "${fasta_filename}" \
-M "${motifs_dir}" \
-m "${motifs_list_filename}" \
-o "${db_prefix}" \
-t "${nbr_threads}" \
-g "${genes}"
For that fasta file, there are several headers named the same gene, so I appended a unique number for duplicates/multiplicate gene names in the fasta using the following python code.
#!/usr/bin/env python
# Make FASTA headers unique by appending "#<number>" to repeated gene names.
from Bio import SeqIO

in_fasta = "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
out_fasta = "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

records = set()
with open(out_fasta, "w") as of:
    for record in SeqIO.parse(in_fasta, "fasta"):
        ID = record.id
        num = 1
        while ID in records:
            # "#" separator, so that the names match the "#[0-9]+$" pattern passed via -g.
            ID = "{}#{}".format(record.id, num)
            num += 1
        records.add(ID)
        record.id = ID
        record.name = ID
        record.description = ID
        SeqIO.write(record, of, "fasta")
So, the headers of my fasta file are for example (gene: mxb):
mxb mxb#1 mxb#2 mxb#3
In this case, should I add a number to all gene names, even the unique ones? If so, I am not sure how to do that.
Thank you very much,
At first glance it looks like you have lines which start with > without an identifier.
Can you post the output of the following commands:
# Output of:
file "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
file "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
# Output of:
grep -m 10 '^>' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
grep -m 10 "^>" "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
I think you also don't need to modify the FASTA file. I think the following will work for your FASTA file (once I see the output of the previous commands, it will be easier to tell):
genes="\|.+$"
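For reference, the IndexError in the traceback above is what the expression on the reported line, `line[1:].split(maxsplit=1)[0]`, produces when a header line consists of ">" alone. The snippet below only reproduces that one expression; `parse_region_id` is a made-up wrapper name, not a function from cistarget_db.py.

```python
def parse_region_id(line):
    # Same expression as in the traceback: drop ">" and take the first word.
    return line[1:].split(maxsplit=1)[0]


# A normal header yields the first whitespace-separated word after ">".
print(parse_region_id(">mxb#1 some description"))

# A header with nothing after ">" leaves an empty list, so [0] fails.
try:
    parse_region_id(">\n")
except IndexError as error:
    print("IndexError:", error)
```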
Thank you so much for your reply. Here are the outputs:
file "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa: ASCII text
file "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa: ASCII text
grep -m 10 '^>' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
>dap
>triob
>stat1b
>slc35a5
>ptpn4b
>ccdc80
>sema6e
>krt91
>slc9a3r1a
>mcm6l
grep -m 10 "^" "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
>dap
GTATAGGAAATGAAGCCCCATCTACTTTTACAGGAGCCATTCATTGATCACTGTAGACTG
TTGTTCTCCGTGTGGTCAACACACTGGAATCTCAGCAAATTTTGCTTCAAAGACCTAAGA
AATATATCCAAATAAAGCTTGAAACATGTGCTTTTAAACAAACTCACTACACTTGAAAAC
AAAACTTTTCTGATTTTGTAATCTGTATGAAATGAACATGAGATACAATCCGTCCTGTCA
AATGCACATCATGACAGCAAATACTCAAACAACCACAAACTTGTAAGCGAAGAGTCACTG
TCATTTCTGAAGGAGATTCACCATCAAGACCAAGTTGTGGATTACTTCAGAAGAGCAGCA
CAAAGCATTATCCAGAGGATGATAATGTGTAGAATGGCCAAAATTGAGGTCGGTGTTGTG
AGCGTGTTTCTTTCGAGTGTGCACATGCATTAGCTTAGGCAACAGTAAAATATGCAGATT
TTGTTTGATTCAGACTTTAGAAATGAAATATGAGACAGTTATTGTTAATTTTATTGGTGG
Do any of the following commands print any output (lines with only ">")?
grep '^>$' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
grep '^>$' "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
Does samtools complain when you try to make an index for the FASTA files?
samtools faidx "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
samtools faidx "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
Can you upload the FASTA files somewhere?
FASTA files are available from here.
grep '^>$' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
The output is lines with only ">" (135 lines).
grep '^>$' "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
The output is a line with only ">".
samtools faidx "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
# output; first 10 lines
[W::fai_insert_index] Ignoring duplicate sequence "pibf1" at byte offset 6774376
[W::fai_insert_index] Ignoring duplicate sequence "dmd" at byte offset 7490531
[W::fai_insert_index] Ignoring duplicate sequence "tcf7l2" at byte offset 9022025
[W::fai_insert_index] Ignoring duplicate sequence "mxb" at byte offset 9926736
[W::fai_insert_index] Ignoring duplicate sequence "mdh1aa" at byte offset 13173492
[W::fai_insert_index] Ignoring duplicate sequence "fgf13a" at byte offset 15539401
[W::fai_insert_index] Ignoring duplicate sequence "ctbp2a" at byte offset 15872447
[W::fai_insert_index] Ignoring duplicate sequence "cttn" at byte offset 16307386
[W::fai_insert_index] Ignoring duplicate sequence "tnikb" at byte offset 16835923
[W::fai_insert_index] Ignoring duplicate sequence "acana" at byte offset 19803551
samtools faidx "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
No complaints.
I really appreciate your help.
In the original FASTA file there are a lot of genes which have "Sequence unavailable":
# Number of "sequences"
$ grep -c '^>' updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa
57263
# Number of "sequences" without sequence.
$ grep -c '^Sequence unavailable' updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa
23955
# Number of sequences without sequence name.
$ grep -c '^>$' updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa
135
$ diff -u updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa | head -n 30
--- updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa 2021-08-06 10:48:11.649268597 +0200
+++ updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa 2021-08-06 11:00:29.057775376 +0200
@@ -169,8 +169,6 @@
AATGGGCAGGAAAAAGCCACGTTGACGAAGTCGACGGTAGTCTGCGTTATTTTTCTGCCA
GGAAACTGAAGGATTGTTTTTTTATTTAGAAGACAACGACGCGACTCTTTGGAAGAAGCG
CTTGACAGCAGCC
->triob
-Sequence unavailable
>stat1b
TGTTGCCCAGGGCCGGAGTGGGACTCCTTTTCAGCCCTGGAGTTTCAAGCCTCAGACCGG
CCCACCTCAGTTCACGACTGACTATATTAAAATAAGGTCATTTCCAATTCAGTTTCTAAT
@@ -693,8 +691,6 @@
ACAGTGTTCATTTATAGCTACAGGAAATTTTTAGGCGGAGTGCAGATGAATGTTTGTAAA
TGATGTTTGCCGTTGGATCACCTGGGCTAGTGTGTGTTTTACACTGCCTAATAATGTGTA
CTGCAGGTCAAATGACCTTTAGCCTCTGTGTGTTTTGCAGACGGAGTGGGGACCGTA
->ccdc80
-Sequence unavailable
>sema6e
CTTAACAGATCCCCAGTAATAATCATAAACCTAATCTTAACCCAATACTGTGCTTTTTTA
ATCTCTTGCTGCCTAAATGGAATATGATCCCAAACCTGCATTTAATCTTGATCCAGGGCC
@@ -2919,10 +2915,6 @@
ATAAAGTAGTCGCTATATATCTGCTCGCTCTCACAAATTTGAGAAAAGCGTCTTGTCGTG
GCAGGGGACGTTATACATGGTTATCCAAATCTTGGAATACGGACACTTATAGGTCCCCCA
TCACATTACGTTTCTCAGAA
->tnni1a
-Sequence unavailable
->slc8a1b
-Sequence unavailable
>gtf2e1
CCGATGCTGAAACAATATACTGTGCAGCCCTAGTGTCAAGTCTTAAAAGTCAGAGCAAAA
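The three grep counts above can also be collected in a single pass. The following is a minimal Python sketch of that check; `fasta_stats` is a hypothetical helper name and the example lines are invented, not taken from the real file.

```python
def fasta_stats(lines):
    """Count headers, nameless headers and 'Sequence unavailable' placeholders."""
    stats = {"headers": 0, "empty_names": 0, "unavailable": 0}
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            stats["headers"] += 1
            if line == ">":
                # Header without a sequence name.
                stats["empty_names"] += 1
        elif line.startswith("Sequence unavailable"):
            # Placeholder written instead of a real sequence.
            stats["unavailable"] += 1
    return stats


example = [">dap", "GTATAGG", ">triob", "Sequence unavailable", ">", "ACGT"]
print(fasta_stats(example))
```

To run it on a file, pass the open file handle instead of the list, since iterating over a file yields its lines.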
Remove the entries without a sequence and the entries without a sequence name:
awk '
    BEGIN {
        is_empty_sequence_name = 0;
    }
    {
        line = $0;
        if ( line ~ /^>$/ ) {
            is_empty_sequence_name = 1;
        } else if ( line ~ /^>/ ) {
            is_empty_sequence_name = 0;
            # Store sequence name line.
            sequence_name_line = line;
            # Get next line and store in sequence.
            getline sequence_line;
            # Print sequence name and first sequence line if it is a real sequence (else do not print anything).
            if ( sequence_line != "Sequence unavailable" ) {
                print sequence_name_line "\n" sequence_line;
            }
        } else if ( is_empty_sequence_name == 0 ) {
            # Print all other sequence lines if there was a sequence name for that sequence.
            print $0;
        }
    }' \
    updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa \
    > updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa
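For those who prefer Python over awk, the same cleanup can be sketched as follows. This is a record-based variant rather than a line-for-line port of the awk script, and `clean_fasta` is a hypothetical helper name: it groups lines into records first and then keeps only records that have both a name and a real sequence.

```python
def clean_fasta(lines):
    """Drop records with an empty name or without a real sequence."""
    records = []  # list of (header, [sequence lines])
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            records.append((line, []))
        elif records and line:
            records[-1][1].append(line)

    kept = []
    for header, seq_lines in records:
        has_name = len(header.strip()) > 1
        has_sequence = bool(seq_lines) and seq_lines[0] != "Sequence unavailable"
        if has_name and has_sequence:
            kept.append(header)
            kept.extend(seq_lines)
    return kept


example = [
    ">dap", "GTAT", "AAGG",
    ">triob", "Sequence unavailable",
    ">", "ACGT",
    ">stat1b", "TGTT",
]
print("\n".join(clean_fasta(example)))
```

Unlike the awk version, this variant also drops a named header that is followed directly by another header, since such a record has no sequence lines at all.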
With the new FASTA file, the following should work:
create_cistarget_motif_databases.py \
-f updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa \
-M . -m /etc/passwd -o test. -t 1 -g '#?$'
fasta_filename="updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa"
motifs_dir="motifs_cb_format"
motifs_list_filename="motifs.lst"
db_prefix="zf1"
nbr_threads=10
# Search for 0 or 1 "#" at the end of the sequence names (will always be 0) to force it to consider all sequence names as genes instead of regions.
extract_gene_id_from_region_id_replace="#?$"
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
-f "${fasta_filename}" \
-M "${motifs_dir}" \
-m "${motifs_list_filename}" \
-o "${db_prefix}" \
-t "${nbr_threads}" \
-g "${extract_gene_id_from_region_id_replace}"
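The effect of the two patterns used in this thread can be checked with a small sketch. It assumes that the -g pattern is a regular expression whose match is stripped from each sequence name to obtain the gene ID, as the variable name suggests; this has not been verified against the script's source, and `gene_id_from_region_id` is a made-up helper.

```python
import re


def gene_id_from_region_id(region_id, pattern):
    # Assumption: the part of the name matched by the pattern is removed.
    return re.sub(pattern, "", region_id)


for name in ("dap", "mxb#1"):
    print(
        name,
        gene_id_from_region_id(name, r"#[0-9]+$"),  # strips a numeric "#<n>" suffix
        gene_id_from_region_id(name, r"#?$"),       # matches at the end of any name
    )
```

Under that assumption, "#[0-9]+$" maps mxb#1 back to mxb, while "#?$" matches every name without changing names that have no trailing "#".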
Hi, I was able to run create_cistarget_motif_databases.py using the cleaned FASTA file you explained above. Thank you so much for your help. I checked one of the feather files in R.
db <- importRankings("Test/zf1.genes_vs_motifs.rankings.feather", indexCol = "motifs")
View(db@rankings)
Now another two questions are:
Thank you!
Hi,
Please ignore question 1 from my last message. I hadn't checked the entire data, and there is no conversion between lower and upper case. My bad, sorry.
Seeing that you have successfully built the zebrafish cisTarget databases, could you share them with me? I have not been able to build them successfully. @stanaka6
@yanpinlu @stanaka6 Hello! Have you succeeded in creating the zebrafish cisTarget database? I am still unable to create it. Could you share the file with me?
I'm sorry, I haven't made it yet.
------------------ Original message ------------------ From: "aertslab/create_cisTarget_databases" @.>; Sent: Monday, July 25, 2022, 8:44 AM @.>; @.**@.>; Subject: Re: [aertslab/create_cisTarget_databases] Create a zebrafish db: question about the best fasta file (#8)
Thanks! 1817061039@stmail.ntu.edu.cn
I'm so sorry. Can you send a message to my Gmail address? It is xiaozhongshen991001@gmail.com @yanpinlu
Hi,
I am having an issue as well. I got my FASTA file, but I am not sure how to prepare the motifs directory and the motifs list correctly.
"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}" \
    -g "${extract_gene_id_from_region_id_replace}"
This is my code.
I got this motif file https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_non-redundant_pfms_jaspar.txt
and I followed the gallus species issue and they said in python I can use
for jm in jms:
    motiffile = outdir.joinpath(str(jm.matrix_id + '.cb'))
    with open(motiffile, 'wt') as fh:
        fh.write(format(jm, "clusterbuster"))
and I got the motif directory and the list. But I am having trouble.
Error: None of 1956 motifs were scored successfully.
This is the error I get. I also have zebrafish genes. Can anyone tell me how to prepare the motifs directory and list file, or send me their code for it?
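As a starting point, here is a dependency-free sketch of turning a JASPAR-format file into a motifs directory plus a motifs list. Several things in it are assumptions rather than confirmed behaviour of the script: that each motif lives in its own `<motif_id>.cb` file, that the list file holds one motif ID per line, and that a Cluster-Buster matrix is a ">name" line followed by one row of A, C, G, T counts per position. The function names and the example motif ID are made up. It does not diagnose the "None of 1956 motifs were scored successfully" error by itself.

```python
from pathlib import Path
import re


def jaspar_to_cb(jaspar_text):
    """Return {motif_id: Cluster-Buster text} from JASPAR-format text."""
    motifs = {}
    motif_id, rows = None, {}
    for line in jaspar_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Header such as ">MA0000.0 NAME": keep only the matrix ID.
            motif_id, rows = line[1:].split()[0], {}
        elif line and motif_id is not None:
            base = line[0].upper()
            rows[base] = re.findall(r"\d+(?:\.\d+)?", line[1:])
            if len(rows) == 4:
                # Transpose: one line per motif position, columns A C G T.
                positions = zip(rows["A"], rows["C"], rows["G"], rows["T"])
                body = "\n".join("\t".join(p) for p in positions)
                motifs[motif_id] = f">{motif_id}\n{body}\n"
    return motifs


def write_motifs(motifs, motifs_dir, motifs_list_filename):
    """Write one <motif_id>.cb file per motif plus a list of motif IDs."""
    out = Path(motifs_dir)
    out.mkdir(parents=True, exist_ok=True)
    for motif_id, text in motifs.items():
        (out / f"{motif_id}.cb").write_text(text)
    Path(motifs_list_filename).write_text("\n".join(motifs) + "\n")


# Invented example matrix (not a real JASPAR entry).
example = """>MA0000.0 EXAMPLE
A [ 4 19 0 ]
C [ 16 0 20 ]
G [ 0 1 0 ]
T [ 0 0 0 ]
"""
print(jaspar_to_cb(example)["MA0000.0"])
```

Calling `write_motifs(jaspar_to_cb(text), "motifs_cb_format", "motifs.lst")` would then produce files under the names used for -M and -m earlier in this thread. It is worth comparing one generated .cb file and the list file against the layout described in the repository documentation before running the full scoring.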
You can refer to the method in the link https://github.com/aertslab/create_cisTarget_databases/issues/4
Dear @stanaka6 , Do you mind sharing the database you created for zebrafish using "create_cistarget_motif_databases.py"? I would be really grateful. Thanks!
@stanaka6 @Mesi395 @yanpinlu @xiaozhongshen @Sali120 Hi, did anyone end up creating the zebrafish database? Thanks a lot in advance.
Hi team,
Thank you for the nice explanation about database building. I want to create a database for zebrafish using "create_cistarget_motif_databases.py". I am struggling to get a zebrafish genome region fasta file containing up- and downstream sequences of all genes like "upstream5000.fa.gz" for chicken from UCSC mentioned in issue #4. I first downloaded a bed file from the EPD sequence extraction tool with settings -10000 and +10000, but all the intervals in the bed file are 1 bp. Here are the first 5 lines.
I found an explanation of why the intervals are 1 bp from an EPD organizer (https://groups.google.com/g/ask-epd/c/PsGI5z3Hh1w/m/qq0NUddwBQAJ). Using Ensembl Biomart, we can get the upstream sequences of specific genes as a FASTA file when we input a list of reference IDs. However, the list can contain a maximum of 500 genes. So my question is:
Thanks!
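If the 500-gene limit on the input list is the only obstacle, one workaround is to split the full gene list into batches and submit them one at a time. A minimal sketch of the batching step, with a hypothetical `batches` helper and placeholder gene IDs:

```python
def batches(items, size=500):
    """Yield consecutive slices of items with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 1201 placeholder IDs standing in for a real list of reference IDs.
gene_ids = [f"GENE{i:05d}" for i in range(1, 1202)]

for number, batch in enumerate(batches(gene_ids), start=1):
    print(number, len(batch), batch[0], batch[-1])
```

The FASTA files returned for each batch can then be concatenated into one file before any cleanup.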