Create a zebrafish db: question about the best fasta file

stanaka6 commented 2 years ago

Hi team,

Thank you for the nice explanation about database building. I want to create a database for zebrafish using "create_cistarget_motif_databases.py". I am struggling to get a zebrafish genome region fasta file containing up- and downstream sequences of all genes like "upstream5000.fa.gz" for chicken from UCSC mentioned in issue #4. I first downloaded a bed file from the EPD sequence extraction tool with settings -10000 and +10000, but all the intervals in the bed file are 1 bp. Here are the first 5 lines.

chr1    11963   11964   cep97_1 1   +
chr1    27790   27791   eed_1   1   +
chr1    36733   36734   C1H11orf73_1    1   +
chr1    44582   44583   tmem39a_1   1   -
chr1    49699   49700   ildr1a_2    1   -

I found an explanation why the intervals are 1 bp from an EPD organizer (https://groups.google.com/g/ask-epd/c/PsGI5z3Hh1w/m/qq0NUddwBQAJ). Using Ensembl Biomart, we can get the upstream sequence of specific genes as a fasta file when we input a list of reference IDs. However, the list can contain max of 500 genes. So my question is:

Do you have any suggestion or idea that we can get such a fasta file for zebrafish?
Do we need upstream sequences of ALL genes for scoring? If my single-cell data set that I want to do analysis contains about 5000 genes, would it be okay if I use the genomic regions only for those 5000 genes? Does it affect the scoring?

Thanks!

ghuls commented 2 years ago

A number of years ago we tried to make a database for Zebrafish (assembly: GRCz10). At the time the test (ChIP-seq) datasets (top 500 regions from ChIP-seq) we had didn't work very well when checking if those regions were also in the top of the rankings for motifs for that TF for which the ChIP-seq was done. The new assembly might give better results (assuming that the old assembly had e.g. wrongly assembled regions where e.g. the upstream regions of genes were actually to the correct upstream regions.

You might be able to get a FASTA file via Ensembl Biomart:

Selecting protein coding genes only
Sequences
5' UTR + 5000bp upstream + 5000 downstream

Do we need upstream sequences of ALL genes for scoring? If my single-cell data set that I want to do analysis contains about 5000 genes, would it be okay if I use the genomic regions only for those 5000 genes? Does it affect the scoring? Yes, you kind of need all genes for scoring (or it is at least better). If you only include 5000 genes, you will need to adjust the rank_threshold parameter in pySCENIC to be much less than 5000 (as you will need your datasets to be enriched in the top of the ranking.

stanaka6 commented 2 years ago

Hi, Thank you so much for your quick reply. I was able to get the sequences as a fast file. Then I ran create_cistarget_motif_databases.py. However, I got the following error:

Traceback (most recent call last):
  File "/data/users/stanaka/create_cisTarget_databases/create_cistarget_motif_databases.py", line 504, in <module>
    main()
  File "/data/users/stanaka/create_cisTarget_databases/create_cistarget_motif_databases.py", line 289, in main
    region_or_gene_ids = RegionOrGeneIDs.get_region_or_gene_ids_from_fasta(
  File "/data/users/stanaka/create_cisTarget_databases/cistarget_db.py", line 150, in get_region_or_gene_ids_from_fasta
    region_id = line[1:].split(maxsplit=1)[0]
IndexError: list index out of range

I guess headers in the fasta file or variable setting (genes="#[0-9]+$") cause this error, but I was not able to find a solution. I would be grateful if you could help me.

The code I executed is here:

fasta_filename="Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
motifs_dir="motifs_cb_format"
motifs_list_filename="motifs.lst"
db_prefix="zf1"
nbr_threads=10
genes="#[0-9]+$"

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}" \
    -g "${genes}"

For that fasta file, there are several headers named the same gene, so I appended a unique number for duplicates/multiplicate gene names in the fasta using the following python code.

#!/usr/bin/env python

from Bio import SeqIO

records = set()
of = open("Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa", "w")
for record in SeqIO.parse("updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa", "fasta"):
    ID = record.id
    num = 1
    while ID in records:
        ID = "{}_{}".format(record.id, num)
        num += 1
    records.add(ID)
    record.id = ID
    record.name = ID
    record.description = ID
    SeqIO.write(record, of, "fasta")
of.close()

So, the headers of my fasta file are for example (gene: mxb):

mxb mxb#1 mxb#2 mxb#3

In this case, should I add the number to all genes even if it's unique? Even so, I am not sure how I can do so.

Thank you very much,

ghuls commented 2 years ago

At first glans it looks like your have lines which start with > without identifier.

Can you post the output of the following commands:

# Output of:
file "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
file "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

# Output of:
grep -m 10 '^>' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

grep -m 10 "^>" "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

I think you also don't need to modify the FASTA file. I think the following will work for you fasta (when I see the output from the previous commands, it will be easier to tell):

genes="\|.+$"

stanaka6 commented 2 years ago

Thank you so much for your reply. Here are the outputs:

file "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa: ASCII text

file "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa: ASCII text

grep -m 10 '^>' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
>dap
>triob
>stat1b
>slc35a5
>ptpn4b
>ccdc80
>sema6e
>krt91
>slc9a3r1a
>mcm6l

grep -m 10 "^" "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
>dap
GTATAGGAAATGAAGCCCCATCTACTTTTACAGGAGCCATTCATTGATCACTGTAGACTG
TTGTTCTCCGTGTGGTCAACACACTGGAATCTCAGCAAATTTTGCTTCAAAGACCTAAGA
AATATATCCAAATAAAGCTTGAAACATGTGCTTTTAAACAAACTCACTACACTTGAAAAC
AAAACTTTTCTGATTTTGTAATCTGTATGAAATGAACATGAGATACAATCCGTCCTGTCA
AATGCACATCATGACAGCAAATACTCAAACAACCACAAACTTGTAAGCGAAGAGTCACTG
TCATTTCTGAAGGAGATTCACCATCAAGACCAAGTTGTGGATTACTTCAGAAGAGCAGCA
CAAAGCATTATCCAGAGGATGATAATGTGTAGAATGGCCAAAATTGAGGTCGGTGTTGTG
AGCGTGTTTCTTTCGAGTGTGCACATGCATTAGCTTAGGCAACAGTAAAATATGCAGATT
TTGTTTGATTCAGACTTTAGAAATGAAATATGAGACAGTTATTGTTAATTTTATTGGTGG

ghuls commented 2 years ago

Does any of the following commands print any output (lines with only ">") ?

grep '^>$' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

grep '^>$' "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

Does samtools complain when you try to make an index for the FASTA files?

samtools faidx "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

samtools faidx "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

Can you upload the FASTA files somewhere?

stanaka6 commented 2 years ago

FASTA files are available from here.

grep '^>$' "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

The output is lines with only ">" (135 lines).

grep '^>$' "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

The output is a line with only ">".

samtools faidx "updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"
# output; first 10 lines
[W::fai_insert_index] Ignoring duplicate sequence "pibf1" at byte offset 6774376
[W::fai_insert_index] Ignoring duplicate sequence "dmd" at byte offset 7490531
[W::fai_insert_index] Ignoring duplicate sequence "tcf7l2" at byte offset 9022025
[W::fai_insert_index] Ignoring duplicate sequence "mxb" at byte offset 9926736
[W::fai_insert_index] Ignoring duplicate sequence "mdh1aa" at byte offset 13173492
[W::fai_insert_index] Ignoring duplicate sequence "fgf13a" at byte offset 15539401
[W::fai_insert_index] Ignoring duplicate sequence "ctbp2a" at byte offset 15872447
[W::fai_insert_index] Ignoring duplicate sequence "cttn" at byte offset 16307386
[W::fai_insert_index] Ignoring duplicate sequence "tnikb" at byte offset 16835923
[W::fai_insert_index] Ignoring duplicate sequence "acana" at byte offset 19803551

samtools faidx "Numbered_updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa"

Nothing complain.

I really appreciate your help.

ghuls commented 2 years ago

In the original FASTA file there are a lot of genes which have "Sequence unavailable"

# Number of "sequences"
$ grep -c '^>' updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa
57263

# Number of "sequences" without sequence.
$ grep -c '^Sequence unavailable' updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa
23955

# Number of sequences without sequence name.
$ grep -c '^>$' updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa
135

$ diff -u updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa | head -n 30
--- updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa 2021-08-06 10:48:11.649268597 +0200
+++ updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa   2021-08-06 11:00:29.057775376 +0200
@@ -169,8 +169,6 @@
 AATGGGCAGGAAAAAGCCACGTTGACGAAGTCGACGGTAGTCTGCGTTATTTTTCTGCCA
 GGAAACTGAAGGATTGTTTTTTTATTTAGAAGACAACGACGCGACTCTTTGGAAGAAGCG
 CTTGACAGCAGCC
->triob
-Sequence unavailable
 >stat1b
 TGTTGCCCAGGGCCGGAGTGGGACTCCTTTTCAGCCCTGGAGTTTCAAGCCTCAGACCGG
 CCCACCTCAGTTCACGACTGACTATATTAAAATAAGGTCATTTCCAATTCAGTTTCTAAT
@@ -693,8 +691,6 @@
 ACAGTGTTCATTTATAGCTACAGGAAATTTTTAGGCGGAGTGCAGATGAATGTTTGTAAA
 TGATGTTTGCCGTTGGATCACCTGGGCTAGTGTGTGTTTTACACTGCCTAATAATGTGTA
 CTGCAGGTCAAATGACCTTTAGCCTCTGTGTGTTTTGCAGACGGAGTGGGGACCGTA
->ccdc80
-Sequence unavailable
 >sema6e
 CTTAACAGATCCCCAGTAATAATCATAAACCTAATCTTAACCCAATACTGTGCTTTTTTA
 ATCTCTTGCTGCCTAAATGGAATATGATCCCAAACCTGCATTTAATCTTGATCCAGGGCC
@@ -2919,10 +2915,6 @@
 ATAAAGTAGTCGCTATATATCTGCTCGCTCTCACAAATTTGAGAAAAGCGTCTTGTCGTG
 GCAGGGGACGTTATACATGGTTATCCAAATCTTGGAATACGGACACTTATAGGTCCCCCA
 TCACATTACGTTTCTCAGAA
->tnni1a
-Sequence unavailable
->slc8a1b
-Sequence unavailable
 >gtf2e1
 CCGATGCTGAAACAATATACTGTGCAGCCCTAGTGTCAAGTCTTAAAAGTCAGAGCAAAA

Remove those sequences without sequences and sequences without sequence name:

awk '
    BEGIN {
        is_empty_sequence_name = 0;
    }
    {
        line = $0;

        if ( line ~ /^>$/ ) {
            is_empty_sequence_name = 1;
        } else if ( line ~ /^>/ ) {
            is_empty_sequence_name = 0;

            # Store sequence name line.
            sequence_name_line = line;
            # Get next line and store in sequence.
            getline sequence_line;
            # Print sequence name and first sequence line if it is a real sequence (else do not print anything).
            if ( sequence_line != "Sequence unavailable" ) {
                print sequence_name_line "\n" sequence_line;
            }
        } else if ( is_empty_sequence_name == 0 ) {
            # Print all other sequence lines if there was a sequence name for that sequence.
            print $0;
       }
    }' \
    updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.fa \
  > updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa

With the new FASTA file, the following should work:


create_cistarget_motif_databases.py \
    -f updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa \
    -M . -m /etc/passwd -o test. -t 1 -g '#?$'
fasta_filename="updown10k-5UTR_zf_GRCz11_genesymbol_biomart_nonunique_0804233944_128.with_sequence.fa"
motifs_dir="motifs_cb_format"
motifs_list_filename="motifs.lst"
db_prefix="zf1"
nbr_threads=10
# Search for 0 or 1 "#" at the end of the sequence names (will always be 0) to force it to consider all sequence names as genes instead of regions.
extract_gene_id_from_region_id_replace="#?$"

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \
    -f "${fasta_filename}" \
    -M "${motifs_dir}" \
    -m "${motifs_list_filename}" \
    -o "${db_prefix}" \
    -t "${nbr_threads}" \
    -g "${extract_gene_id_from_region_id_replace}"

stanaka6 commented 2 years ago

Hi, I was able to run create_cistarget_motif_databases.py using the cleaned FASTA file you explained above. Thank you so much for your help. I checked one of the feather files in R.

db <- importRankings("Test/zf1.genes_vs_motifs.rankings.feather", indexCol = "motifs")
View(db@rankings)

Now another two questions are:

Are the gene names automatically capitalized when creating the databases? I meant that zebrafish genes are a mixture of lower scale and capitals. For instance, cga and CGA are different genes. I would appreciate it if you could provide a way to keep the original characters of gene names in the motif databases. Should I modify the original python script?
I checked the Ensemble database; It seems that the genes that their 5'UTR have not been identified show "Sequence unavailable" in my original fasta data created from Ensemble Biomart. Of course, when removing the genes with "Sequence unavailable", those genes info was eliminated from the FASTA file. Thus, the FASTA file used for creating databases doesn't contain kind all genes. Do you think I need to adjust a parameter in SCENIC (like you mentioned previously)? Any thoughts on how it does affect downstream procedure and analysis?

Thank you!

stanaka6 commented 2 years ago

Hi,

Please ignore my last question1. I didn't check the entire data, and there is no conversion for lower and upper cases. My bad, sorry.

yanpinlu commented 1 year ago

Seeing that you have successfully built the zebrafish cisTarget_databases, can you share the data of the successful construction with me, I have not been able to build successfully @stanaka6

xiaozhongshen commented 1 year ago

@yanpinlu @stanaka6 Hello! Have you succeeded in creating the zebrafish cisTarget_database? I was still unable to created it, and can you share the file with me?

yanpinlu commented 1 year ago

I'm sorry, I haven't made it yet.

------------------ 原始邮件 ------------------ 发件人: "aertslab/create_cisTarget_databases" @.>; 发送时间: 2022年7月25日(星期一) 上午8:44 @.>; @.**@.>; 主题: Re: [aertslab/create_cisTarget_databases] Create a zebrafish db: question about the best fasta file (#8)

@yanpinlu @stanaka6 Hello! Have you succeeded in creating the zebrafish cisTarget_database? I was still unable to created it, and can you share the file with me?

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you were mentioned.Message ID: @.***>

xiaozhongshen commented 1 year ago

Thanks! 1817061039@stmail.ntu.edu.cn

xiaozhongshen commented 1 year ago

I'm so sorry. Can you send message to my gmail mailbox? My gmail is xiaozhongshen991001@gmail.com @yanpinlu

Sali120 commented 1 year ago

Hi,

I am having an issue as well. I got my fasta file but I am not sure how to get the motifs db and list correctly.

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" \ -f "${fasta_filename}" \ -M "${motifs_dir}" \ -m "${motifs_list_filename}" \ -o "${db_prefix}" \ -t "${nbr_threads}" \ -g "${extract_gene_id_from_region_id_replace}"

This my code.

I got this motif file https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_non-redundant_pfms_jaspar.txt

and I followed the gallus species issue and they said in python I can use

for jm in jms: motiffile = outdir.joinpath(str(jm.matrix_id + '.cb')) with open(motiffile, 'wt') as fh: fh.write(format(jm, "clusterbuster"))

and I got the motif directory and the list. But I am having trouble.

Error: None of 1956 motifs were scored successfully.

this is the error I get. I also have zebrafish genes. Can anyone tell me or send me their code how to prepare this motifs directory and file.

yanpinlu commented 1 year ago

Hi,

I am having an issue as well. I got my fasta file but I am not sure how to get the motifs db and list correctly.

"${create_cistarget_databases_dir}/create_cistarget_motif_databases.py" -f "${fasta_filename}" -M "${motifs_dir}" -m "${motifs_list_filename}" -o "${db_prefix}" -t "${nbr_threads}" -g "${extract_gene_id_from_region_id_replace}"

This my code.

I got this motif file https://jaspar.genereg.net/download/data/2022/CORE/JASPAR2022_CORE_non-redundant_pfms_jaspar.txt

and I followed the gallus species issue and they said in python I can use

for jm in jms: motiffile = outdir.joinpath(str(jm.matrix_id + '.cb')) with open(motiffile, 'wt') as fh: fh.write(format(jm, "clusterbuster"))

and I got the motif directory and the list. But I am having trouble.

Error: None of 1956 motifs were scored successfully.

this is the error I get. I also have zebrafish genes. Can anyone tell me or send me their code how to prepare this motifs directory and file.

You can refer to the method in the link https://github.com/aertslab/create_cisTarget_databases/issues/4

Mesi395 commented 1 year ago

Hi, I was able to run create_cistarget_motif_databases.py using the cleaned FASTA file you explained above. Thank you so much for your help. I checked one of the feather files in R.

Dear @stanaka6 , Do you mind sharing the database you created for zebrafish using "create_cistarget_motif_databases.py"? I would be really grateful. Thanks!

mtrebelo commented 1 year ago

@stanaka6 @Mesi395 @yanpinlu @xiaozhongshen @Sali120 Hi, did anyone end up creating the zebrafish database? Thanks a lot in advance.

aertslab / create_cisTarget_databases

Create a zebrafish db: question about the best fasta file #8