Closed shangguandong1996 closed 1 year ago
Hi @shangguandong1996
I presume you want to generate a database in order to run pySCENIC/SCENIC?
It is certainly possible to use ATAC-seq data instead of regions proximal to the genes. It will, however, require some custom code to link the ATAC-seq peaks to genes (you could link each peak to its closest gene, for instance). That linking is only needed if you want to use the database to run pySCENIC/SCENIC; if you just want to do motif discovery, linking to genes is not necessary.
I hope this answers your question,
Best.
Seppe
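The "custom code" for linking peaks to their closest gene could look roughly like the following. This is a minimal sketch, not part of create_cisTarget_databases: the function name `link_peaks_to_closest_gene`, the tuple layouts, and the choice of measuring distance from the peak midpoint to the gene's TSS are all assumptions made for illustration.

```python
from bisect import bisect_left
from collections import defaultdict


def link_peaks_to_closest_gene(peaks, genes):
    """Assign each peak to the gene whose TSS is closest to the peak midpoint.

    peaks: iterable of (chrom, start, end)
    genes: iterable of (chrom, tss, gene_name)
    Returns a list of (chrom, start, end, gene_name). Peaks on chromosomes
    without any annotated gene are skipped.
    """
    # Group TSS positions per chromosome and sort them for binary search.
    tss_by_chrom = defaultdict(list)
    for chrom, tss, name in genes:
        tss_by_chrom[chrom].append((tss, name))
    for entries in tss_by_chrom.values():
        entries.sort()

    linked = []
    for chrom, start, end in peaks:
        entries = tss_by_chrom.get(chrom)
        if not entries:
            continue
        mid = (start + end) // 2
        # Only the TSS just before and just after the midpoint can be closest.
        i = bisect_left(entries, (mid, ""))
        candidates = entries[max(i - 1, 0):i + 1]
        _, name = min(candidates, key=lambda e: abs(e[0] - mid))
        linked.append((chrom, start, end, name))
    return linked
```

In practice a tool such as bedtools is often used for this step; the sketch only shows the logic of the assignment.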
Thanks for your reply :). I do want to generate a database for running SCENIC.
I also have another question. According to your comment in #4:
there can also be several regions per gene (e.g. when you have upstream regions + introns). in that case, the fasta has to contain numbered entries like this: 'create_cistarget_motif_databases.py' will then keep the maximum score per gene.
So I am wondering: if I link several peaks/regions to one gene, should I merge the peak sequences by linked gene, like
>HCLS1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaaGCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTTTCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAAATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAAACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC
or should I just produce a FASTA like
>HCLS1#1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaa
>HCLS1#2
GCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTT
>HCLS1#3
TCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA#1
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAA
>ARSA#2
ATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAA
ACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC
Best wishes,
Guandong Shang
Produce the second FASTA file and later run create_cistarget_motif_databases.py with the -g "#[0-9]+$" option:
create_cistarget_motif_databases.py ... -g "#[0-9]+$"
Preferably your regions should be a bit bigger, else Cluster-Buster will probably not work very well.
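To make the numbered-entry convention concrete, here is a small sketch of how such a FASTA could be written and how a pattern like "#[0-9]+$" collapses region IDs back to gene names. The helper names `numbered_fasta` and `max_score_per_gene` are hypothetical, and the second function only illustrates the "keep the maximum score per gene" idea quoted from #4; it is not the code used by create_cistarget_motif_databases.py.

```python
import re
from collections import defaultdict

# Same pattern as passed to -g in the command above.
GENE_SUFFIX = re.compile(r"#[0-9]+$")


def numbered_fasta(linked_sequences):
    """Build FASTA text with IDs like GENE#1, GENE#2 from (gene, sequence) pairs."""
    counts = defaultdict(int)
    lines = []
    for gene, seq in linked_sequences:
        counts[gene] += 1
        lines.append(f">{gene}#{counts[gene]}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


def max_score_per_gene(region_scores):
    """Strip the '#N' suffix from each region ID and keep the best score per gene."""
    best = {}
    for region_id, score in region_scores.items():
        gene = GENE_SUFFIX.sub("", region_id)
        best[gene] = max(score, best.get(gene, float("-inf")))
    return best
```

For example, region IDs HCLS1#1, HCLS1#2 and HCLS1#3 all reduce to HCLS1 once the suffix is removed, so only the highest-scoring region represents that gene.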
Got it :). Thanks for your reply.
If all your regions are small, you could first create a BED file with all your regions and use create_fasta_with_padded_bg_from_bed.sh with a bg_padding of e.g. 500 to add 500 bp of flanking sequence to each side of your sequences, which Cluster-Buster will use to calculate the background nucleotide frequency.
Then, when creating your database, pass the same number of bg_padding base pairs to the -b option:
create_cistarget_motif_databases.py ... -b 500 -g "#[0-9]+$"
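As an illustration of what padding a region by bg_padding means in terms of coordinates, the sketch below extends a BED interval on both sides and clamps it to the chromosome bounds. The function `pad_bed_interval` and the `chrom_sizes` dictionary are assumptions for this example; it does not reproduce how create_fasta_with_padded_bg_from_bed.sh is implemented.

```python
def pad_bed_interval(chrom, start, end, bg_padding, chrom_sizes):
    """Extend a BED interval by bg_padding bp on each side.

    BED coordinates are 0-based and half-open, so the padded interval is
    clamped to [0, chromosome length].
    """
    padded_start = max(0, start - bg_padding)
    padded_end = min(chrom_sizes[chrom], end + bg_padding)
    return chrom, padded_start, padded_end
```

Note that a region near a chromosome end may receive less than bg_padding bp of flanking sequence on that side.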
@ghuls Could you please elaborate on "small regions"? Generally speaking, when should a region be considered small? Thanks
Hi,
I noticed that according to https://github.com/aertslab/create_cisTarget_databases/issues/4, you mentioned
So the reason for choosing upstream/downstream regions is that we do not have regions that represent TF binding, so I have to use gene-proximal regions. But if I have bulk ATAC-seq data from the same tissue and the same time point as my scRNA-seq, can I use the ATAC-seq peaks to get a FASTA and then create a cisTarget database?