aertslab / create_cisTarget_databases

Create cisTarget databases

Can I use the bulk ATAC-seq fasta #27

Closed shangguandong1996 closed 1 year ago

shangguandong1996 commented 1 year ago

Hi,

I noticed that in https://github.com/aertslab/create_cisTarget_databases/issues/4, you mentioned:

the easiest approach, especially for species with few genomic resources, is to take regions that are associated with a gene by proximity.

So the reason for choosing upstream/downstream regions is that we have no regions representing TF binding, so we have to use regions proximal to genes. But if I have bulk ATAC-seq data from the same tissue and time point as my scRNA-seq data, can I extract FASTA sequences from the ATAC-seq peaks and then create a cisTarget database from them?

SeppeDeWinter commented 1 year ago

Hi @shangguandong1996

I presume you want to generate a database in order to run pySCENIC/SCENIC?

It is certainly possible to use the ATAC-seq data instead of regions proximal to the genes. It will, however, require some custom code to link the ATAC-seq peaks to genes (for instance, you could link each peak to its closest gene). This applies if you want to use the database to run pySCENIC/SCENIC; if you just want to do motif discovery, linking to genes is not necessary.

I hope this answers your question,

Best.

Seppe
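The "link each peak to its closest gene" idea mentioned above could be sketched roughly as follows. This is only an illustration, not code from this repository: the function names `build_tss_index` and `closest_gene`, the use of the peak midpoint, and the use of a single TSS position per gene are all assumptions, and the coordinates in the usage example are made up.

```python
from bisect import bisect_left
from collections import defaultdict


def build_tss_index(genes):
    """Group (chrom, tss_position, gene_name) records per chromosome, sorted by TSS.

    Hypothetical helper: one TSS per gene is an assumption of this sketch.
    """
    index = defaultdict(list)
    for chrom, tss, name in genes:
        index[chrom].append((tss, name))
    for chrom in index:
        index[chrom].sort()
    return index


def closest_gene(index, chrom, start, end):
    """Return the gene whose TSS is nearest to the peak midpoint, or None."""
    sites = index.get(chrom)
    if not sites:
        return None  # no annotated gene on this chromosome
    midpoint = (start + end) // 2
    # Position of the first TSS at or after the midpoint.
    i = bisect_left(sites, (midpoint, ""))
    # Only the TSS just before and just after the midpoint can be the closest.
    candidates = sites[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda site: abs(site[0] - midpoint))[1]


# Made-up coordinates, for illustration only.
index = build_tss_index([("chr1", 1000, "HCLS1"), ("chr1", 5000, "ARSA")])
print(closest_gene(index, "chr1", 1200, 1400))
print(closest_gene(index, "chr1", 4000, 4500))
```

In practice a tool such as bedtools could do the same job; the point here is only that each peak ends up with a gene name, which can then be used as the FASTA sequence ID.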

shangguandong1996 commented 1 year ago

Thanks for your reply :). I do want to generate a database for running SCENIC.

I also have another question. According to your comment in #4:

there can also be several regions per gene (e.g. when you have upstream regions + introns). in that case, the fasta has to contain numbered entries like this: 'create_cistarget_motif_databases.py' will then keep the maximum score per gene.

So I am wondering: if I link several peaks/regions to one gene, should I merge the peak sequences per linked gene, like this:

>HCLS1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaaGCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTTTCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAAATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAAACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC

or should I just produce a FASTA file like this:

>HCLS1#1
TTTCAGCGATTTTATTTTCAATTCCAAGGTACTTTTTACAAAAAAAAATG
TATGCAAAATTGACAAACACTGTTACaattaaaaaaataaaaaaataaaa
>HCLS1#2
GCATGCTTGTCTGACTCACATTTTTATTTTGATTTAATTTTTTTAGATTT
>HCLS1#3
TCAACGTAGAAAGTATGTTTATCCAATTAGTGACTAAGATTATGTTCCCT
>ARSA#1
TAATGCATTTTACAAGTCTCAAGAAATCTCAACAAATTTATAGTTAGCAA
>ARSA#2
ATGTGCTTCGCACTTTGGAATAGTAGAAATGTGGGGCGGGTGGGTGGGAA
ACCAACACGTAGAATGATGACAAAACGCCGCTGCGGCCGAGGAAAGATTC

Best wishes,
Guandong Shang

ghuls commented 1 year ago

Produce the second FASTA file and run create_cistarget_motif_databases.py later with the -g "#[0-9]+$" option.

create_cistarget_motif_databases.py ... -g "#[0-9]+$"

Preferably your regions should be a bit bigger, else Cluster-Buster will probably not work very well.
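As a rough sketch of the numbered-entry approach, the snippet below writes `gene#N` headers for regions already linked to genes and shows what the `#[0-9]+$` pattern matches. The helper names `numbered_fasta` and `gene_from_region_id` are hypothetical, and the assumption that the `-g` regex is used to strip the non-gene suffix from each region ID is my reading of the thread, not something verified against the script's source.

```python
import re
from collections import defaultdict


def numbered_fasta(linked_regions, line_width=50):
    """Build FASTA text with '>GENE#N' headers from (gene, sequence) pairs.

    Hypothetical helper: regions are numbered per gene in input order.
    """
    counts = defaultdict(int)
    lines = []
    for gene, seq in linked_regions:
        counts[gene] += 1
        lines.append(f">{gene}#{counts[gene]}")
        # Wrap the sequence at a fixed width, as in the FASTA examples above.
        lines.extend(seq[i:i + line_width] for i in range(0, len(seq), line_width))
    return "\n".join(lines) + "\n"


# Same pattern as passed to the -g option in this thread.
GENE_SUFFIX = re.compile(r"#[0-9]+$")


def gene_from_region_id(region_id):
    """Remove a trailing '#<number>' so that only the gene name remains."""
    return GENE_SUFFIX.sub("", region_id)


print(numbered_fasta([("HCLS1", "ACGT"), ("HCLS1", "GGCC"), ("ARSA", "TTAA")]), end="")
print(gene_from_region_id("HCLS1#2"))
```

Because the pattern is anchored with `$`, only a trailing `#` followed by digits is removed, so a gene name that itself contains digits is left intact.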

shangguandong1996 commented 1 year ago

Got it :). Thanks for your reply.

ghuls commented 1 year ago

If all your regions are small, you could first create a BED file with all your regions and use create_fasta_with_padded_bg_from_bed.sh with a bg_padding of e.g. 500. This adds 500 bp of flanking sequence to each side of your sequences, which Cluster-Buster will use to calculate the background nucleotide frequencies.

Then, when creating your database, pass the same number of bg_padding base pairs to the -b option:

create_cistarget_motif_databases.py ... -b 500 -g "#[0-9]+$"
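
To make the padding arithmetic concrete, here is a minimal sketch of extending a BED interval by bg_padding on each side. It is not the implementation of create_fasta_with_padded_bg_from_bed.sh: the function name `pad_bed_interval`, the `chrom_sizes` dictionary, and the choice to clip at chromosome ends are assumptions. In particular, a clipped region carries less than bg_padding of flank on one side, and how the real script treats such regions is not stated in this thread.

```python
def pad_bed_interval(chrom, start, end, chrom_sizes, bg_padding=500):
    """Extend a 0-based, half-open BED interval by bg_padding on both sides.

    Hypothetical helper: the result is clipped to [0, chromosome length],
    which is an assumption of this sketch, not documented script behaviour.
    """
    padded_start = max(start - bg_padding, 0)
    padded_end = min(end + bg_padding, chrom_sizes[chrom])
    return chrom, padded_start, padded_end


# Made-up chromosome size, for illustration only.
chrom_sizes = {"chr1": 10_000}
print(pad_bed_interval("chr1", 1000, 1200, chrom_sizes))
print(pad_bed_interval("chr1", 200, 400, chrom_sizes))
```

A 200 bp peak padded this way becomes a 1200 bp sequence, and `-b 500` then tells the database script how much of each side is background rather than the region to score.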

YaoLi3 commented 4 months ago

@ghuls Could you please elaborate on "small regions"? Generally speaking, when should a region be considered small? Thanks