Creating custom regions_vs_motifs.rankings.feather file

YaoLi3 commented 10 months ago

Dear team,

Thanks a lot for these amazing tools!

While making custom cisTarget database files, I encountered a few problems:

Q1: If my fasta file's headers only contain gene names, does this mean I couldn't generate a regions_vs_motifs.rankings.feather out of the current fasta?

e.g.
>MSTRG.1
CCAGGACCGGTTCAGACAATCGTCACCGCCGTTGGCCTCTGGTGCGGGAATCGAACTCGGGTCACCAGGTTCGTAGCGCTAACCGCTACACCACCGCTCCCACAATGCTCCATACAAAGACGAAGATCCCCCGTGTAGCCTTAACAGACTGTTGGGGCAAGTGCTGTTGGCGAGCGCCCACCACCACTTTCATGCTTTTTTTTTTT

Is there any way that I could find or create a "region" fasta file for my species of interest? I'm very new to this field so I'm sorry if my questions don't make sense.

Q2: Assuming I have a regions_vs_motifs.rankings.feather already and the regions are named like chrX:111-333. The fragment names in the ATAC-seq data and the peak names generated by MACS2 are all very different from chrX:111-333. Do I need to manually ensure region names are consistent across various datasets?

Thanks in advance.

Best, Yao

ghuls commented 10 months ago

Q1: If my fasta file's headers only contain gene names, does this mean I couldn't generate a regions_vs_motifs.rankings.feather out of the current fasta?
e.g.
>MSTRG.1
CCAGGACCGGTTCAGACAATCGTCACCGCCGTTGGCCTCTGGTGCGGGAATCGAACTCGGGTCACCAGGTTCGTAGCGCTAACCGCTACACCACCGCTCCCACAATGCTCCATACAAAGACGAAGATCCCCCGTGTAGCCTTAACAGACTGTTGGGGCAAGTGCTGTTGGCGAGCGCCCACCACCACTTTCATGCTTTTTTTTTTT

For both gene-based and region-based databases, the sequences you score should be regulatory regions, not the sequence of the gene itself:

For a region-based database, you would use e.g. consensus peaks generated from your scATAC data.
For a gene-based database, you would take the sequences 5 kb or 10 kb upstream and downstream of each TSS and later assume that those sequences regulate that gene (which disregards distal enhancers).
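The gene-based option can be sketched in a few lines of Python. This is only an illustration under stated assumptions: genes are given as BED-style intervals (0-based, half-open) with a strand, the gene coordinates below are made up, and tss_window is a hypothetical helper, not part of create_cisTarget_databases.

```python
def tss_window(chrom, start, end, strand, flank=5000, chrom_size=None):
    """Return a BED-style (chrom, start, end) window of +/- `flank` bp around the TSS.

    The gene is described by its BED interval (0-based, half-open) and strand.
    """
    # The TSS is the first base of the gene on the + strand
    # and the last base of the gene on the - strand.
    tss = start if strand == "+" else end - 1
    # Clip at the chromosome start so the window never goes negative.
    win_start = max(0, tss - flank)
    win_end = tss + flank + 1
    # Optionally clip at the chromosome end when its size is known.
    if chrom_size is not None:
        win_end = min(win_end, chrom_size)
    return chrom, win_start, win_end


# Hypothetical gene coordinates, for illustration only.
genes = [
    ("chr1", 20000, 25000, "+", "MSTRG.1"),
    ("chr1", 3000, 9000, "-", "MSTRG.2"),
]

for chrom, start, end, strand, name in genes:
    c, s, e = tss_window(chrom, start, end, strand, flank=5000)
    # One BED line per gene: the window is what you would score, not the gene body.
    print(f"{c}\t{s}\t{e}\t{name}")
```

The resulting BED lines describe the flanking windows whose sequences would be scored, with each window labelled by the gene it is assumed to regulate.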

Is there any way that I could find or create a "region" fasta file for my species of interest? I'm very new to this field so I'm sorry if my questions don't make sense.

Yes, if you have ATAC-seq or scATAC data (or the ATAC part of a multiome dataset), or can find such data for your species of interest. You then need to create consensus peaks (preferably call peaks for each cell type and merge all of those peak files into one consensus peak file). From that peak (BED) file you can get the associated sequences with bedtools getfasta.
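As a rough sketch of the merging step, the Python below takes the union of per-cell-type peak lists and merges overlapping or directly adjacent intervals, similar in spirit to bedtools merge. This is an assumption about what "merge" means here, not necessarily how consensus peaks should be built for every dataset; merge_peaks and the peak coordinates are hypothetical.

```python
from collections import defaultdict


def merge_peaks(peak_sets):
    """Merge several lists of (chrom, start, end) peaks into one consensus list.

    Intervals that overlap or touch on the same chromosome are fused into one.
    """
    by_chrom = defaultdict(list)
    for peaks in peak_sets:
        for chrom, start, end in peaks:
            by_chrom[chrom].append((start, end))

    consensus = []
    for chrom in sorted(by_chrom):
        merged = []
        # Sorting by start lets us merge in a single left-to-right pass.
        for start, end in sorted(by_chrom[chrom]):
            if merged and start <= merged[-1][1]:
                # Overlaps (or is adjacent to) the previous interval: extend it.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        consensus.extend((chrom, s, e) for s, e in merged)
    return consensus


# Hypothetical peaks called separately for two cell types.
cell_type_a = [("chrX", 100, 300), ("chrX", 500, 700)]
cell_type_b = [("chrX", 250, 400), ("chr2", 10, 50)]

for chrom, start, end in merge_peaks([cell_type_a, cell_type_b]):
    # Three-column BED output, usable as input for bedtools getfasta.
    print(f"{chrom}\t{start}\t{end}")
```

Saved to a file, this output could then be passed to something like bedtools getfasta -fi genome.fa -bed consensus_peaks.bed -fo consensus_peaks.fa, where the three file names are placeholders.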

Q2: Assuming I have a regions_vs_motifs.rankings.feather already and the regions are named like chrX:111-333. The fragment names in the ATAC-seq data and the peak names generated by MACS2 are all very different from chrX:111-333. Do I need to manually ensure region names are consistent across various datasets?

As long as the chromosome names match, there should not be a problem. The region names are constructed from the first 3 columns of your BED file and put together like this: chr:start-end. If you run SCENIC+, it will overlap your regions with the regions in the database, so no exact match is needed.
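To illustrate the naming and why only the chromosome names have to agree, here is a small Python sketch. It is not the SCENIC+ implementation; region_name, parse_region and overlaps are hypothetical helpers, and coordinates are assumed to be BED-style (half-open).

```python
def region_name(chrom, start, end):
    """Build a region name from the first three BED columns."""
    return f"{chrom}:{start}-{end}"


def parse_region(name):
    """Split 'chr:start-end' back into (chrom, start, end)."""
    # rsplit keeps any ':' inside the chromosome name intact.
    chrom, coords = name.rsplit(":", 1)
    start, end = coords.split("-")
    return chrom, int(start), int(end)


def overlaps(region_a, region_b):
    """True if both regions are on the same chromosome and share at least 1 bp."""
    chrom_a, start_a, end_a = parse_region(region_a)
    chrom_b, start_b, end_b = parse_region(region_b)
    return chrom_a == chrom_b and start_a < end_b and start_b < end_a


database_region = region_name("chrX", 111, 333)

# A peak with different coordinates still matches as long as it overlaps.
print(overlaps(database_region, "chrX:300-500"))
# The same coordinates with a different chromosome naming style do not match.
print(overlaps(database_region, "X:300-500"))
```

The second comparison fails only because "X" and "chrX" are different strings, which is the kind of mismatch that has to be fixed before overlapping.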

YaoLi3 commented 10 months ago

I see. Thanks a lot for your swift and detailed reply.

aertslab / create_cisTarget_databases

Creating custom regions_vs_motifs.rankings.feather file #46