> @mtxellrb A script for creating a track database from bigWig TF ChIP-seq data is now added :`create_cistarget_track_databases.py`

MatthewTCManion commented 2 months ago

          > @mtxellrb A script for creating a track database from bigWig TF ChIP-seq data is now added :`create_cistarget_track_databases.py`

https://github.com/aertslab/create_cisTarget_databases#create_cistarget_track_databasespy

Hello , I am running into an issue using this script where the .bed file with regions to score is not recognized correctly, and I have tried a few different formats with no success. For reference, here is a screenshot of my most recent attempt to run the script, as well as the format of my .bed:

REGION_BED="/data/PetrosLab/Matt/scenicplus/chipseq/tracks/fwf_gene_assignments.bed"
DATABASE_PREFIX="CellType_750bp_with_binding"
SCRIPT_DIR="/data/PetrosLab/Matt/scenicplus/create_cisTarget_databases"
TRACKS_DIR="/data/PetrosLab/Matt/scenicplus/chipseq/tracks"
TRACK_LIST="track_names.txt"

"${SCRIPT_DIR}/create_cistarget_track_databases.py" \
    -b "${REGION_BED}" \
    -T "${TRACKS_DIR}" \
    -d "${TRACK_LIST}" \
    -o "${DATABASE_PREFIX}" \
    -t 20

I assume the issue is with the format of the .bed or the genes/regions data, but I can't find what the proper format should be.

Thanks, Matt

_Originally posted by @MatthewTCManion in https://github.com/aertslab/create_cisTarget_databases/issues/17#issuecomment-2317906079_

ghuls commented 2 months ago

Your BED file is not in BED format: chr start end name:

This will fix it:

tail -n +2 fwf_gene_assignments.bed | awk -F "\t" -v 'OFS=\t' '{ print $2, $3, $4, $1; }' > fwf_gene_assignments.fixed.bed

MatthewTCManion commented 2 months ago

Your BED file is not in BED format: chr start end name:

This will fix it:
tail -n +2 fwf_gene_assignments.bed | awk -F "\t" -v 'OFS=\t' '{ print $2, $3, $4, $1; }' > fwf_gene_assignments.fixed.bed

Thank you for your reply! I tried running it with this suggestion, and the BED file now seems to be in the correct format, but the same error came up:

Please let me know if you have any additional suggestions!

Best, Matt

MatthewTCManion commented 2 months ago

Okay so I was able to get the BED file into proper format,

and ran create_cistarget_track_databases.py,

but I am getting an issue with scoring the tracks for each of my 4 input .bw files:

ghuls commented 2 months ago

What is the output of for example the first bigWigAverageOverBed command?:

/data/PetrosLab/Conda/envs/create_cistarget_databases/bin/bigAveragedOverBed -minMax /data/PetrosLab/Matt/scenicplus/chipseq/tracks/MGE2-60_S13.cpm.norm.bw /data/PetrosLab/Matt/scenicplus/chipseq/tracks/consensus_tabbed.bed /dev/stdout

(best copy it from your slurm output as I might have made typos)

You probably should make the BED files like I suggested:

tail -n +2 fwf_gene_assignments.bed | awk -F "\t" -v 'OFS=\t' '{ print $2, $3, $4, $1; }' > fwf_gene_assignments.fixed.bed

with only 4 columns, as bigWigAverageOverBed probably does not like your 5th column.

Creating a track database also only would make some sense when you have several hundreds of tracks (even better thousands), instead of only 4.

MatthewTCManion commented 2 months ago

Thanks for your help,I dropped it to 4 columns and it seems to have ran and created the .feather rankings and scores.

MatthewTCManion commented 2 months ago

Creating a track database also only would make some sense when you have several hundreds of tracks (even better thousands), instead of only 4.

Can you clarify what you mean by this to ensure I am using this pipeline correctly?

I have run SCENIC+ using motifs and the motif database, which I understand has thousands of motifs.

My understanding is that I am creating the track database from my ChIP-seq data (one track per sample), and scoring binding in these tracks against the region BED created by running pycisTopic on my ATAC-seq data to create a database that can be used in the SCENIC+ pipeline as way to target regions where I have TF binding in my samples. Am I understanding that incorrectly?

If so, do you have any suggestions on the proper way to incorporate my TF-binding data into identification of GRNs?

Thanks, Matt

SeppeDeWinter commented 2 months ago

Hi @MatthewTCManion

Do you want to create a single database containing both motif and ChIP-seq scores? Or a database containing only ChIP-seq data?

The latter probably does not make a lot of sense, given that you only have few tracks calculating the enrichment values will be impossible.

The former might be a good idea, however in the context of SCENIC+ we have not tried this yet.

All the best,

Seppe

MatthewTCManion commented 2 months ago

@SeppeDeWinter Ideally, a database containing both motif and ChIP-seq scores. Thank you for the reply!

MatthewTCManion commented 2 months ago

@SeppeDeWinter Do you think it would be possible to use the outputs of a motif-discovery tool (we have successfully identified motifs in our ChIP-seq data using the MEME suite of tools) to create a motif database for SCENIC+ targeted to motifs in our ChIP-seq set?

ghuls commented 2 months ago

@MatthewTCManion That should be possible. Convert your motifs (from Homer/MEME/...) to ClusterBuster format (you can use BioPython for this if you want. Make sure that your PWM contains counts and not frequencies (else multiply by 100))) and use that together with our provide motif collection to make your own database. Later you will have to add your motif to the motif2tf table so if your motif is found, it will actually be used by the SCENIC+ analysis.

Against which TFs did you do ChIP-seq and are the motifs you obtain not in our motif collection?

MatthewTCManion commented 1 month ago

I will try that!

We're using Nkx2.1 for our ChIP-seq

ghuls commented 1 month ago

Looks like a motif for Nkx2-1 is at least in JASPAR, so it should be detected with our default motif collection already: https://jaspar.elixir.no/matrix/MA1994.2/

MatthewTCManion commented 1 month ago

We have seen some inconsistency between the Nkx2-1 motif between different databases, so one thing we' have done is to generate the motif from our own binding data in multiple Nkx2-1 expressing tissues to hopefully capture a more consistent motif, but I agree it should at least partially resemble the JASPAR Nkx2-1 motif

HOCOMOCO

JASPAR

aertslab / create_cisTarget_databases

> @mtxellrb A script for creating a track database from bigWig TF ChIP-seq data is now added :`create_cistarget_track_databases.py` #52