aertslab / create_cisTarget_databases

Create cisTarget databases
40 stars 8 forks source link

Database creation for Killifish #53

Open juhi-ku opened 1 week ago

juhi-ku commented 1 week ago

Hello,

I am working on TBI in killifish using bioinformatics approach. I want to investigate killifish clusters and it’s subclusters as it is not very well studied. I want to discover the cell types and clusters and to do so I was planning to work on Gene Regulatory analysis using SCENIC as it has an option to make custom database. Since you and your other colleagues have made the software, I was hoping if you could help me in creating the database for killifish as well. For Killifish we have Ensemble genome (https://www.ensembl.org/Nothobranchius_furzeri/Info/Index ) and Transcription factor list (from the TFDB database) but we do not have any Motif information. Can you let me know if this information is sufficient to proceed? I had a look already into the codes on GitHub, but I still wanted to ask and confirm and seems a bit tough myself to proceed so if you could help me or let me know how it’s done, would be a great help in my research and to the killifish community as well!

ghuls commented 6 days ago

For the motif to TF table you could take the fly/mouse/human motif to TF tables we provide and match them to kilifish homologs (change the gene_name column with the kilifish homologs).

For your gene regions you can get 10kb upstream and downstream of each TSS (so 20kb regions) and/or another set of 500bp upstream and 100 bp downstream of the TSS for another database (depending on the species you might need to tweak those values a bit e.g 5kb up/down might work better for yours species).

You can use pycistopic tss to get some of the needed files relatively easily:

# Get Ensembl gene annotation name for Killifish.
$ pycistopic tss gene_annotation_list -f killifish
nfurzeri_gene_ensembl   Turquoise killifish genes (Nfu_20140520)

$ Get TSS for each gene from Ensembl.
$ pycistopic tss get_tss -n nfurzeri_gene_ensembl -o nfurzeri_gene_ensembl.tss.bed
- Get TSS annotation from Ensembl BioMart with the following settings:
  - biomart_name: "nfurzeri_gene_ensembl"
  - biomart_host: "http://www.ensembl.org"
  - transcript_type: ['protein_coding']
  - use_cache: True
- Writing TSS annotation BED file to "nfurzeri_gene_ensembl.tss.bed".

# Sort BED file with TSS.
$ sort -k 1,1 -k2,2n -k3,3n nfurzeri_gene_ensembl.tss.bed > nfurzeri_gene_ensembl.tss.sorted.bed

# Get chromosome name to size mapping.

# For most standard species the following would work, but not for Killifish apparently.

# Get killifish NCBI accession ID.
$ pycistopic tss get_ncbi_acc -s killifish
accession   assembly_name
GCF_027789165.1 UI_Nfuz_MZM_1.0
GCF_001465895.1 Nfu_20140520

$ pycistopic tss get_ncbi_chrom_sizes_and_alias_mapping --ncbi GCF_027789165.1 --chrom-sizes-alias nfurzeri.chrom_sizes_and_alias.tsv
Traceback (most recent call last):
  File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/bin/pycistopic", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/pycistopic.py", line 26, in main
    args.func(args)
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/subcommand/tss.py", line 491, in run_tss_get_ncbi_chrom_sizes_and_alias_mapping
    get_chrom_sizes_and_alias_mapping_from_ncbi(
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/cli/subcommand/tss.py", line 398, in get_chrom_sizes_and_alias_mapping_from_ncbi
    ga.get_chrom_sizes_and_alias_mapping_from_ncbi(
  File "/home/luna.kuleuven.be/u0079808/PycharmProjects/pycisTopic/src/pycisTopic/gene_annotation.py", line 475, in get_chrom_sizes_and_alias_mapping_from_ncbi
    chrom_sizes_and_alias_df_pl = pl.from_dicts(reports).select(
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 8873, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/luna.kuleuven.be/u0079808/software/anaconda3/envs/pycistopic_3.11/lib/python3.11/site-packages/polars/lazyframe/frame.py", line 2034, in collect
    return wrap_df(ldf.collect(callback))
                   ^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ColumnNotFoundError: ucsc_style_name

# For Killifish you can use get_chromosome_sizes: https://gist.github.com/andrewyatz/a3687b573364f65904e2
./get_chromosome_sizes.py 'nothobranchius_furzeri' > nothobranchius_furzeri.chrom_sizes.tsv

# Create BED file with regions 20kb up and downstream of TSS.
$ bedtools slop -l 20000 -r 20000 -s -i nfurzeri_gene_ensembl.tss.sorted.bed -g nothobranchius_furzeri.chrom_sizes.tsv > nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed

# Create BED file with regions 500bp upstream and 100bp downstream of TSS.
$ bedtools slop -l 500 -r 100 -s -i nfurzeri_gene_ensembl.tss.sorted.bed -g nothobranchius_furzeri.chrom_sizes.tsv > nfurzeri_gene_ensembl.tss.sorted.500bp_up_and_100bp_down.bed
ghuls commented 6 days ago
# Set Ensembl gene ID as gene column. 
❯ awk -F '\t' -v 'OFS=\t' '{print $1, $2, $3, $8}' nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed

# Set Ensembl gene ID as gene column. 
❯ awk -F '\t' -v 'OFS=\t' '{print $1, $2, $3, $8}' nfurzeri_gene_ensembl.tss.sorted.500bp_up_and_100bp_down.bed > nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.500bp_up_and_100bp_down.bed

❯ head nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed
==> nfurzeri_gene_ensembl.tss.sorted.20kb_up_and_down.bed <==
LN600519.1  357430  397431      .   -   protein_coding  ENSNFUG00015017460
LN600519.1  552219  592220  wdr43   .   -   protein_coding  ENSNFUG00015017465
LN600519.1  557140  597141  TRMT61B .   +   protein_coding  ENSNFUG00015017478
LN600519.1  562524  602525  spdya   .   -   protein_coding  ENSNFUG00015017485
LN600519.1  563086  603087  spdya   .   -   protein_coding  ENSNFUG00015017485
LN600519.1  563258  603259  spdya   .   -   protein_coding  ENSNFUG00015017485
LN600519.1  563784  603785      .   +   protein_coding  ENSNFUG00015017559
LN600519.1  611523  651524      .   -   protein_coding  ENSNFUG00015017563
LN600519.1  694463  734464      .   +   protein_coding  ENSNFUG00015017570
LN600519.1  789733  829734  LRFN2   .   -   protein_coding  ENSNFUG00015017567

==> nfurzeri_gene_ensembl.tss.sorted.with_ensembl_gene_id.20kb_up_and_down.bed <==
LN600519.1  357430  397431  ENSNFUG00015017460
LN600519.1  552219  592220  ENSNFUG00015017465
LN600519.1  557140  597141  ENSNFUG00015017478
LN600519.1  562524  602525  ENSNFUG00015017485
LN600519.1  563086  603087  ENSNFUG00015017485
LN600519.1  563258  603259  ENSNFUG00015017485
LN600519.1  563784  603785  ENSNFUG00015017559
LN600519.1  611523  651524  ENSNFUG00015017563
LN600519.1  694463  734464  ENSNFUG00015017570
LN600519.1  789733  829734  ENSNFUG00015017567

Files from previous post attached: nfurzeri_gene_ensembl.tss.sorted.zip

juhi-ku commented 5 hours ago

Hi, thank you! But where is this BED files needed? To form the create_cisTarget_databases I need fasta file and motif file right? The tutorial is quite confusing. Can you please just tell what to do after this?