SCENIC+ is a Python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Allowing custom species for pycistarget #56

Goultard59 commented 1 year ago

Tested with custom pig annotation, pycistarget working well.

SeppeDeWinter commented 1 year ago


This looks like a nice PR! Thank you. I was thinking of making this bit of code a bit cleaner as well, so it's nice to see some community effort.

Could you provide a snippet of code on how you used it with custom pig annotation? (might be nice to add to the tutorials as well).

I will give the code a try and if it works I'll merge it in the develop branch!



SidG13 commented 1 year ago

Hi, thank you both for making this tool more user-friendly. I used the modified code from @Goultard59, and in my attempt it works well up until the 'Creating contrast groups' step in the wrapper function, where I get a KeyError.

KeyError: 'Chromosome'

In the modified code, I provide the explicit path to my TSS annotation file (i.e., annot_dem = /path/to/file/), where the file is formatted as such:


Any suggestions would be helpful, thanks!

Goultard59 commented 1 year ago

@SidG13 you need to provide a pandas DataFrame to the annot_dem parameter.

import os
import pandas as pd
from scenicplus.wrappers.run_pycistarget import run_pycistarget

annot = pd.read_csv("/home/adufour/work/scenic_omics/annot.csv")

tmp_dir = "/home/adufour/work/tmp/"
# region_sets, work_dir, rankings_db, scores_db, motif_annotation and tss
# are assumed to be defined earlier in the session.
run_pycistarget(
    region_sets = region_sets,
    species = 'custom',
    custom_annot = tss,
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    n_cpu = 5,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
)
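
The KeyError: 'Chromosome' reported earlier in the thread is what one would expect when a file path, rather than a DataFrame, reaches code that indexes the annotation by column name. A minimal sketch of a check to run before calling the wrapper follows. The helper name check_custom_annot is hypothetical, and the column list is an assumption taken from the annotation CSV described in this thread, not from the pycistarget source.

```python
import io
import pandas as pd

# Assumption: column names taken from the annotation CSV described in this thread.
REQUIRED_COLUMNS = ["Chromosome", "Start", "Strand", "Gene", "Transcript_type"]

def check_custom_annot(annot):
    """Return the list of missing columns; raise TypeError if annot is not a DataFrame."""
    if not isinstance(annot, pd.DataFrame):
        raise TypeError(f"expected a pandas DataFrame, got {type(annot).__name__}")
    return [col for col in REQUIRED_COLUMNS if col not in annot.columns]

# Hypothetical two-row annotation standing in for a real CSV file.
csv_text = (
    "Chromosome,Start,Strand,Gene,Transcript_type\n"
    "1,1000,1,GENE_A,protein_coding\n"
    "2,5000,-1,GENE_B,protein_coding\n"
)
annot = pd.read_csv(io.StringIO(csv_text))
missing = check_custom_annot(annot)
print("missing columns:", missing)
```

Passing a path string to check_custom_annot raises a TypeError immediately, which is easier to interpret than a KeyError raised deep inside the wrapper.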

Here is my R script to generate annotation files from a GTF


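# Packages assumed from the functions used below
library(GenomicFeatures)  # makeTxDbFromGFF, transcripts, genes
library(Repitools)        # annoGR2DF
library(ArchR)            # createGenomeAnnotation

txdb_double <- makeTxDbFromGFF("/home/adufour/work/genome/Sus_scrofa.11.1.102.quant3p_extended.updated.sorted.gtf", format="gtf")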


#get transcript from GTF
transcripts_list <- transcripts(txdb_double, columns=c("tx_id", "tx_name","gene_id","tx_type"))

#Format the dataframe properly
df <- annoGR2DF(transcripts_list)
df$gene_id <- unlist(df$gene_id)
df$strand <- as.character(df$strand)
df["strand"][df["strand"] == "+"] <- 1
df["strand"][df["strand"] == "-"] <- -1
df$tx_type <- "protein_coding"
df <- df[,c("chr", "start", "strand", "gene_id", "tx_type")]
colnames(df) <- c("Chromosome", "Start", "Strand", "Gene", "Transcript_type")


write.csv(df,"/home/adufour/work/scenic_omics/annot_cistarget.csv", row.names = FALSE)

#Now annotation for scenic plus
transcripts_list_2 <- transcripts(txdb_double, columns=c("tx_id", "tx_name","gene_id","tx_type"))

#keep genes for later
gf_genes <- GenomicFeatures::genes(txdb_double)
gene <- annoGR2DF(gf_genes)

#Get TSS position
tss_df <- annoGR2DF(transcripts_list_2)
tss_df$tss <- 0
tss_df$tss <- ifelse(tss_df$strand == "+", tss_df$start, tss_df$end)

tss_df$gene_id <- unlist(tss_df$gene_id)

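gene_start = function(x){
  # x is one row of tss_df; look up its gene by column name rather than position
  A = x[["gene_id"]]

  # return the start coordinate of the matching gene
  return(gene$start[gene$gene_id == A])
}

gene_end = function(x){
  A = x[["gene_id"]]

  # return the end coordinate of the matching gene
  return(gene$end[gene$gene_id == A])
}
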
#Add gene start/end
tss_df$Start <- apply(tss_df, 1, gene_start)
tss_df$End <- apply(tss_df, 1, gene_end)

#Format the dataframe properly
tss_df <- tss_df[,c("chr", "Start", "End", "strand", "gene_id", "tss", "tx_type")]
colnames(tss_df) <- c("Chromosome", "Start", "End", "Strand", "Gene", "Transcription_Start_Site", "Transcript_type")

write.csv(tss_df,"/home/adufour/work/scenic_omics/tss_scp.csv", row.names = FALSE)

#Maybe not the easiest way to get chromosome sizes
genomeAnnotation <- createGenomeAnnotation(SuscrofaTxdb.11.102.fixed, standard = FALSE)

#Format the dataframe properly
genome_df <- annoGR2DF(genomeAnnotation@listData$chromSizes)
genome_df <- genome_df[,c("chr", "start", "end")]
colnames(genome_df) <- c("Chromosome", "Start", "End")

write.csv(genome_df,"/home/adufour/work/scenic_omics/chromsize.csv", row.names = FALSE)
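
For readers who prefer to stay in Python, the TSS step of the R script above can be sketched with pandas. This is a simplified illustration, not a port: the input table is hypothetical, the function name add_tss is made up for this example, and Start/End here are transcript coordinates rather than the gene-level coordinates the R script looks up.

```python
import pandas as pd

# Hypothetical transcript table; in practice this would come from a parsed GTF.
transcripts = pd.DataFrame({
    "chr": ["1", "1", "2"],
    "start": [100, 500, 900],
    "end": [400, 800, 1500],
    "strand": ["+", "-", "+"],
    "gene_id": ["GENE_A", "GENE_B", "GENE_C"],
})

def add_tss(df):
    """Add a TSS column and rename columns to the layout used in this thread."""
    out = df.copy()
    # The TSS is the start coordinate on the + strand and the end coordinate on the - strand.
    out["Transcription_Start_Site"] = out["start"].where(out["strand"] == "+", out["end"])
    # Encode strand as 1 / -1, as the R script does.
    out["Strand"] = out["strand"].map({"+": 1, "-": -1})
    out = out.rename(columns={"chr": "Chromosome", "start": "Start",
                              "end": "End", "gene_id": "Gene"})
    return out[["Chromosome", "Start", "End", "Strand", "Gene", "Transcription_Start_Site"]]

tss_df = add_tss(transcripts)
print(tss_df)
```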

SidG13 commented 1 year ago

Hi @Goultard59, thanks for your response. Maybe it's a very simple misunderstanding, but I'm trying to pass the pandas DataFrame containing my TSS information to the run_pycistarget wrapper, and I'm still getting a NameError.

Here's what I'm running:

import os
import pandas as pd
from scenicplus.wrappers.run_pycistarget import run_pycistarget

tmp_dir = '/home/administrator/Desktop/tmp'
tss = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')

# region_sets, work_dir, rankings_db, scores_db and motif_annotation
# are assumed to be defined earlier in the session.
run_pycistarget(
    region_sets = region_sets,
    species = 'custom',
    custom_annot = tss, # also have tried annot_dem = tss
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    n_cpu = 8,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    annotation_version = 'v1',
)

But the error I'm getting is:

    145     annot, annot_dem = get_species_annotation('ggallus_gene_ensembl')
    146 elif species == 'custom':
--> 147     annot_dem = custom_annot
    148     annot = annot_dem.copy()
    149     # Define promoter space

NameError: name 'custom_annot' is not defined


SeppeDeWinter commented 1 year ago

@Goultard59 and @SidG13

Thanks for the example and thanks for testing.

The error NameError: name 'custom_annot' is not defined occurs because custom_annot is not defined as a parameter in the run_pycistarget function.

@Goultard59 I'm now testing your PR with data from a non-model organism. I did add the parameter to the function. I will push soon if everything is working and close this PR.
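
A quick way to check for this kind of mismatch before running a long job is to inspect the function signature. The sketch below uses a hypothetical stand-in function, not the real run_pycistarget. It assumes the wrapper absorbs unknown keywords through **kwargs, which would explain why the call fails with a NameError inside the function instead of a TypeError at the call site; that assumption is not confirmed in this thread.

```python
import inspect

def run_wrapper(region_sets, species, **kwargs):
    """Hypothetical stand-in: the body reads custom_annot, but the signature does not declare it."""
    if species == "custom":
        return custom_annot  # neither a parameter nor a global, so this raises NameError
    return None

def is_named_parameter(func, name):
    """True only if `name` is declared explicitly, not merely absorbed by **kwargs."""
    param = inspect.signature(func).parameters.get(name)
    return param is not None and param.kind is not inspect.Parameter.VAR_KEYWORD

declared = is_named_parameter(run_wrapper, "custom_annot")

try:
    # The keyword is silently collected into kwargs, then the body fails.
    run_wrapper({}, "custom", custom_annot="annotation")
    error_name = None
except NameError as exc:
    error_name = type(exc).__name__

print("declared:", declared, "| error:", error_name)
```

With the real package, the same check would be is_named_parameter(run_pycistarget, "custom_annot") against whichever version is installed.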



JoGraesslin commented 1 year ago

Hello, I would like to reopen this discussion, as I am running into somewhat similar problems. I am trying to set up SCENIC+ for zebrafish. I used the Ensembl genome annotation for alignment and followed the workflow until pycistarget:

import os
from scenicplus.wrappers.run_pycistarget import run_pycistarget

# region_sets, work_dir, tmp_dir, rankings_db, scores_db and motif_annotation
# are assumed to be defined earlier in the session.
run_pycistarget(
    region_sets = region_sets,
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    n_cpu = 8,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
)

It runs without an error message, but the output is missing annotation:


I have created a .tbl file from JASPAR using orthology databases:


And a custom_annotation file based on the .gtf file that looks like this:


Do you have an idea what went wrong? I have used the Ensembl annotation all the way; does the function only work with UCSC chromosomes? I have also tried to update my .tbl file in the same way as SidG13 did. In this case, the Direct_annot and Orthology_annot columns are missing completely.

Any help would be appreciated, since I don't know anymore what else to try. Thanks!

JoGraesslin commented 1 year ago

I have tried it now as well with the UCSC chromosomes, and I run into the same problem. Please let me know if you need more information.

SeppeDeWinter commented 1 year ago

Hi @JoGraesslin

Sorry for the late reply.

You don't have any annotations because your motif_annotation .tbl file is not formatted correctly. This code is used to load the file:

It requires having information in the description column (I know it is a bit cumbersome). See for example

Basically the description should contain the word "orthologous" in your case.
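
A rough way to check this yourself is to load the .tbl file with pandas and count how many rows have a usable description. The sketch below runs on a few hypothetical rows held in memory. Only the description column and the keyword "orthologous" come from this thread; the other column names and the row contents are assumptions for illustration.

```python
import io
import pandas as pd

# Hypothetical rows; column names other than `description` are assumptions.
tbl_text = (
    "#motif_id\tgene_name\tdescription\n"
    "motif_1\tgene_a\tgene is directly annotated\n"
    "motif_2\tgene_b\tgene is orthologous to GENE_X which is directly annotated for motif\n"
    "motif_3\tgene_c\t\n"
)

# For a real file, replace io.StringIO(tbl_text) with the path to the .tbl file.
tbl = pd.read_csv(io.StringIO(tbl_text), sep="\t")

# Empty description fields are read as NaN, so normalise them to empty strings.
desc = tbl["description"].fillna("")

summary = {
    "orthologous": int(desc.str.contains("orthologous").sum()),
    "empty": int((desc.str.strip() == "").sum()),
    "total": len(tbl),
}
print(summary)
```

If the orthologous count is zero for a file built from orthology data, the description column is the first place to look.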

We should make this a bit more user-friendly.

I hope this helps.

Feel free to reach out if you still encounter issues.

p.s. as a tip. During troubleshooting you can always use to try to read the annotation file.
