Allowing custom species for pycistarget

Goultard59 commented 1 year ago

Tested with custom pig annotation, pycistarget working well.

SeppeDeWinter commented 1 year ago

Hi

This looks like a nice PR! Thank you, I was thinking of making this a bit of code a bit cleaner as well so nice to see some community effort.

Could you provide a snippet of code on how you used it with custom pig annotation? (might be nice to add to the tutorials as well).

I will give the code a try and if it works I'll merge it in the develop branch!

Best,

Seppe

SidG13 commented 1 year ago

Hi, thank you both for making this tool more user-friendly. I used the modified run_pycistarget.py from @Goultard59 and from my attempt, it works well up until the 'Creating contrast groups' step in the wrapper function where I get a KeyError.

KeyError: 'Chromosome'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In [18], line 6
      1 from scenicplus.wrappers.run_pycistarget import run_pycistarget
      3 #custom_annot = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')
      4 #custom_annot
----> 6 run_pycistarget(
      7     region_sets = region_sets,
      8     species = 'custom',
      9     save_path = os.path.join(work_dir, 'motifs'),
     10     ctx_db_path = rankings_db,
     11     dem_db_path = scores_db,
     12     path_to_motif_annotations = motif_annotation,
     13     run_without_promoters = True,
     14     n_cpu = 8,
     15     _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
     16     annotation_version = 'v1',
     17     )

File ~/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/scenicplus/src/scenicplus/wrappers/run_pycistarget.py:243, in run_pycistarget(region_sets, species, save_path, save_partial, ctx_db_path, dem_db_path, run_without_promoters, biomart_host, promoter_space, ctx_auc_threshold, ctx_nes_threshold, ctx_rank_threshold, dem_log2fc_thr, dem_motif_hit_thr, dem_max_bg_regions, annotation, motif_similarity_fdr, path_to_motif_annotations, annotation_version, n_cpu, _temp_dir, exclude_motifs, exclude_collection, **kwargs)
    241     for col in exclude_collection:
    242         dem_db.db_scores = dem_db.db_scores[~dem_db.db_scores.index.str.contains(col)]
--> 243 menr['DEM_'+key+'_All'] = DEM(dem_db = dem_db,
    244                    region_sets = regions,
    245                    log2fc_thr = dem_log2fc_thr,
    246                    motif_hit_thr = dem_motif_hit_thr,
    247                    max_bg_regions = dem_max_bg_regions,
    248                    specie = species,
    249                    genome_annotation = annot_dem,
    250                    promoter_space = promoter_space,
    251                    motif_annotation =   annotation,
    252                    motif_similarity_fdr = motif_similarity_fdr, 
    253                    path_to_motif_annotations = path_to_motif_annotations,
    254                    n_cpu = n_cpu,
    255                    annotation_version = annotation_version,
    256                    tmp_dir = save_path,
    257                    _temp_dir= _temp_dir,
    258                    **kwargs)
    259 out_folder = os.path.join(save_path,'DEM_'+key+'_All')
    260 check_folder = os.path.isdir(out_folder)

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:319, in DEM.__init__(self, dem_db, region_sets, specie, subset_motifs, contrasts, name, max_bg_regions, adjpval_thr, log2fc_thr, mean_fg_thr, motif_hit_thr, n_cpu, fraction_overlap, cluster_buster_path, path_to_genome_fasta, path_to_motifs, genome_annotation, promoter_space, path_to_motif_annotations, annotation_version, motif_annotation, motif_similarity_fdr, orthologous_identity_threshold, tmp_dir, **kwargs)
    317 self.cistromes = None
    318 if dem_db is not None:
--> 319     self.run(dem_db.db_scores, **kwargs)

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:365, in DEM.run(self, dem_db_scores, **kwargs)
    363 # Get region groups
    364 log.info('Creating contrast groups')
--> 365 region_groups = [create_groups(contrast = contrasts[x],
    366                            region_sets_names = region_sets_names,
    367                            max_bg_regions = self.max_bg_regions,
    368                            path_to_genome_fasta = self.path_to_genome_fasta,
    369                            path_to_regions_fasta = os.path.join(self.tmp_dir, contrasts_names[x] +'.fa'),  
    370                            cbust_path = self.cluster_buster_path,
    371                            path_to_motifs = self.path_to_motifs,
    372                            annotation = self.genome_annotation,
    373                            promoter_space = self.promoter_space,
    374                            motifs = dem_db_scores.index.tolist(),
    375                            n_cpu = self.n_cpu,
    376                            **kwargs) for x in range(len(contrasts))]
    378 # Compute p-val and log2FC
    379 if self.n_cpu > len(region_groups):

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:365, in <listcomp>(.0)
    363 # Get region groups
    364 log.info('Creating contrast groups')
--> 365 region_groups = [create_groups(contrast = contrasts[x],
    366                            region_sets_names = region_sets_names,
    367                            max_bg_regions = self.max_bg_regions,
    368                            path_to_genome_fasta = self.path_to_genome_fasta,
    369                            path_to_regions_fasta = os.path.join(self.tmp_dir, contrasts_names[x] +'.fa'),  
    370                            cbust_path = self.cluster_buster_path,
    371                            path_to_motifs = self.path_to_motifs,
    372                            annotation = self.genome_annotation,
    373                            promoter_space = self.promoter_space,
    374                            motifs = dem_db_scores.index.tolist(),
    375                            n_cpu = self.n_cpu,
    376                            **kwargs) for x in range(len(contrasts))]
    378 # Compute p-val and log2FC
    379 if self.n_cpu > len(region_groups):

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:522, in create_groups(contrast, region_sets_names, max_bg_regions, path_to_genome_fasta, path_to_regions_fasta, cbust_path, path_to_motifs, annotation, promoter_space, motifs, n_cpu, **kwargs)
    520 # Nr of promoters in the foreground
    521 fg_pr_overlap = pr.PyRanges(region_names_to_coordinates(foreground)).count_overlaps(annotation)
--> 522 fg_pr = coord_to_region_names(fg_pr_overlap[fg_pr_overlap.NumberOverlaps != 0])
    523 if len(fg_pr) == len(foreground):
    524     nr_pr = max_bg_regions

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/utils.py:18, in coord_to_region_names(coord)
     16 if isinstance(coord, pr.PyRanges):
     17     coord = coord.as_df()
---> 18     return list(coord['Chromosome'].astype(str) + ':' + coord['Start'].astype(str) + '-' + coord['End'].astype(str))

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
   3803 if self.columns.nlevels > 1:
   3804     return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
   3806 if is_integer(indexer):
   3807     indexer = [indexer]

File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
   3800     return self._engine.get_loc(casted_key)
   3801 except KeyError as err:
-> 3802     raise KeyError(key) from err
   3803 except TypeError:
   3804     # If we have a listlike key, _check_indexing_error will raise
   3805     #  InvalidIndexError. Otherwise we fall through and re-raise
   3806     #  the TypeError.
   3807     self._check_indexing_error(key)

KeyError: 'Chromosome'

In the modified code, I provide the explict path to my TSS annotation file (i.e., annot_dem = /path/to/file/), where the file is formatted as such:

Any suggestions would be helpful, thanks!

Goultard59 commented 1 year ago

@SidG13 you need to provide a pandas dataframe to the annot_dem parameters.

import pandas as pd
annot = pd.read_csv("/home/adufour/work/scenic_omics/annot.csv")

tmp_dir = "/home/adufour/work/tmp/"
from scenicplus.wrappers.run_pycistarget import run_pycistarget
run_pycistarget(
    region_sets = region_sets,
    species = 'custom',
    custom_annot = tss,
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    n_cpu = 5,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    )

Here is my R script to generate annotation files from a GTF

library("Repitools")
library(GenomicRanges)
library(GenomicFeatures)
library(ArchR)
library(SuscrofaTxdb.11.102.fixed)

txdb_double <- makeTxDbFromGFF("/home/adufour/work/genome/Sus_scrofa.11.1.102.quant3p_extended.updated.sorted.gtf", format="gtf")

keytypes(txdb_double)

#get transcript from GTF
transcripts_list <- transcripts(txdb_double, columns=c("tx_id", "tx_name","gene_id","tx_type"))

#Format the dataframe properly
df <- annoGR2DF(transcripts_list)
df$gene_id <- unlist(df$gene_id)
df$strand <- as.character(df$strand)
df["strand"][df["strand"] == "+"] <- 1
df["strand"][df["strand"] == "-"] <- -1
df$tx_type <- "protein_coding"
df <- df[,c("chr", "start", "strand", "gene_id", "tx_type")]
colnames(df) <- c("Chromosome", "Start", "Strand", "Gene", "Transcript_type")

head(df)

write.csv(df,"/home/adufour/work/scenic_omics/annot_cistarget.csv", row.names = FALSE)

#Now annotation for scenic plus
transcripts_list_2 <- transcripts(txdb_double, columns=c("tx_id", "tx_name","gene_id","tx_type"))

#keep genes for later
gf_genes <- GenomicFeatures::genes(txdb_double)
gene <- annoGR2DF(gf_genes)

#Get TSS position
tss_df <- annoGR2DF(transcripts_list_2)
tss_df$tss <- 0
tss_df$tss <- ifelse(tss_df$strand == "+", tss_df$start, tss_df$end)

tss_df$gene_id <- unlist(tss_df$gene_id)

gene_start = function(x){
  A = x[3]

  # return product
  return(gene$start[gene$gene_id == A])
}

gene_end = function(x){
  A = x[3]

  # return product
  return(gene$end[gene$gene_id == A])
}

#Add gene start/end
tss_df$Start <- apply(tss_df, 1, gene_start)
tss_df$End <- apply(tss_df, 1, gene_end)

#Format the dataframe properly
tss_df <- tss_df[,c("chr", "Start", "End", "strand", "gene_id", "tss", "tx_type")]
colnames(tss_df) <- c("Chromosome", "Start", "End", "Strand", "Gene", "Transcription_Start_Site", "Transcript_type")

write.csv(tss_df,"/home/adufour/work/scenic_omics/tss_scp.csv", row.names = FALSE)

#May be not the easiest way to get chromsize
genomeAnnotation <- createGenomeAnnotation(SuscrofaTxdb.11.102.fixed, standard = FALSE)

#Format the dataframe properly
genome_df <- annoGR2DF(genomeAnnotation@listData$chromSizes)
genome_df <- genome_df[,c("chr", "start", "end")]
colnames(genome_df) <- c("Chromosome", "Start", "End")

write.csv(genome_df,"/home/adufour/work/scenic_omics/chromsize.csv", row.names = FALSE)

SidG13 commented 1 year ago

Hi @Goultard59, thanks for your response. Maybe it's a very simple misunderstanding, but I'm trying to pass the pandas df containing my TSS information to the run_pyscenic wrapper, but I'm still getting a NameError.

Here's what I'm running:

from scenicplus.wrappers.run_pycistarget import run_pycistarget
import pandas as pd 

tmp_dir = '/home/administrator/Desktop/tmp'
tss = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')

run_pycistarget(
    region_sets = region_sets,
    species = 'custom',
    custom_annot = tss, # also have tried annot_dem = tss
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    n_cpu = 8,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    annotation_version = 'v1',
    )

But the error I'm getting is:

    145     annot, annot_dem = get_species_annotation('ggallus_gene_ensembl')
    146 elif species == 'custom':
--> 147     annot_dem = custom_annot
    148     annot = annot_dem.copy()
    149     # Define promoter space

NameError: name 'custom_annot' is not defined

Thanks!

SeppeDeWinter commented 1 year ago

@Goultard59 and @SidG13

Thanks for the example and thanks for testing.

The errorNameError: name 'custom_annot' is not defined is because custom_annot is not defined as a parameter in the run_pycistarget function.

@Goultard59 I'm now testing your PR with data from a non-model organisms. I did add the parameter to the function. I will push soon if everything is working and close this PR.

Thanks.

Seppe

JoGraesslin commented 1 year ago

Hello, I would like to reopen this discussion, as I run into somewhat similar problems. I am trying to set up scenic plus for zebrafish. I used the Ensembl genome annotation for alignment and followed the workflow until pycistarget:

from scenicplus.wrappers.run_pycistarget import run_pycistarget
run_pycistarget(
    region_sets = region_sets,
    species='custom',
    custom_annot=custom_annotation,
    save_path = os.path.join(work_dir, 'motifs'),
    ctx_db_path = rankings_db,
    dem_db_path = scores_db,
    path_to_motif_annotations = motif_annotation,
    run_without_promoters = True,
    n_cpu = 8,
    _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
    annotation_version="2020"
    )

It runs without an Error message, but the output is missing annotation:

I have created a .tbl file from JASPAR using orthology data bases:

And a custom_annotation file based on the .gtf file looking like this

Do you have an idea what went wrong? I have used the Ensembl annotation all the way, does the function only work with UCSC chromosomes? I have also tried to update my .tbl file in the same way as SidG13 did. In this case, the Direct_annot and Orthology_annot columns are missing completely

Any help would be appreciated, since I don't know anymore what else to try. Thanks!

JoGraesslin commented 1 year ago

I have tried it now as well with the UCSC chromosomes, and I run into the same problem. Please let me know if you need more information.

SeppeDeWinter commented 1 year ago

Hi @JoGraesslin

Sorry for the late reply.

The reason you don't have any annotations is because your motif_annotation .tbl file is not formatted correctly. This code is used to load the file: https://github.com/aertslab/pycistarget/blob/3fde1ce5e16241f00078bee920d8a805ccbf2360/pycistarget/utils.py#L65

It requires having information in the description column (I know it is a bit cumbersome). See for example https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl

Basically the description should contain the word "orthologous" in your case.

We should make this a bit more userfriendly.

I hope this helps.

Feel free to reach out if you still encounter issues.

p.s. as a tip. During troubleshooting you can always use https://github.com/aertslab/pycistarget/blob/3fde1ce5e16241f00078bee920d8a805ccbf2360/pycistarget/utils.py#L65 to try to read the annotation file.

Seppe

aertslab / scenicplus

Allowing custom species for pycistarget #56