Closed Goultard59 closed 1 year ago
Hi
This looks like a nice PR! Thank you, I was thinking of making this a bit of code a bit cleaner as well so nice to see some community effort.
Could you provide a snippet of code on how you used it with custom pig annotation? (might be nice to add to the tutorials as well).
I will give the code a try and if it works I'll merge it in the develop branch!
Best,
Seppe
Hi, thank you both for making this tool more user-friendly. I used the modified run_pycistarget.py
from @Goultard59 and from my attempt, it works well up until the 'Creating contrast groups'
step in the wrapper function where I get a KeyError.
KeyError: 'Chromosome'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In [18], line 6
1 from scenicplus.wrappers.run_pycistarget import run_pycistarget
3 #custom_annot = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')
4 #custom_annot
----> 6 run_pycistarget(
7 region_sets = region_sets,
8 species = 'custom',
9 save_path = os.path.join(work_dir, 'motifs'),
10 ctx_db_path = rankings_db,
11 dem_db_path = scores_db,
12 path_to_motif_annotations = motif_annotation,
13 run_without_promoters = True,
14 n_cpu = 8,
15 _temp_dir = os.path.join(tmp_dir, 'ray_spill'),
16 annotation_version = 'v1',
17 )
File ~/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/scenicplus/src/scenicplus/wrappers/run_pycistarget.py:243, in run_pycistarget(region_sets, species, save_path, save_partial, ctx_db_path, dem_db_path, run_without_promoters, biomart_host, promoter_space, ctx_auc_threshold, ctx_nes_threshold, ctx_rank_threshold, dem_log2fc_thr, dem_motif_hit_thr, dem_max_bg_regions, annotation, motif_similarity_fdr, path_to_motif_annotations, annotation_version, n_cpu, _temp_dir, exclude_motifs, exclude_collection, **kwargs)
241 for col in exclude_collection:
242 dem_db.db_scores = dem_db.db_scores[~dem_db.db_scores.index.str.contains(col)]
--> 243 menr['DEM_'+key+'_All'] = DEM(dem_db = dem_db,
244 region_sets = regions,
245 log2fc_thr = dem_log2fc_thr,
246 motif_hit_thr = dem_motif_hit_thr,
247 max_bg_regions = dem_max_bg_regions,
248 specie = species,
249 genome_annotation = annot_dem,
250 promoter_space = promoter_space,
251 motif_annotation = annotation,
252 motif_similarity_fdr = motif_similarity_fdr,
253 path_to_motif_annotations = path_to_motif_annotations,
254 n_cpu = n_cpu,
255 annotation_version = annotation_version,
256 tmp_dir = save_path,
257 _temp_dir= _temp_dir,
258 **kwargs)
259 out_folder = os.path.join(save_path,'DEM_'+key+'_All')
260 check_folder = os.path.isdir(out_folder)
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:319, in DEM.__init__(self, dem_db, region_sets, specie, subset_motifs, contrasts, name, max_bg_regions, adjpval_thr, log2fc_thr, mean_fg_thr, motif_hit_thr, n_cpu, fraction_overlap, cluster_buster_path, path_to_genome_fasta, path_to_motifs, genome_annotation, promoter_space, path_to_motif_annotations, annotation_version, motif_annotation, motif_similarity_fdr, orthologous_identity_threshold, tmp_dir, **kwargs)
317 self.cistromes = None
318 if dem_db is not None:
--> 319 self.run(dem_db.db_scores, **kwargs)
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:365, in DEM.run(self, dem_db_scores, **kwargs)
363 # Get region groups
364 log.info('Creating contrast groups')
--> 365 region_groups = [create_groups(contrast = contrasts[x],
366 region_sets_names = region_sets_names,
367 max_bg_regions = self.max_bg_regions,
368 path_to_genome_fasta = self.path_to_genome_fasta,
369 path_to_regions_fasta = os.path.join(self.tmp_dir, contrasts_names[x] +'.fa'),
370 cbust_path = self.cluster_buster_path,
371 path_to_motifs = self.path_to_motifs,
372 annotation = self.genome_annotation,
373 promoter_space = self.promoter_space,
374 motifs = dem_db_scores.index.tolist(),
375 n_cpu = self.n_cpu,
376 **kwargs) for x in range(len(contrasts))]
378 # Compute p-val and log2FC
379 if self.n_cpu > len(region_groups):
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:365, in <listcomp>(.0)
363 # Get region groups
364 log.info('Creating contrast groups')
--> 365 region_groups = [create_groups(contrast = contrasts[x],
366 region_sets_names = region_sets_names,
367 max_bg_regions = self.max_bg_regions,
368 path_to_genome_fasta = self.path_to_genome_fasta,
369 path_to_regions_fasta = os.path.join(self.tmp_dir, contrasts_names[x] +'.fa'),
370 cbust_path = self.cluster_buster_path,
371 path_to_motifs = self.path_to_motifs,
372 annotation = self.genome_annotation,
373 promoter_space = self.promoter_space,
374 motifs = dem_db_scores.index.tolist(),
375 n_cpu = self.n_cpu,
376 **kwargs) for x in range(len(contrasts))]
378 # Compute p-val and log2FC
379 if self.n_cpu > len(region_groups):
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/motif_enrichment_dem.py:522, in create_groups(contrast, region_sets_names, max_bg_regions, path_to_genome_fasta, path_to_regions_fasta, cbust_path, path_to_motifs, annotation, promoter_space, motifs, n_cpu, **kwargs)
520 # Nr of promoters in the foreground
521 fg_pr_overlap = pr.PyRanges(region_names_to_coordinates(foreground)).count_overlaps(annotation)
--> 522 fg_pr = coord_to_region_names(fg_pr_overlap[fg_pr_overlap.NumberOverlaps != 0])
523 if len(fg_pr) == len(foreground):
524 nr_pr = max_bg_regions
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pycistarget/utils.py:18, in coord_to_region_names(coord)
16 if isinstance(coord, pr.PyRanges):
17 coord = coord.as_df()
---> 18 return list(coord['Chromosome'].astype(str) + ':' + coord['Start'].astype(str) + '-' + coord['End'].astype(str))
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/frame.py:3805, in DataFrame.__getitem__(self, key)
3803 if self.columns.nlevels > 1:
3804 return self._getitem_multilevel(key)
-> 3805 indexer = self.columns.get_loc(key)
3806 if is_integer(indexer):
3807 indexer = [indexer]
File ~/anaconda2/envs/py38/lib/python3.8/site-packages/pandas/core/indexes/base.py:3802, in Index.get_loc(self, key, method, tolerance)
3800 return self._engine.get_loc(casted_key)
3801 except KeyError as err:
-> 3802 raise KeyError(key) from err
3803 except TypeError:
3804 # If we have a listlike key, _check_indexing_error will raise
3805 # InvalidIndexError. Otherwise we fall through and re-raise
3806 # the TypeError.
3807 self._check_indexing_error(key)
KeyError: 'Chromosome'
In the modified code, I provide the explict path to my TSS annotation file (i.e., annot_dem = /path/to/file/), where the file is formatted as such:
Any suggestions would be helpful, thanks!
@SidG13 you need to provide a pandas dataframe to the annot_dem parameters.
import pandas as pd
annot = pd.read_csv("/home/adufour/work/scenic_omics/annot.csv")
tmp_dir = "/home/adufour/work/tmp/"
from scenicplus.wrappers.run_pycistarget import run_pycistarget
run_pycistarget(
region_sets = region_sets,
species = 'custom',
custom_annot = tss,
save_path = os.path.join(work_dir, 'motifs'),
ctx_db_path = rankings_db,
dem_db_path = scores_db,
path_to_motif_annotations = motif_annotation,
run_without_promoters = True,
n_cpu = 5,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'),
)
Here is my R script to generate annotation files from a GTF
library("Repitools")
library(GenomicRanges)
library(GenomicFeatures)
library(ArchR)
library(SuscrofaTxdb.11.102.fixed)
txdb_double <- makeTxDbFromGFF("/home/adufour/work/genome/Sus_scrofa.11.1.102.quant3p_extended.updated.sorted.gtf", format="gtf")
keytypes(txdb_double)
#get transcript from GTF
transcripts_list <- transcripts(txdb_double, columns=c("tx_id", "tx_name","gene_id","tx_type"))
#Format the dataframe properly
df <- annoGR2DF(transcripts_list)
df$gene_id <- unlist(df$gene_id)
df$strand <- as.character(df$strand)
df["strand"][df["strand"] == "+"] <- 1
df["strand"][df["strand"] == "-"] <- -1
df$tx_type <- "protein_coding"
df <- df[,c("chr", "start", "strand", "gene_id", "tx_type")]
colnames(df) <- c("Chromosome", "Start", "Strand", "Gene", "Transcript_type")
head(df)
write.csv(df,"/home/adufour/work/scenic_omics/annot_cistarget.csv", row.names = FALSE)
#Now annotation for scenic plus
transcripts_list_2 <- transcripts(txdb_double, columns=c("tx_id", "tx_name","gene_id","tx_type"))
#keep genes for later
gf_genes <- GenomicFeatures::genes(txdb_double)
gene <- annoGR2DF(gf_genes)
#Get TSS position
tss_df <- annoGR2DF(transcripts_list_2)
tss_df$tss <- 0
tss_df$tss <- ifelse(tss_df$strand == "+", tss_df$start, tss_df$end)
tss_df$gene_id <- unlist(tss_df$gene_id)
gene_start = function(x){
A = x[3]
# return product
return(gene$start[gene$gene_id == A])
}
gene_end = function(x){
A = x[3]
# return product
return(gene$end[gene$gene_id == A])
}
#Add gene start/end
tss_df$Start <- apply(tss_df, 1, gene_start)
tss_df$End <- apply(tss_df, 1, gene_end)
#Format the dataframe properly
tss_df <- tss_df[,c("chr", "Start", "End", "strand", "gene_id", "tss", "tx_type")]
colnames(tss_df) <- c("Chromosome", "Start", "End", "Strand", "Gene", "Transcription_Start_Site", "Transcript_type")
write.csv(tss_df,"/home/adufour/work/scenic_omics/tss_scp.csv", row.names = FALSE)
#May be not the easiest way to get chromsize
genomeAnnotation <- createGenomeAnnotation(SuscrofaTxdb.11.102.fixed, standard = FALSE)
#Format the dataframe properly
genome_df <- annoGR2DF(genomeAnnotation@listData$chromSizes)
genome_df <- genome_df[,c("chr", "start", "end")]
colnames(genome_df) <- c("Chromosome", "Start", "End")
write.csv(genome_df,"/home/adufour/work/scenic_omics/chromsize.csv", row.names = FALSE)
Hi @Goultard59, thanks for your response. Maybe it's a very simple misunderstanding, but I'm trying to pass the pandas df containing my TSS information to the run_pyscenic wrapper, but I'm still getting a NameError
.
Here's what I'm running:
from scenicplus.wrappers.run_pycistarget import run_pycistarget
import pandas as pd
tmp_dir = '/home/administrator/Desktop/tmp'
tss = pd.read_csv('/home/administrator/Desktop/ExtraDrive1/Sid/sc_multiome/SCENICplus_multiome/data_manipulation/TSS_annot.txt', header=0, sep='\t')
run_pycistarget(
region_sets = region_sets,
species = 'custom',
custom_annot = tss, # also have tried annot_dem = tss
save_path = os.path.join(work_dir, 'motifs'),
ctx_db_path = rankings_db,
dem_db_path = scores_db,
path_to_motif_annotations = motif_annotation,
run_without_promoters = True,
n_cpu = 8,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'),
annotation_version = 'v1',
)
But the error I'm getting is:
145 annot, annot_dem = get_species_annotation('ggallus_gene_ensembl')
146 elif species == 'custom':
--> 147 annot_dem = custom_annot
148 annot = annot_dem.copy()
149 # Define promoter space
NameError: name 'custom_annot' is not defined
Thanks!
@Goultard59 and @SidG13
Thanks for the example and thanks for testing.
The errorNameError: name 'custom_annot' is not defined
is because custom_annot
is not defined as a parameter in the run_pycistarget
function.
@Goultard59 I'm now testing your PR with data from a non-model organisms. I did add the parameter to the function. I will push soon if everything is working and close this PR.
Thanks.
Seppe
Hello, I would like to reopen this discussion, as I run into somewhat similar problems. I am trying to set up scenic plus for zebrafish. I used the Ensembl genome annotation for alignment and followed the workflow until pycistarget:
from scenicplus.wrappers.run_pycistarget import run_pycistarget
run_pycistarget(
region_sets = region_sets,
species='custom',
custom_annot=custom_annotation,
save_path = os.path.join(work_dir, 'motifs'),
ctx_db_path = rankings_db,
dem_db_path = scores_db,
path_to_motif_annotations = motif_annotation,
run_without_promoters = True,
n_cpu = 8,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'),
annotation_version="2020"
)
It runs without an Error message, but the output is missing annotation:
I have created a .tbl file from JASPAR using orthology data bases:
And a custom_annotation file based on the .gtf file looking like this
Do you have an idea what went wrong? I have used the Ensembl annotation all the way, does the function only work with UCSC chromosomes? I have also tried to update my .tbl file in the same way as SidG13 did. In this case, the Direct_annot and Orthology_annot columns are missing completely
Any help would be appreciated, since I don't know anymore what else to try. Thanks!
I have tried it now as well with the UCSC chromosomes, and I run into the same problem. Please let me know if you need more information.
Hi @JoGraesslin
Sorry for the late reply.
The reason you don't have any annotations is because your motif_annotation .tbl file is not formatted correctly. This code is used to load the file: https://github.com/aertslab/pycistarget/blob/3fde1ce5e16241f00078bee920d8a805ccbf2360/pycistarget/utils.py#L65
It requires having information in the description
column (I know it is a bit cumbersome).
See for example https://resources.aertslab.org/cistarget/motif2tf/motifs-v10nr_clust-nr.hgnc-m0.001-o0.0.tbl
Basically the description should contain the word "orthologous" in your case.
We should make this a bit more userfriendly.
I hope this helps.
Feel free to reach out if you still encounter issues.
p.s. as a tip. During troubleshooting you can always use https://github.com/aertslab/pycistarget/blob/3fde1ce5e16241f00078bee920d8a805ccbf2360/pycistarget/utils.py#L65 to try to read the annotation file.
Seppe
Tested with custom pig annotation, pycistarget working well.