aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
185 stars 29 forks source link

How to run the run_pycistarget function locally? #48

Closed WhaleGe closed 2 years ago

WhaleGe commented 2 years ago

Hello, I found that running run_pycistarget requires networking. How can I run it locally, because the HPC cannot connect to external websites.

run_pycistarget(
...     region_sets = region_sets,
...     species = 'homo_sapiens',
...     save_path = os.path.join(work_dir, 'motifs'),
...     ctx_db_path = rankings_db,
...     dem_db_path = scores_db,
...     path_to_motif_annotations = motif_annotation,
...     run_without_promoters = True,
...     biomart_host = 'http://www.ensembl.org',  #need network
...     n_cpu = 8,
...     _temp_dir = tmp_dir,
...     annotation_version = 'v10nr_clust',
...     )

运行报错 raise HTTPError(http_error_msg, response=self) requests.exceptions.HTTPError: 403 Client Error: Forbidden for url

SeppeDeWinter commented 2 years ago

Hi @WhaleGe

This is something that was not considered, so thank you for posting this bug report.

Indeed, how the wrapper function is written now it won't work without networking, this will be changed in the next release!

However, you can still run pycistarget but it will need some more manual work. The biomart host is only needed to balance the amount of promoter regions in foreground and background sets for running DEM and for removing promoter regions.

For now I would suggest to run pycistarget without the wrapper funciton, you can basically follow this tutorial for that: https://pycistarget.readthedocs.io/en/latest/pycistarget_scenic%2B_human_brain.html

The pseudocode to do something like this would look like:


menr = {}
for key in region_sets.keys():
   regions = region_sets[key]
   menr['CTX_'+key+'_All'] = run_cistarget(ctx_db = ctx_db,
                                   region_sets = regions,
                                   specie = species,
                                   auc_threshold = ctx_auc_threshold,
                                   nes_threshold = ctx_nes_threshold,
                                   rank_threshold = ctx_rank_threshold,
                                   annotation = annotation,
                                   motif_similarity_fdr = motif_similarity_fdr,
                                   path_to_motif_annotations = path_to_motif_annotations,
                                   n_cpu = n_cpu,
                                   _temp_dir= _temp_dir,
                                   annotation_version = annotation_version)
   menr['DEM_'+key+'_All'] = DEM(dem_db = dem_db,
                               region_sets = regions,
                               log2fc_thr = dem_log2fc_thr,
                               motif_hit_thr = dem_motif_hit_thr,
                               max_bg_regions = dem_max_bg_regions,
                               specie = species,
                               promoter_space = promoter_space,
                               motif_annotation =   annotation,
                               motif_similarity_fdr = motif_similarity_fdr, 
                               path_to_motif_annotations = path_to_motif_annotations,
                               n_cpu = n_cpu,
                               annotation_version = annotation_version,
                               tmp_dir = save_path,
                               _temp_dir= _temp_dir)

This will run the analysis without taking into account promoters.

If you still would want to take into account promoters you would have to download the genome annotation manually (from somewhere where you do have network access). And follow the template of this function: https://github.com/aertslab/scenicplus/blob/main/src/scenicplus/wrappers/run_pycistarget.py with you annotation.

Sorry for the inconvenience at the moment. Like I said, we will change this in the next release.

Best,

Seppe

WhaleGe commented 2 years ago

Hello, I ran according to the pseudo code you gave, but it reported an error again, I failed to use the source code to eliminate this bug

from pycistarget.motif_enrichment_cistarget import *
from pycistarget.motif_enrichment_dem import *
ctx_db_path = '/1.database/ScenicPlus/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather'
dem_db_path='/1.database/ScenicPlus/hg38_screen_v10_clust.regions_vs_motifs.scores.feather'

regions = region_sets[key]
ctx_db = cisTargetDatabase(ctx_db_path, regions) 
dem_db = DEMDatabase(dem_db_path, regions) 

menr = {}
for key in region_sets.keys():
    regions = region_sets[key]
    menr['CTX_'+key+'_All'] = run_cistarget(ctx_db = ctx_db,
        region_sets = regions,
        specie = 'homo_sapiens',
        auc_threshold = 0.005,
        nes_threshold = 3.0,
        rank_threshold = 0.05,
        annotation = ['Direct_annot', 'Orthology_annot'],
        motif_similarity_fdr = 0.000001,
        path_to_motif_annotations = motif_annotation,
        n_cpu = 1,
        _temp_dir= tmp_dir,
        annotation_version = 'v10nr_clust')
    menr['DEM_'+key+'_All'] = DEM(dem_db = dem_db,
    region_sets = regions,
    log2fc_thr = 0.5,
    motif_hit_thr = 3.0,
    max_bg_regions = 500,
    specie = 'homo_sapiens',
    promoter_space = 500,
    motif_annotation =  ['Direct_annot', 'Orthology_annot'],
    motif_similarity_fdr = 0.000001, 
    path_to_motif_annotations = motif_annotation,
    n_cpu = 1,
    annotation_version = 'v10nr_clust',
    tmp_dir = "/ScenicPlus_tutorial/pbmc_tutorial/tmp",
    _temp_dir= tmp_dir)

The error is as follows:

Traceback (most recent call last): File "", line 3, in File "/jdfssz1/ST_SUPERCELLS/P21Z10200N0134/USER/huangwaidong/2.software/mn3/envs/scenicplus/lib/python3.8/site-packages/pycistarget/motif_enrichment_cistarget.py", line 543, in run_cistarget ctx_dict = [ctx_internal(ctx_db = ctx_db, File "/jdfssz1/ST_SUPERCELLS/P21Z10200N0134/USER/huangwaidong/2.software/mn3/envs/scenicplus/lib/python3.8/site-packages/pycistarget/motif_enrichment_cistarget.py", line 543, in ctx_dict = [ctx_internal(ctx_db = ctx_db, File "/jdfssz1/ST_SUPERCELLS/P21Z10200N0134/USER/huangwaidong/2.software/mn3/envs/scenicplus/lib/python3.8/site-packages/pycistarget/motif_enrichment_cistarget.py", line 707, in ctx_internal ctx_result.run_ctx(ctx_db) File "/jdfssz1/ST_SUPERCELLS/P21Z10200N0134/USER/huangwaidong/2.software/mn3/envs/scenicplus/lib/python3.8/site-packages/pycistarget/motif_enrichment_cistarget.py", line 299, in run_ctx self.regions_to_db = ctx_db.regions_to_db[self.name] if type(ctx_db.regions_to_db) == dict else ctx_db.regions_to_db.loc[set(coord_to_region_names(self.region_set)) & set(ctx_db.regions_to_db['Target'])] KeyError: 'Topic1'

Could you give me some suggestions? Thank you very much

cbravo93 commented 2 years ago

Hi @WhaleGe!

Can you show the keys of regions? I think in this case is a format issue.

C

WhaleGe commented 2 years ago

Is this?

for key in region_sets.keys(): ... print(f'{key}: {region_sets[key].keys()}') ... topics_otsu: dict_keys(['Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12', 'Topic13', 'Topic14', 'Topic15', 'Topic16']) topics_top_3: dict_keys(['Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12', 'Topic13', 'Topic14', 'Topic15', 'Topic16']) DARs: dict_keys(['B_cells_1', 'B_cells_2', 'CD14+_Monocytes', 'CD4_T_cells', 'CD8_T_cells', 'Dendritic_cells', 'FCGR3A+_Monocytes', 'NK_cells'])

cbravo93 commented 2 years ago

I meant here:

regions = region_sets[key]
ctx_db = cisTargetDatabase(ctx_db_path, regions) 
dem_db = DEMDatabase(dem_db_path, regions) 

How does regions look like here? What key are you using?

cbravo93 commented 2 years ago

I would update the code to this:

from pycistarget.motif_enrichment_cistarget import *
from pycistarget.motif_enrichment_dem import *
ctx_db_path = '/1.database/ScenicPlus/hg38_screen_v10_clust.regions_vs_motifs.rankings.feather'
dem_db_path='/1.database/ScenicPlus/hg38_screen_v10_clust.regions_vs_motifs.scores.feather'

menr = {}
for key in region_sets.keys():
    regions = region_sets[key]
    ctx_db = cisTargetDatabase(ctx_db_path, regions) 
    dem_db = DEMDatabase(dem_db_path, regions) 
    menr['CTX_'+key+'_All'] = run_cistarget(ctx_db = ctx_db,
        region_sets = regions,
        specie = 'homo_sapiens',
        auc_threshold = 0.005,
        nes_threshold = 3.0,
        rank_threshold = 0.05,
        annotation = ['Direct_annot', 'Orthology_annot'],
        motif_similarity_fdr = 0.000001,
        path_to_motif_annotations = motif_annotation,
        n_cpu = 1,
        _temp_dir= tmp_dir,
        annotation_version = 'v10nr_clust')
    menr['DEM_'+key+'_All'] = DEM(dem_db = dem_db,
    region_sets = regions,
    log2fc_thr = 0.5,
    motif_hit_thr = 3.0,
    max_bg_regions = 500,
    specie = 'homo_sapiens',
    promoter_space = 500,
    motif_annotation =  ['Direct_annot', 'Orthology_annot'],
    motif_similarity_fdr = 0.000001, 
    path_to_motif_annotations = motif_annotation,
    n_cpu = 1,
    annotation_version = 'v10nr_clust',
    tmp_dir = "/ScenicPlus_tutorial/pbmc_tutorial/tmp",
    _temp_dir= tmp_dir)
WhaleGe commented 2 years ago

Thanks a lot, no more mistakes here.

image

WhaleGe commented 2 years ago

Sorry, I found that running run_scenicplus also requires networking. Does this function have to be connected to the Internet? Can it be bypassed? Also, when will the next version be released?

cbravo93 commented 2 years ago

Hi @WhaleGe !

Im not sure what you are referring to. You can pass extra parameters from the function (**kwargs) to ray.init(): https://docs.ray.io/en/latest/ray-core/package-ref.html . I think setting address='local' could solve your issue? You can check further details in ray documentation (maybe open an issue there), this is what we use for parallelization.

SeppeDeWinter commented 2 years ago

hi @WhaleGe

I think @cbravo93 misunderstood you're question.

Indeed the run_scenicplus function requires networking by default, this to calculate the search space.

You can circumvent this however like this:

  1. download and load chromosome sizes and gene annotation as pyranges objects.

The gene annotation should look like this:

In [1]: pr_annot
Out [2]: 
+--------------+-----------+-----------+--------------+------------+----------------------------+-------------------+
| Chromosome   | Start     | End       | Strand       | Gene       | Transcription_Start_Site   | Transcript_type   |
| (category)   | (int32)   | (int32)   | (category)   | (object)   | (int64)                    | (object)          |
|--------------+-----------+-----------+--------------+------------+----------------------------+-------------------|
| chr1         | 1471765   | 1497848   | +            | ATAD3B     | 1471765                    | protein_coding    |
| chr1         | 1471765   | 1497848   | +            | ATAD3B     | 1471784                    | protein_coding    |
| chr1         | 3069168   | 3438621   | +            | PRDM16     | 3069168                    | protein_coding    |
| chr1         | 3069168   | 3438621   | +            | PRDM16     | 3069197                    | protein_coding    |
| ...          | ...       | ...       | ...          | ...        | ...                        | ...               |
| chrY         | 6865918   | 6911752   | -            | AMELY      | 6872608                    | protein_coding    |
| chrY         | 21903618  | 21918042  | -            | RBMY1E     | 21918032                   | protein_coding    |
| chrY         | 24045229  | 24048019  | -            | CDY1B      | 24047969                   | protein_coding    |
| chrY         | 24045229  | 24048019  | -            | CDY1B      | 24048019                   | protein_coding    |
+--------------+-----------+-----------+--------------+------------+----------------------------+-------------------+
Stranded PyRanges object has 85,990 rows and 7 columns from 330 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

Containing only information about protein coding genes. With Start end End representing resp. the start and the end of the gene. You can download this information locally via biomart http://www.ensembl.org/info/data/biomart/index.html.

The chromosome sizes should look like this:

In [3]: pr_chromsizes
Out [4]: 
+------------------------+-----------+-----------+
| Chromosome             | Start     | End       |
| (category)             | (int32)   | (int32)   |
|------------------------+-----------+-----------|
| chr1                   | 0         | 248956422 |
| chr1_GL383518v1_alt    | 0         | 182439    |
| chr1_GL383519v1_alt    | 0         | 110268    |
| chr1_GL383520v2_alt    | 0         | 366580    |
| ...                    | ...       | ...       |
| chrX_KI270881v1_alt    | 0         | 144206    |
| chrX_KI270913v1_alt    | 0         | 274009    |
| chrY                   | 0         | 57227415  |
| chrY_KI270740v1_random | 0         | 37240     |
+------------------------+-----------+-----------+
Unstranded PyRanges object has 455 rows and 3 columns from 455 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

With Start always equal to 0 and End equal to the length of the chromosome. You can get this information from the fasta index file from your genome (.fa.fai file, see http://www.htslib.org/doc/faidx.html).

  1. Calculate search space for each gene, make sure to make use of the latest version of scenicplus, I encountered a bug which is now fixed https://github.com/aertslab/scenicplus/commit/2c0bafdb7b655bb089660f849eb63cfbc39d0828.

from scenicplus.enhancer_to_gene import get_search_space
get_search_space(
        scplus_obj,
        species = None,
        assembly = None,
        pr_annot = pr_annot,
        pr_chromsizes = pr_chromsizes,
        upstream = [1000, 150000],
        downstream = [1000, 150000])
  1. run the scenicplus wrapper, given that the search space is already calculated and saved in scplus_obj.uns['search_space'] the calculation will be skipped in the wrapper function and no networking should be required anymore.

from scenicplus.wrappers.run_scenicplus import run_scenicplus
try:
    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['GEX_celltype'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',
        save_path = os.path.join(work_dir, 'scenicplus'),
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        path_bedToBigBed = 'pbmc_tutorial',
        n_cpu = 12,
        _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
except Exception as e:
    #in case of failure, still save the object
    dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
    raise(e)

Hope this helps,

Feel free to reopen the issue if not.

Best,

Seppe

WhaleGe commented 2 years ago

Thanks a lot. I bypassed this step of needing network and followed yours. But again it produces new error. I read it's (#30) but I didn't get it. I am using the PBMC data from the first tutorial.

Error output

Traceback (most recent call last):
  File "<stdin>", line 22, in <module>
  File "<stdin>", line 2, in <module>
  File "/hwfssz1/ST_SUPERCELLS/P21Z10200N0134/USER/huangwaidong/2.software/scenicplus/src/scenicplus/wrappers/run_scenicplus.py", line 128, in run_scenicplus
    merge_cistromes(scplus_obj)
  File "/hwfssz1/ST_SUPERCELLS/P21Z10200N0134/USER/huangwaidong/2.software/scenicplus/src/scenicplus/cistromes.py", line 86, in merge_cistromes
    raise ValueError("No cistromes found! Make sure that the motif enrichment results look good!")
ValueError: No cistromes found! Make sure that the motif enrichment results look good!

Since I'm running in the command line, I can't see the enriched results image

Could you offer some advice?

SeppeDeWinter commented 2 years ago

Hi

You can generate a html file like this, which you can open in any webbrowser:


with open('output.html', 'w') as f:
   f.write(menr['DEM_topics_otsu_All']).DEM_results('Topic8').data)

Could you please post further questions in the discussion tab instead of commenting to this issue?

Thanks,

Best

Seppe