aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
165 stars 27 forks source link

Unable to run run_scenicplus with GRCz11: get_search_space #271

Open DmitriiSeverinov opened 6 months ago

DmitriiSeverinov commented 6 months ago

Hi all,

Thank you very much for such a great tool! I faced an issue when I was trying to run run_scenicplus:


>>> try:
...     run_scenicplus(
...         scplus_obj = scplus_obj,
...         variable = ['GEX_celltype'],
...         species = 'drerio',
...         assembly = 'GRCz11',
...         tf_file = os.path.join(work_dir, 'results/zf_retina/Danio_rerio.txt'),
...         save_path = os.path.join(work_dir, 'results/zf_retina/scenicplus'),
...         biomart_host = biomart_host,
...         upstream = [1000, 150000],
...         downstream = [1000, 150000],
...         calculate_TF_eGRN_correlation = True,
...         calculate_DEGs_DARs = True,
...         export_to_loom_file = True,
...         export_to_UCSC_file = True,
...         path_bedToBigBed = 'zf_retina',
...         n_cpu = 16,
...         _temp_dir = os.path.join(tmp_dir, 'ray_spill'))
... except Exception as e:
...     #in case of failure, still save the object
...     dill.dump(scplus_obj, open(os.path.join(work_dir, 'results/zf_retina/scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
...     raise(e)
... 
2023-12-15 17:28:37,337 SCENIC+_wrapper INFO     Created folder : /Users/dmitrii/fish/results/zf_retina/scenicplus
2023-12-15 17:28:37,337 SCENIC+_wrapper INFO     Merging cistromes
2023-12-15 17:28:59,585 SCENIC+_wrapper INFO     Getting search space
Traceback (most recent call last):
  File "<stdin>", line 22, in <module>
  File "<stdin>", line 2, in <module>
  File "/Users/dmitrii/fish/scenicplus/src/scenicplus/wrappers/run_scenicplus.py", line 134, in run_scenicplus
    get_search_space(scplus_obj,
  File "/Users/dmitrii/fish/scenicplus/src/scenicplus/enhancer_to_gene.py", line 190, in get_search_space
    if not (ASM_SYNONYMS[assembly] in dataset_display_name or assembly in dataset_display_name):
KeyError: 'GRCz11'

Biomart host is: "http://sep2019.archive.ensembl.org/"

I've checked the line in enhancer_to_gene.py (lines from 179 to 200 - get_search_space function):

# GET GENE ANNOTATION AND CHROMSIZES
if species is not None and assembly is not None:
    # Download gene annotation and chromsizes
    # 1. Download gene annotation from biomart
    import pybiomart as pbm
    dataset_name = '{}_gene_ensembl'.format(species)
    server = pbm.Server(host=biomart_host, use_cache=False)
    mart = server['ENSEMBL_MART_ENSEMBL']
    # check if biomart host is correct
    dataset_display_name = getattr(
        mart.datasets[dataset_name], 'display_name')
    if not (ASM_SYNONYMS[assembly] in dataset_display_name or assembly in dataset_display_name):
        print(
            f'\u001b[31m!! The provided assembly {assembly} does not match the biomart host ({dataset_display_name}).\n Please check biomart host parameter\u001b[0m\nFor more info see: https://m.ensembl.org/info/website/archives/assembly.html')
    # check wether dataset can be accessed.
    if dataset_name not in mart.list_datasets()['name'].to_numpy():
        raise Exception(
            '{} could not be found as a dataset in biomart. Check species name or consider manually providing gene annotations!')
    else:
        log.info(
            "Downloading gene annotation from biomart dataset: {}".format(dataset_name))
        dataset = mart[dataset_name]

And when I try to run it line by line - everything is fine:

>>> import pybiomart as pbm
>>> dataset_name = '{}_gene_ensembl'.format('drerio')
>>> server = pbm.Server(host="http://sep2019.archive.ensembl.org/", use_cache=False)
>>> dataset_name
'drerio_gene_ensembl'
>>> mart = server['ENSEMBL_MART_ENSEMBL']
>>> dataset_display_name = getattr(
...             mart.datasets[dataset_name], 'display_name')
>>> dataset_display_name
'Zebrafish genes (GRCz11)'
>>> 'GRCz11' in dataset_display_name
True
>>> (not 'GRCz11' in dataset_display_name)
False
>>> dataset_name not in mart.list_datasets()['name'].to_numpy()
False
>>> dataset = mart[dataset_name]
>>> dataset
<biomart.Dataset name='drerio_gene_ensembl', display_name='Zebrafish genes (GRCz11)'>

Then I tried to run the get_search_space function:

>>> get_search_space(scplus_obj,
...                      biomart_host = biomart_host,
...                      species = 'drerio',
...                      assembly = 'GRCz11', 
...                      upstream = [1000, 150000],
...                      downstream = [1000, 150000])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dmitrii/fish/scenicplus/src/scenicplus/enhancer_to_gene.py", line 190, in get_search_space
    if not (ASM_SYNONYMS[assembly] in dataset_display_name or assembly in dataset_display_name):
KeyError: 'GRCz11'

The error refers me to lines that I already checked, the lines that I managed to run without issues... Could you please help? If you have any questions, I am happy to answer

>>> print(scenicplus.__version__)
1.0.1.dev4+ge4bdd9f
SeppeDeWinter commented 6 months ago

Hi @DmitriiSeverinov

Please see this section in the tutorial: https://scenicplus.readthedocs.io/en/development/pbmc_multiome_tutorial.html#Notes-on-the-individual-function-calls-in-the-wrapper-function. (after Notes-on-the-individual-function-calls-in-the-wrapper-function) on how to run the wrapper function for species different from human, mouse or fly.

All the best,

Seppe