Running run_scenicplus() without BioMart connection

sneddonucsf commented 1 year ago

Hello,

I am trying to run run_scenicplus() but unfortunately the HPC that I am using does not connect with outside networks due to security reasons for the BioMart portion. Is there a workaround I can try on the scenic+ side to get around this?

Thank you!

sneddonucsf commented 1 year ago

Hello,

I found the previous issue ([https://github.com/aertslab/scenicplus/issues/48]) and have tried following the suggestion you outline here: https://github.com/aertslab/scenicplus/issues/48#issuecomment-1285838142

I downloaded and modified the gene annotation file from BioMart http://www.ensembl.org/info/data/biomart/index.html resulting in a PyRanges object as following:

+--------------+-----------+-----------+--------------+------------+----------------------------+-------------------+
| Chromosome   | Start     | End       | Strand       | Gene       | Transcription_Start_Site   | Transcript_type   |
| (category)   | (int32)   | (int32)   | (category)   | (object)   | (int64)                    | (object)          |
|--------------+-----------+-----------+--------------+------------+----------------------------+-------------------|
| 1            | 1471765   | 1497848   | +            | ATAD3B     | 1471765                    | protein_coding    |
| 1            | 1471765   | 1497848   | +            | ATAD3B     | 1471784                    | protein_coding    |
| 1            | 3069168   | 3438621   | +            | PRDM16     | 3069168                    | protein_coding    |
| 1            | 3069168   | 3438621   | +            | PRDM16     | 3069197                    | protein_coding    |
| ...          | ...       | ...       | ...          | ...        | ...                        | ...               |
| Y            | 21903618  | 21918042  | -            | RBMY1E     | 21918032                   | protein_coding    |
| Y            | 21903618  | 21918042  | -            | RBMY1E     | 21918032                   | protein_coding    |
| Y            | 24045229  | 24048019  | -            | CDY1B      | 24047969                   | protein_coding    |
| Y            | 24045229  | 24048019  | -            | CDY1B      | 24048019                   | protein_coding    |
+--------------+-----------+-----------+--------------+------------+----------------------------+-------------------+
Stranded PyRanges object has 98,711 rows and 7 columns from 356 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.

For the Chromosome sizes, I downloaded the genome reference from 10x Genomics with

wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz

and used the provided genome.fa.fai file to create the Chromosome PyRanges object:

+--------------+-----------+-----------+
| Chromosome   | Start     | End       |
| (category)   | (int32)   | (int32)   |
|--------------+-----------+-----------|
| GL000008.2   | 0         | 209709    |
| GL000009.2   | 0         | 201709    |
| GL000194.1   | 0         | 191469    |
| GL000195.1   | 0         | 182896    |
| ...          | ...       | ...       |
| chr22        | 0         | 50818468  |
| chrM         | 0         | 16569     |
| chrX         | 0         | 156040895 |
| chrY         | 0         | 57227415  |
+--------------+-----------+-----------+
Unstranded PyRanges object has 194 rows and 3 columns from 194 chromosomes.
For printing, the PyRanges was sorted on Chromosome.

but unfortunately when trying to run:

from scenicplus.enhancer_to_gene import get_search_space
get_search_space(
        scplus_obj,
        species = None,
        assembly = None,
        pr_annot = pr_annot,
        pr_chromsizes = pr_chromsizes,
        upstream = [1000, 150000],
        downstream = [1000, 150000])

I got the following error:

2023-03-03 14:12:38,161 R2G          INFO     Extending promoter annotation to 10 bp upstream and 10 downstream
2023-03-03 14:12:50,467 R2G          INFO     Extending search space to:
                                    150000 bp downstream of the end of the gene.
                                    150000 bp upstream of the start of the gene.
2023-03-03 14:14:13,602 R2G          INFO     Intersecting with regions.
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
Traceback (most recent call last):
  File "scenic+_BioMart.py", line 63, in <module>
    get_search_space(
  File "/wynton/home/sneddon/seandelao1991/scenic_plus/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py", line 399, in get_search_space
    regions_per_gene.End - regions_per_gene.Start).astype(np.int32)
  File "/wynton/home/sneddon/seandelao1991/scenic_plus/lib/python3.8/site-packages/pyranges/pyranges.py", line 269, in __getattr__
    return _getattr(self, name)
  File "/wynton/home/sneddon/seandelao1991/scenic_plus/lib/python3.8/site-packages/pyranges/methods/attr.py", line 67, in _getattr
    raise AttributeError("PyRanges object has no attribute", name)
AttributeError: ('PyRanges object has no attribute', 'End')

both PyRanges objects have the column label End so I'm not sure what might be going on?

SeppeDeWinter commented 1 year ago

Hi @sneddonucsf

Your annotation file does not have proper chromosome names (i.e. they are just numbers (1, 2, 3) instead of ('chr1', 'chr2', 'chr3', ...).

I hope this helps.

Best,

Seppe

sneddonucsf commented 1 year ago

Hi @SeppeDeWinter

That fixed it! Thank you very much. I am running into another issue now, however...

the get_search_space() is working fine now, but

    run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['Annotation'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = '/wynton/home/sneddon/seandelao1991/scenic_proj/input/utoronto_human_tfs_v_1.01.txt',
        save_path = os.path.join(outDir, 'scenicplus'),
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = True,
        export_to_UCSC_file = True,
        n_cpu = 12,
        _temp_dir = os.path.join(tmpDir, 'ray_spill'))

runs fine until it gets to the Binarizing eGRNs AUC where it has been hanging for hours. I've tried this with 500GB of memory, but still no go. The function doesn't break, it is technically still "running" (based on my HPC) it just gets stuck at that part. Looking at the output of the tutorials, I believe this step should take less than 30mins, so I'm not sure what's going on. Any suggestions?

SeppeDeWinter commented 1 year ago

@sneddonucsf

Aha great.

From this message I can see that most of the analysis has completed. The "Binarizing eGRNs AUC" is not strictly necessary, it's only needed to export loom files. You can skip this by setting export_to_loom_file to False. At this point you scplus_obj should contain all of the important results.

Best,

Seppe

sneddonucsf commented 1 year ago

@SeppeDeWinter weirdly enough, with

run_scenicplus(
        scplus_obj = scplus_obj,
        variable = ['Annotation'],
        species = 'hsapiens',
        assembly = 'hg38',
        tf_file = '/wynton/home/sneddon/seandelao1991/scenic_proj/input/utoronto_human_tfs_v_1.01.txt',
        save_path = os.path.join(outDir, 'scenicplus'),
        biomart_host = biomart_host,
        upstream = [1000, 150000],
        downstream = [1000, 150000],
        calculate_TF_eGRN_correlation = True,
        calculate_DEGs_DARs = True,
        export_to_loom_file = False,
        export_to_UCSC_file = False,
        n_cpu = 24,
        _temp_dir = os.path.join(tmpDir, 'ray_spill'))

the function still tries to Binarizing eGRNs AUC and gets stuck. Since I did:

except Exception as e:
    #in case of failure, still save the object
    dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
    raise(e)

is it safe to assume that my scplus_obj object is still saving along the way and that the results that I need for downstream analysis should be in that file?

sneddonucsf commented 1 year ago

Not sure what the issue was, but I followed the step by step tutorial instead of the wrapper and was able to complete the analysis that way. Thanks for all of your help!

aertslab / scenicplus

Running run_scenicplus() without BioMart connection #117