sneddonucsf closed this issue 1 year ago
Hello,
I found the previous issue (https://github.com/aertslab/scenicplus/issues/48) and tried following the suggestion you outlined here: https://github.com/aertslab/scenicplus/issues/48#issuecomment-1285838142
I downloaded and modified the gene annotation file from BioMart (http://www.ensembl.org/info/data/biomart/index.html), resulting in the following PyRanges object:
+--------------+-----------+-----------+--------------+------------+----------------------------+-------------------+
| Chromosome | Start | End | Strand | Gene | Transcription_Start_Site | Transcript_type |
| (category) | (int32) | (int32) | (category) | (object) | (int64) | (object) |
|--------------+-----------+-----------+--------------+------------+----------------------------+-------------------|
| 1 | 1471765 | 1497848 | + | ATAD3B | 1471765 | protein_coding |
| 1 | 1471765 | 1497848 | + | ATAD3B | 1471784 | protein_coding |
| 1 | 3069168 | 3438621 | + | PRDM16 | 3069168 | protein_coding |
| 1 | 3069168 | 3438621 | + | PRDM16 | 3069197 | protein_coding |
| ... | ... | ... | ... | ... | ... | ... |
| Y | 21903618 | 21918042 | - | RBMY1E | 21918032 | protein_coding |
| Y | 21903618 | 21918042 | - | RBMY1E | 21918032 | protein_coding |
| Y | 24045229 | 24048019 | - | CDY1B | 24047969 | protein_coding |
| Y | 24045229 | 24048019 | - | CDY1B | 24048019 | protein_coding |
+--------------+-----------+-----------+--------------+------------+----------------------------+-------------------+
Stranded PyRanges object has 98,711 rows and 7 columns from 356 chromosomes.
For printing, the PyRanges was sorted on Chromosome and Strand.
For the chromosome sizes, I downloaded the genome reference from 10x Genomics with
wget https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz
and used the provided genome.fa.fai file to create the chromosome sizes PyRanges object:
+--------------+-----------+-----------+
| Chromosome | Start | End |
| (category) | (int32) | (int32) |
|--------------+-----------+-----------|
| GL000008.2 | 0 | 209709 |
| GL000009.2 | 0 | 201709 |
| GL000194.1 | 0 | 191469 |
| GL000195.1 | 0 | 182896 |
| ... | ... | ... |
| chr22 | 0 | 50818468 |
| chrM | 0 | 16569 |
| chrX | 0 | 156040895 |
| chrY | 0 | 57227415 |
+--------------+-----------+-----------+
Unstranded PyRanges object has 194 rows and 3 columns from 194 chromosomes.
For printing, the PyRanges was sorted on Chromosome.
Unfortunately, when I tried to run:
from scenicplus.enhancer_to_gene import get_search_space
get_search_space(
    scplus_obj,
    species = None,
    assembly = None,
    pr_annot = pr_annot,
    pr_chromsizes = pr_chromsizes,
    upstream = [1000, 150000],
    downstream = [1000, 150000])
I got the following error:
2023-03-03 14:12:38,161 R2G INFO Extending promoter annotation to 10 bp upstream and 10 downstream
2023-03-03 14:12:50,467 R2G INFO Extending search space to:
150000 bp downstream of the end of the gene.
150000 bp upstream of the start of the gene.
2023-03-03 14:14:13,602 R2G INFO Intersecting with regions.
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
Traceback (most recent call last):
File "scenic+_BioMart.py", line 63, in <module>
get_search_space(
File "/wynton/home/sneddon/seandelao1991/scenic_plus/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py", line 399, in get_search_space
regions_per_gene.End - regions_per_gene.Start).astype(np.int32)
File "/wynton/home/sneddon/seandelao1991/scenic_plus/lib/python3.8/site-packages/pyranges/pyranges.py", line 269, in __getattr__
return _getattr(self, name)
File "/wynton/home/sneddon/seandelao1991/scenic_plus/lib/python3.8/site-packages/pyranges/methods/attr.py", line 67, in _getattr
raise AttributeError("PyRanges object has no attribute", name)
AttributeError: ('PyRanges object has no attribute', 'End')
Both PyRanges objects have a column labeled End, so I'm not sure what might be going on.
Hi @sneddonucsf
Your annotation file does not have proper chromosome names, i.e. they are just numbers (1, 2, 3, ...) instead of 'chr1', 'chr2', 'chr3', ...
I hope this helps.
Best,
Seppe
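The naming mismatch can be checked and repaired before calling get_search_space. The sketch below is not part of scenicplus: `to_ucsc_chrom` and `shared_chromosomes` are hypothetical helpers, and it assumes a human genome where Ensembl-style names (1 to 22, X, Y, MT) should become UCSC-style names (chr1, ..., chrM) to match the 10x chromosome sizes shown above. Scaffold and patch names are left untouched.

```python
import re

# Ensembl-style names of the primary human chromosomes (assumption: human genome).
_PRIMARY = re.compile(r"^(?:[1-9]|1[0-9]|2[0-2]|X|Y)$")


def to_ucsc_chrom(name):
    """Map an Ensembl-style chromosome name to its UCSC-style equivalent."""
    name = str(name)
    if name == "MT":          # Ensembl calls the mitochondrial genome MT, UCSC chrM
        return "chrM"
    if _PRIMARY.match(name):  # 1..22, X, Y -> chr1..chr22, chrX, chrY
        return "chr" + name
    return name               # scaffolds, patches, already-prefixed names: unchanged


def shared_chromosomes(annot_names, chromsize_names):
    """Return the chromosome names present in both collections."""
    return set(map(str, annot_names)) & set(map(str, chromsize_names))


annot = ["1", "X", "MT", "GL000008.2"]
sizes = ["chr1", "chrX", "chrM", "GL000008.2"]
print(shared_chromosomes(annot, sizes))                      # only the scaffold matches
print(shared_chromosomes(map(to_ucsc_chrom, annot), sizes))  # all four names match
```

With PyRanges, the same mapping could probably be applied through the underlying DataFrame: take `df = pr_annot.df`, set `df['Chromosome'] = df['Chromosome'].astype(str).map(to_ucsc_chrom)`, and rebuild the object with `pr.PyRanges(df)`.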
Hi @SeppeDeWinter
That fixed it! Thank you very much. I am running into another issue now, however.
get_search_space() is working fine now, but
run_scenicplus(
    scplus_obj = scplus_obj,
    variable = ['Annotation'],
    species = 'hsapiens',
    assembly = 'hg38',
    tf_file = '/wynton/home/sneddon/seandelao1991/scenic_proj/input/utoronto_human_tfs_v_1.01.txt',
    save_path = os.path.join(outDir, 'scenicplus'),
    biomart_host = biomart_host,
    upstream = [1000, 150000],
    downstream = [1000, 150000],
    calculate_TF_eGRN_correlation = True,
    calculate_DEGs_DARs = True,
    export_to_loom_file = True,
    export_to_UCSC_file = True,
    n_cpu = 12,
    _temp_dir = os.path.join(tmpDir, 'ray_spill'))
runs fine until it gets to the "Binarizing eGRNs AUC" step, where it has been hanging for hours. I've tried this with 500 GB of memory, but still no go. The function doesn't fail; it is technically still "running" (according to my HPC), it just gets stuck at that step. Looking at the output of the tutorials, I believe this step should take less than 30 minutes, so I'm not sure what's going on. Any suggestions?
@sneddonucsf
Aha great.
From this message I can see that most of the analysis has completed. The "Binarizing eGRNs AUC" step is not strictly necessary; it's only needed to export loom files. You can skip it by setting export_to_loom_file to False.
At this point your scplus_obj should contain all of the important results.
Best,
Seppe
@SeppeDeWinter weirdly enough, with
run_scenicplus(
    scplus_obj = scplus_obj,
    variable = ['Annotation'],
    species = 'hsapiens',
    assembly = 'hg38',
    tf_file = '/wynton/home/sneddon/seandelao1991/scenic_proj/input/utoronto_human_tfs_v_1.01.txt',
    save_path = os.path.join(outDir, 'scenicplus'),
    biomart_host = biomart_host,
    upstream = [1000, 150000],
    downstream = [1000, 150000],
    calculate_TF_eGRN_correlation = True,
    calculate_DEGs_DARs = True,
    export_to_loom_file = False,
    export_to_UCSC_file = False,
    n_cpu = 24,
    _temp_dir = os.path.join(tmpDir, 'ray_spill'))
the function still tries to run the "Binarizing eGRNs AUC" step and gets stuck. Since I did:
except Exception as e:
    # in case of failure, still save the object
    dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
    raise(e)
is it safe to assume that my scplus_obj object is still being saved along the way and that the results I need for downstream analysis should be in that file?
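One point worth keeping in mind with this pattern: an except block runs only when an exception is actually raised, so a step that hangs, or a job killed by the scheduler, never reaches the save. The sketch below illustrates the pattern with a stand-in object and the standard-library pickle module instead of dill; `run_and_save_on_failure` and `failing_step` are hypothetical names, not scenicplus functions.

```python
import os
import pickle
import tempfile


def run_and_save_on_failure(step, obj, path):
    """Run step(obj); if it raises, pickle obj to path and re-raise."""
    try:
        return step(obj)
    except Exception:
        # Reached only when step raises; a hang or a killed job never gets here.
        with open(path, "wb") as fh:
            pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)
        raise


def failing_step(obj):
    obj["done"].append("step_1")  # partial result produced before the failure
    raise RuntimeError("simulated failure")


state = {"done": []}
out_path = os.path.join(tempfile.mkdtemp(), "state.pkl")
try:
    run_and_save_on_failure(failing_step, state, out_path)
except RuntimeError:
    pass  # the object was saved before the error propagated

with open(out_path, "rb") as fh:
    restored = pickle.load(fh)
print(restored)  # {'done': ['step_1']}
```

If a step may hang rather than fail, saving the object explicitly after each completed step is the more dependable option.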
Not sure what the issue was, but I followed the step by step tutorial instead of the wrapper and was able to complete the analysis that way. Thanks for all of your help!
Hello,
I am trying to run run_scenicplus(), but unfortunately the HPC that I am using cannot connect to outside networks for security reasons, which blocks the BioMart portion. Is there a workaround I can try on the scenic+ side to get around this? Thank you!