TheJacksonLaboratory / endometriosis-scrnaseq

Code to reproduce analysis and figures for scRNA-seq analysis of endometriosis in Tan et al. 2021.
MIT License

GSEApy #8

Open yulianatan opened 3 years ago

yulianatan commented 3 years ago

I'm trying to run GSEA in a for loop over 43 subclusters, but the kernel sometimes dies during the ES/NES calculation. I'm not sure what the issue is, but I suspect memory usage, because it ran OK when I reduced the number of permutations from 1000 to 100. I'm thinking of running it outside the notebook (1000 permutations is the default setting). How can I effectively transform this code so I can run it with sbatch?


import gseapy as gp
import numpy as np
import pandas as pd

main_dir = "/projects/robson-lab/research/endometriosis/"
sample_id = "Endo-Tissue-EC19001-EC20015"

# subclusters holds the name of each subtype, e.g. ['CTL', 'NK1', 'mid-secretory', ...]
subclusters = pd.read_csv(f"{main_dir}analysis/{sample_id}/DEG/edgeR-input/subtypelist.csv", sep=",", index_col=0)["0"].tolist()

for clustername in subclusters:
    deg = pd.read_csv(f"{main_dir}analysis/{sample_id}/DEG/edgeR-output/{clustername}-DEG.csv", 
                      sep=",", index_col=0)
    deg = deg[deg.FDR < 1e-03]
    pairs = deg.loc[:,deg.columns.str.contains("logFC")].columns.tolist()

    for pair in pairs:
        get_list = deg[pair].reset_index()
        get_list.rename(columns={"index":"names",pair:"logfoldchanges"},inplace=True)
        get_list.sort_values(by="logfoldchanges",ascending=False,inplace=True)
        res = gp.prerank(rnk=get_list, gene_sets="GO_Biological_Process_2018",
                         outdir=f"{main_dir}analysis/{sample_id}/DEG/GSEA-output/{clustername}-{pair}/",
                         no_plot=True,verbose=False,permutation_num=1000)
        # keep only significant terms, strongest enrichment first
        terms = res.res2d[res.res2d.fdr<0.05].sort_values(by="nes",ascending=False)
        terms.to_csv(f"{main_dir}analysis/{sample_id}/DEG/GSEA-output/GOBP-filtered/gsea_{clustername}-{pair}.csv")
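
One possible way to move this out of the notebook, as a rough sketch and not a tested solution: put the loop body in a standalone script and let a Slurm array job run one subcluster per task, so each task gets its own memory allocation. The script name `run_gsea.py`, the helper names `pick_subcluster` and `run_subcluster`, and the `os.makedirs` call for the `GOBP-filtered` folder are assumptions added for illustration; the paths and the `gp.prerank` call are copied from the code above. `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task.

```python
import os

MAIN_DIR = "/projects/robson-lab/research/endometriosis/"
SAMPLE_ID = "Endo-Tissue-EC19001-EC20015"
DEG_DIR = f"{MAIN_DIR}analysis/{SAMPLE_ID}/DEG/"


def pick_subcluster(subclusters, task_id):
    """Map a 0-based Slurm array index to one subcluster name (hypothetical helper)."""
    if not 0 <= task_id < len(subclusters):
        raise IndexError(f"task id {task_id} outside 0..{len(subclusters) - 1}")
    return subclusters[task_id]


def run_subcluster(clustername, permutation_num=1000):
    """Run the prerank loop from the notebook for a single subcluster."""
    # Heavy imports live here so pick_subcluster can be used without them.
    import gseapy as gp
    import pandas as pd

    deg = pd.read_csv(f"{DEG_DIR}edgeR-output/{clustername}-DEG.csv",
                      sep=",", index_col=0)
    deg = deg[deg.FDR < 1e-03]
    pairs = deg.columns[deg.columns.str.contains("logFC")].tolist()

    # Assumption: create the output folder up front so to_csv cannot fail on it.
    filtered_dir = f"{DEG_DIR}GSEA-output/GOBP-filtered/"
    os.makedirs(filtered_dir, exist_ok=True)

    for pair in pairs:
        get_list = deg[pair].reset_index()
        get_list.columns = ["names", "logfoldchanges"]
        get_list = get_list.sort_values(by="logfoldchanges", ascending=False)
        res = gp.prerank(rnk=get_list, gene_sets="GO_Biological_Process_2018",
                         outdir=f"{DEG_DIR}GSEA-output/{clustername}-{pair}/",
                         no_plot=True, verbose=False,
                         permutation_num=permutation_num)
        terms = res.res2d[res.res2d.fdr < 0.05].sort_values(by="nes", ascending=False)
        terms.to_csv(f"{filtered_dir}gsea_{clustername}-{pair}.csv")


if __name__ == "__main__":
    task_id = os.environ.get("SLURM_ARRAY_TASK_ID")
    if task_id is None:
        print("SLURM_ARRAY_TASK_ID is not set; submit this script as an array job.")
    else:
        import pandas as pd
        subclusters = pd.read_csv(f"{DEG_DIR}edgeR-input/subtypelist.csv",
                                  sep=",", index_col=0)["0"].tolist()
        run_subcluster(pick_subcluster(subclusters, int(task_id)))
```

With 43 subclusters this could be submitted with something like `sbatch --array=0-42 --mem=32G --wrap="python run_gsea.py"`, where the memory value is only a placeholder to adjust. If the compute nodes have no network access, passing the gene set by name may not work, since it is presumably fetched from Enrichr at run time.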