aertslab / scenicplus

SCENIC+ is a python package to build gene regulatory networks (GRNs) using combined or separate single-cell gene expression (scRNA-seq) and single-cell chromatin accessibility (scATAC-seq) data.
Other
176 stars 28 forks source link

Re-calculating enrichment scores after filtering eRegulons and "+"/ "-" eRegulon signs #466

Open maithermbarros opened 1 week ago

maithermbarros commented 1 week ago

Describe the bug Hey, thanks for this tool, it is giving me great insight about my dataset. I actually have 2 questions:

1

Right now, I am trying to follow the tutorials here, here and here to perform downstream analysis, but the tutorials aren't very clear.

The main problem I am facing right now is the step of re-calculating enrichment scores after filtering the output. In the tutorial it seems that the code is based on an older version of scenicplus since the output files I get don't match the ones you guys mention in the tutorial. I can't find the files named "gene_ranking" or "region_ranking".

Tutorial: image image image

My output: image

2

After running scenicplus and exploring the eRegulons, I don't understand the difference between: "-/-", "-/+" and "+/-" in their nomenclature (e.g. Thrbdirect-/+_). Can you please explain?

Version (please complete the following information): Python: 3.11.8 miniconda environemnt SCENIC+: 0.1

Thanks a lot!!

SeppeDeWinter commented 1 week ago

Hi @maithermbarros

Indeed the tutorials you are referring to are out-dated, you can find the new tutorials here: https://scenicplus.readthedocs.io/en/latest/tutorials.html.

As to your second questions. The + and - signs refer to the TF-to-gene and region-to-gene correlation coefficients. The first sign indicates wether the TF expression is either positively (+) or negatively (-) correlated with the predicted target gene expression. The second sign indiciates wether the region accessibility is either positively (+) or negatively (-) correlated with the predicted target gene expression.

I hope this helps?

All the best,

Seppe

maithermbarros commented 1 week ago

Hi @SeppeDeWinter Thanks a lot for your quick reply and for creating such an amazing tool! Yes, now I get the meaning of signs, thanks.

And I was indeed looking at the updated tutorial but since I didn't find anything on filtering the eRegulons I guess I started digging and bumped into the old ones lol. I have follow-up questions though:

1) Is there a reason why you guys removed this step of filtering eRegulons from the current tutorial? Is there a piece of code somewhere that shows how to do that with the updated list of outputs from snakemake?

2) I can't find any tutorials on how to:

Like shown in the SCENIC+ paper (see screenshot below). Any possibility that you guys can share the codes used to create figures in the paper? image

Cheers, Maithe

maithermbarros commented 2 days ago

Hi @SeppeDeWinter, sorry to bother you again but in the updated tutorial there is almost no information about accessing the results after generating the output from snakemake and the older tutorials don't really work as the output files are not the same. As I mentioned previously I am at a stage where I need to filter high quality eRegulons. Right now I have over 500 regulons and it is impossible to draw any biological conclusion like that.

In the paper you guys mention that "High-quality regulons were selected based on the correlation between gene-based regulon AUCs and region-based regulon AUCs (>0.4) and on the number of target genes (>10)." But where can I find a tutorial to calculate the correlations? This issue was also opened in the past about this https://github.com/aertslab/scenicplus/discussions/58

I calculated the correlations like this, could you please let me know what you think?


import mudata as mu
AUCell_direct = mu.read_h5mu("/Snakemake/AUCell_direct.h5mu")

### CALCULATE CORRELATION BETWEEN GENE-BASED REGULON AUC VALUES AND REGION-BASED AUC REGULON VALUES
#### Retrieve the cell information (observations) from both datasets
cells_gene_based = AUCell_direct.mod['Gene_based'].obs_names
cells_region_based = AUCell_direct.mod['Region_based'].obs_names

if all(cells_gene_based == cells_region_based):
    print("Cell order is the same")

#### Access AUC values
gene_based_auc = AUCell_direct.mod['Gene_based'].X
region_based_auc = AUCell_direct.mod['Region_based'].X

#### Ensure that the regulons (column names) match in both matrices
gene_based_regulons = AUCell_direct.mod['Gene_based'].var_names
region_based_regulons = AUCell_direct.mod['Region_based'].var_names

gene_based_regulons == region_based_regulons # they don't match because the numbers within parenthesis are not the same

#### Strip out the number of genes/regions (suffix in parentheses)
gene_based_regulons_cleaned = gene_based_regulons.str.replace(r'\_\(\d+g\)', '', regex=True)
region_based_regulons_cleaned = region_based_regulons.str.replace(r'\_\(\d+r\)', '', regex=True)

gene_based_regulons_cleaned == region_based_regulons_cleaned

# Convert matrices to DataFrames 
df_gene_based = pd.DataFrame(gene_based_auc, index = cells_gene_based, columns=gene_based_regulons_cleaned)
df_region_based = pd.DataFrame(region_based_auc, index=cells_region_based , columns=region_based_regulons_cleaned)

# calculate the correlation between corresponding regulons
correlations = df_gene_based.corrwith(df_region_based, axis=0)`

In the outdated tutorial you guys also used "gene_ranking.pkl" and "region_ranking.pkl" to recalculate enrichment scores after filtering eregulons using the function below, but I also don't know where to find gene_ranking and region_ranking info.

from scenicplus.preprocessing.filtering import apply_std_filtering_to_eRegulons
apply_std_filtering_to_eRegulons(scplus_mdata)

And would you have tutorials to plot networks and coverage plots?

Thanks a lot for your time and help, it is much appreciated :)