EngreitzLab / gene_network_evaluation

Evaluation framework for computationally inferred gene networks from single-cell data.
8 stars 6 forks source link

How do we define a term as uniquely enriched or associated? #22

Closed adamklie closed 4 days ago

adamklie commented 3 weeks ago

Been using these functions that @aron0093 originally implemented:

def count(categorical_var, count_var, dataframe):
    counts_df = dataframe.value_counts([categorical_var, count_var])
    counts_df = counts_df.groupby(categorical_var).sum()
    counts_df = counts_df.sort_values(ascending=False)
    counts_df = pd.DataFrame(counts_df.reset_index().values,
                             columns=[categorical_var, count_var])
    return counts_df

def count_unique(categorical_var, count_var, dataframe, cummul=False, unique=False):
    counts_df = count(categorical_var, count_var, dataframe)
    new_df = []
    terms = []
    for prog in counts_df[categorical_var].unique():
        terms_ = dataframe.loc[dataframe[categorical_var] == prog, count_var].unique()
        unique_terms = [term for term in terms_ if term not in terms]
        terms.extend(unique_terms)
        new_df.append([prog, len(unique_terms)])
    new_df = pd.DataFrame(new_df, columns=[categorical_var, count_var])
    if cummul:
        new_df[count_var] = new_df[count_var].cumsum()
    if unique:
        return counts_df
    else:
        return new_df 

Data

program geneset p-value adjusted p-value
program1 genesetA 0.01 0.01
program1 genesetA 0.02 0.02
program2 genesetA 0.03 0.03
program2 genesetA 0.04 0.04
program3 genesetB 0.05 0.05

Can think of three ways to count terms:

1. All enriched terms

Count everything, including terms enriched multiple times in the same program (shouldn't happen right?) and terms enriched in multiple programs.

count(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
program geneset
program1 2
program2 2
program3 1

2. Unique within a program, but can be repeated across programs

i.e if program1 and program2 are both enriched for genesetA we count it for both programs

unique_data = data.drop_duplicates(subset=[categorical_var, count_var])
count(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
program geneset
program1 1
program2 1
program3 1

3. Unique across all programs

i.e. if program1 and program2 are both enriched for genesetA, but program1 has a much lower adjusted p-value, we only count genesetA for program1

unique_data = data.sort_values(by=sig_var)
unique_data = unique_data.drop_duplicates(subset=count_var)
unique_df = count_unique(categorical_var=categorical_var, count_var=count_var, dataframe=unique_data)
unique_df = unique_df.sort_values(count_var, ascending=False)
program geneset
program1 1
program3 1

**Note that I didn't use count_unique(..., unique=True) here because I think it arbitrarily selects which program to bin a term in when it is duplicated across terms, rather than selecting the one it is most enriched for.

Which to use?

I think it depends. To me most of the time I think 2 is the right option since we could easily have redundancy between programs and we want that captured. But maybe we make this something a dashboard user can select?

aron0093 commented 2 weeks ago

For the evaluations I have been using 3. The idea is that as long as we are capturing more biological signal with increasing K the total number of unique terms would keep increasing. When this value plateaus, we are probably adding redundant and/or noise components.

adamklie commented 4 days ago

Makes sense. This is how I've been doing it for the dashboard so far so I think we are good here