broadinstitute/DeepProfilerExperiments

Evaluation metrics for representations #5

Open shntnu opened 3 years ago

shntnu commented 3 years ago

@jccaicedo can you clarify the metric you are now using for evaluating representations, and how you are reporting it?

IIUC you were previously using this https://github.com/broadinstitute/DeepProfilerExperiments/blob/master/profiling/quality.py

but are now using precision-based metrics, possibly Average Precision? https://github.com/broadinstitute/DeepProfilerExperiments/blob/master/profiling/metrics.py

h/t to @gwaygenomics whose issue sent me here https://github.com/cytomining/cytominer-eval/issues/17

jccaicedo commented 3 years ago

We are using both at the moment. We don't plan to drop any metrics; rather, we aim to expand them. I'd like to replace the enrichment analysis plot with a precision/recall or precision-at-top-connections plot. This is still in the works and we expect to make progress on it this week.

We have found enrichment analysis to be limited for capturing and interpreting profiling quality against ground-truth connections. I'm happy to have in-depth conversations about it; I'm interested in your feedback and would like to share our observations and results.

shntnu commented 3 years ago

Great idea to keep both evaluation metrics! I captured the definition of enrichment score below, for our notes.

Once you are set, can you write down exactly how you propose to use PR AUC or Precision@k in evaluating the dataset? Also, LMK if you disagree with the highlighted part in the second para below; it's possible you are setting up the averaging differently.

Enrichment score

From : https://www.nature.com/articles/s41467-019-10154-8#Sec4

We define enrichment score as the odds ratio in a one-sided Fisher’s exact test, which tests whether having high profile similarity for a treatment pair is independent of the treatments sharing an MOA/pathway. To perform the test, we form the 2 × 2 contingency table by dividing treatment pairs into four categories, based on whether they have high profile correlations, determined by a specified threshold (in rows) and whether they share an MOA/pathway (in columns). The odds ratio is then defined as the ratio of elements in the first row divided by that of the second row in the contingency table. This roughly measures how likely it is to observe same MOA/pathway treatment pairs in highly correlated vs. non-highly correlated treatment pairs.

We rejected an alternative evaluation approach, accuracy in MOA/pathway classification7, which only works well if MOAs are all well/equally represented in the dataset. The approach we took is better suited for the MOA class imbalance situation (as is the case for the datasets analyzed in this paper), as the enrichment is calculated based on a null distribution that tends to normalize MOA class sizes implicitly. Otherwise, treatments belonging to larger MOA classes tend to dominate the classification accuracy. Note that the chemical datasets we have presented reflect a huge variety of structures rather than a small number of compounds hand-picked to belong to particular classes; furthermore, annotations are sparse as many small molecules’ mechanisms are unknown. As a result, while the number of samples are large, they are spread across many classes, resulting in many classes with very few samples. Although alternate metrics such as F-score/precision/recall can help to mitigate class imbalances, they cannot overcome the small sizes for most classes in this dataset.
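For our notes, here is a minimal sketch of how the odds ratio above could be computed with SciPy, assuming a square treatment-by-treatment similarity matrix and a boolean matrix marking pairs that share an MOA/pathway; the function name and the percentile threshold are illustrative, not the repository's implementation:

```python
import numpy as np
from scipy.stats import fisher_exact

def enrichment_score(similarity, same_moa, percentile=99):
    """Odds ratio of a one-sided Fisher's exact test over treatment pairs.

    similarity : (n, n) array of profile similarities (e.g. correlations)
    same_moa   : (n, n) boolean array, True where a pair shares an MOA/pathway
    percentile : similarity percentile above which a pair counts as highly correlated
    """
    iu = np.triu_indices_from(similarity, k=1)          # unique treatment pairs
    sim, shared = similarity[iu], same_moa[iu]
    high = sim > np.percentile(sim, percentile)

    # 2 x 2 contingency table: rows = high / not-high similarity,
    # columns = shared / not-shared MOA
    table = [
        [np.sum(high & shared), np.sum(high & ~shared)],
        [np.sum(~high & shared), np.sum(~high & ~shared)],
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return odds_ratio, p_value
```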

jccaicedo commented 3 years ago

The limitations cited in the paper make sense under a 1-NN classification approach (which is the one adopted in Ljosa 2013). In fact, enrichment analysis and 1-NN classification are two extremes of performance evaluation.

What we are exploring in our experiments is an intermediate approach: a ranked list of top connections per class. This is very common in information retrieval problems (e.g., the results page of a Google query), and there are many metrics that can be used to assess relevance, including, but not limited to, precision, recall, F1, and even enrichment analysis.
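As a concrete reference point, here is a minimal sketch of precision and recall at K for a single ranked list of connections (a generic, hypothetical helper, not our metrics.py code):

```python
import numpy as np

def precision_recall_at_k(relevant, k):
    """Precision@K and Recall@K for one ranked list.

    relevant : boolean sequence ordered by decreasing similarity to the query;
               True where the retrieved item shares a class with the query.
    k        : number of top connections to inspect.
    """
    relevant = np.asarray(relevant, dtype=bool)
    hits_at_k = relevant[:k].sum()
    precision = hits_at_k / k
    recall = hits_at_k / max(relevant.sum(), 1)   # guard against lists with no positives
    return precision, recall

# e.g. precision_recall_at_k([True, False, True, True], k=2) -> (0.5, 1/3)
```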

I'll post results here when we have something ready to share!

shntnu commented 3 years ago

Thanks @jccaicedo. I've made some notes below for us to discuss later. Looking forward to the results!


What we are exploring in our experiments is an intermediate approach: a ranked list of top connections per class ... there are many metrics that can be used to assess relevance, including, but not limited to precision, recall, F1, and even enrichment analysis.

Updated 4/1/21 after the discussion with @jccaicedo below https://github.com/broadinstitute/DeepProfilerExperiments/issues/5#issuecomment-812191704

We have a weighted graph where the vertices are perturbations with multiple labels (e.g. pathways, in the case of genetic perturbations), and the edge weights are the similarities between vertices (e.g. the cosine similarity between the image-based profiles of two CRISPR knockouts).

There are three levels of ranked lists of edges, each of which can produce global metrics (based on binary classification metrics like precision, recall, F1, etc.). These global metrics can be used to compare representations.

In all three cases, we pose it as a binary classification problem on the edges: an edge is Class 1 if its two vertices share at least one label, and Class 0 otherwise.

The three levels of ranked lists of edges, along with the metrics they induce, are below (a code sketch of this setup follows the list):

(Not all the metrics are useful, and some may be very similar to others. I have highlighted the ones I think are useful.)

  0. Global: a single list comprising all edges.
     a. We can directly compute a single global metric from this list.
  1. Label-specific: one list per label, comprising all edges that have at least one vertex with that label.
     a. We can compute a label-specific metric from each list, with an additional constraint on Class 1 edges: both vertices should share the label being evaluated.
     b. We can then (weighted) average the label-specific metrics to get a single global metric.
     c. We can also compute a global metric directly across all the label-specific lists.
  2. Sample-specific: one list per sample, comprising all edges that have that sample as one of their vertices.
     a. We can compute a sample-specific metric from each list.
     b. We can then average the sample-specific metrics to get a label-specific metric, filtered as in 1.a, although it may not be quite as straightforward; 2.e might be better.
     c. We can further (weighted) average these label-specific metrics to get a single global metric.
     e. We can also compute a label-specific metric directly across the sample-specific lists, filtered as in 1.a.
     f. We can also directly average the sample-specific metrics to get a single global metric.
     g. We can also compute a single global metric directly across all the sample-specific lists.
     h. We can also (weighted) average the label-specific metrics from 2.e to get a single global metric.
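A small sketch of this setup, assuming profiles are rows of a matrix and each sample carries a set of labels (all names hypothetical): it builds the Class 1 / Class 0 edge list once, notes the three levels as comments, and materializes the level 2 (sample-specific) lists:

```python
import itertools
import numpy as np

def build_edges(profiles, labels):
    """All sample pairs as weighted edges; Class 1 if the pair shares at least one label.

    profiles : (n, d) array of image-based profiles, one row per perturbation
    labels   : list of n sets of labels (multi-label, e.g. pathways or MOAs)
    """
    normed = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    similarity = normed @ normed.T                       # cosine similarity = edge weight
    edges = []
    for i, j in itertools.combinations(range(len(labels)), 2):
        positive = len(labels[i] & labels[j]) > 0        # Class 1: shared label
        edges.append((i, j, similarity[i, j], positive))
    return edges

# Level 0 (global): a single ranked list of all edges.
# Level 1 (label-specific): for each label, the edges with at least one vertex carrying it.
# Level 2 (sample-specific): for each sample, the edges incident to it, ranked by weight.
def sample_specific_lists(edges, n_samples):
    lists = {s: [] for s in range(n_samples)}
    for i, j, sim, positive in edges:
        lists[i].append((sim, positive))
        lists[j].append((sim, positive))
    return {s: sorted(lst, reverse=True) for s, lst in lists.items()}
```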

Note that Rohban does type 0.a, with the global metric being enrichment score.

I think this loosely relates to averaging types discussed here

It sounds like you are planning to do either 1 or 2.

Update: Juan et al. are doing 2.f, with Precision@K and PR AUC as the sample-specific metrics.

jccaicedo commented 3 years ago

I think Rohban does 0: global metric, no class specific (if I understand your classification correctly, but maybe it's just a change in indexing :P).

We have discussed and developed 2 (sample-specific) in two different flavors:

A is measured with precision@K, and B is measured with an interpolated Precision-Recall curve. We obtained results applying these to the TA-ORF dataset, using this implementation.
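For reference, a minimal sketch of the standard precision interpolation (precision at each recall level replaced by the maximum precision at any higher recall), applied to one sample's ranked list; this is the textbook procedure, not necessarily the exact implementation linked above:

```python
import numpy as np

def interpolated_precision_recall(relevant):
    """Interpolated precision-recall points for one ranked list.

    relevant : boolean array ordered by decreasing similarity (True = correct connection).
    Returns (recall, interpolated precision) arrays.
    """
    relevant = np.asarray(relevant, dtype=bool)
    hits = np.cumsum(relevant)
    precision = hits / np.arange(1, len(relevant) + 1)
    recall = hits / max(hits[-1], 1)                  # guard: list with no positives
    # interpolation: precision at recall r = max precision at any recall >= r
    interpolated = np.maximum.accumulate(precision[::-1])[::-1]
    return recall, interpolated
```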

We will not do 1 in your list (class-specific evaluation) for now. Pathway and MOA annotations are not multi-class (1 out of N classes); they are multi-label (K out of N classes), which can make the connectivity and results tricky to interpret. Happy to discuss this choice further if there is interest in such a measure.

shntnu commented 3 years ago

@jccaicedo So exciting to see the new TA ORF results, with trained features being so much better than pre-trained! (and actually, even more exciting that both neural features are consistently better than CellProfiler, although I realize you've already moved beyond that :D).

The Precision@K (with low K) is probably the most relevant, so it's really great to see that the gap is very high there.

I assume the plot below is what you're referring to as Precision@K? It says Average Precision, so I was a bit confused. Maybe you meant the average of Precision@K across all samples? To make sure I understand, can you explain what is, for example, the meaning of the point at X=10 on the green curve (with Y = ~.47)?

My interpretation is: at X=10, for each sample we take its top 10 connections (the 10 most similar samples), compute the fraction that share at least one label with it, and then average that fraction across all samples, giving Y ≈ 0.47.

Is this correct?

[Plot: average Precision@K vs. K, one curve per feature set]


I have updated https://github.com/broadinstitute/DeepProfilerExperiments/issues/5#issuecomment-804451302 with some notes. (Please forgive the tiresome categorization; I'm a bit too much into the weeds right now :D)

I think Rohban does 0: global metric, no class specific (if I understand your classification correctly, but maybe it's just a change in indexing :P).

You are indeed right – I was off-by-one :D. Rohban computes a type 0.a metric, with the global metric being enrichment score.

It sounds like you are doing type 2.f, with the sample-specific metric being Precision@K (in the example I cite above)

We will not do 1 in your list (class-specific evaluation) for now. Pathway and MOA annotations are not multi-class (1 out of N classes); they are multi-label (K out of N classes), which can make the connectivity and results tricky to interpret. Happy to discuss this choice further if there is interest in such a measure.

You are right about multi-label; I've updated https://github.com/broadinstitute/DeepProfilerExperiments/issues/5#issuecomment-804451302 to reflect that this is a multi-label problem.

For comparing representations, which is the main goal right now, I think your single global metric (e.g. average of Precision@K; a type 2.f if I've got that right) is perfectly sound. One can debate which is best (average of Precision@K, mAP, or something else), but it's fine, and even preferable, to report multiple.
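For concreteness, here is what mAP would look like in this setting: the average precision of each sample's ranked connection list (precision evaluated at the rank of every positive connection), averaged across samples. This is the standard definition with hypothetical names, not necessarily what the repo computes:

```python
import numpy as np

def mean_average_precision(ranked_lists):
    """mAP across samples (a type 2.f-style global metric).

    ranked_lists : dict sample -> [(similarity, is_positive), ...], sorted by
                   decreasing similarity; is_positive = shares at least one label.
    """
    average_precisions = []
    for lst in ranked_lists.values():
        relevant = np.array([positive for _, positive in lst], dtype=bool)
        if not relevant.any():
            continue                                   # no positives: AP undefined, skip
        precision = np.cumsum(relevant) / np.arange(1, len(relevant) + 1)
        average_precisions.append(precision[relevant].mean())
    return float(np.mean(average_precisions))
```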

But, going beyond comparing representations, I think it would be useful to have label-specific metrics, most likely type 1.a or 2.b because we'd love to know which MOAs or pathways are best captured using a profiling method. Happy to discuss this further now, or later if you prefer :)

jccaicedo commented 3 years ago

Your interpretation is correct, @shntnu ! It is the average of Precision@K for all samples.

can you explain what is, for example, the meaning of the point at X=10 on the green curve (with Y = ~.47)?

X=10 means that we look at the top 10 connections for each sample; on average, Y=0.47 indicates that approximately 47% of them are biologically meaningful (have at least one class label in common).
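In other words, something like the following sketch of the 2.f aggregation (assuming the sample-specific ranked lists sketched earlier; not the actual plotting code):

```python
import numpy as np

def mean_precision_at_k(ranked_lists, k):
    """Average of Precision@K across samples (type 2.f).

    ranked_lists : dict sample -> [(similarity, is_positive), ...], sorted by
                   decreasing similarity.
    """
    per_sample = [
        np.mean([positive for _, positive in lst[:k]])
        for lst in ranked_lists.values()
        if len(lst) >= k
    ]
    return float(np.mean(per_sample))

# A point at X=10 with Y ~= 0.47 means: among each sample's top-10 connections,
# about 47% share at least one label, averaged over all samples.
```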

Agree that label-specific metrics would be useful. I think that, in the context of image-based profiling applications, the metrics that make the most sense are sample-specific. The reason is that we usually make a query and expect to retrieve a list of candidates with as high a hit rate as possible. Statistics based on samples are more biologically interpretable, and therefore the metrics in category 2 are more compelling to me. So if performance per label is of interest, I would recommend exploring 2.b.
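If label-specific numbers become of interest, a simplified sketch of a 2.b-style aggregation could group the sample-specific Precision@K by each label of the query sample; note this counts any shared label as a hit rather than re-filtering to the label being evaluated, so it is looser than 2.b as written above (names hypothetical):

```python
import numpy as np
from collections import defaultdict

def precision_at_k_per_label(ranked_lists, labels, k):
    """Average the sample-specific Precision@K within each label (simplified 2.b).

    ranked_lists : dict sample -> [(similarity, is_positive), ...], sorted by similarity
    labels       : list of label sets, one per sample (multi-label annotations)
    """
    per_label = defaultdict(list)
    for sample, lst in ranked_lists.items():
        if len(lst) < k:
            continue
        p_at_k = np.mean([positive for _, positive in lst[:k]])
        for label in labels[sample]:
            per_label[label].append(p_at_k)
    return {label: float(np.mean(values)) for label, values in per_label.items()}
```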

I'll have a look and will report results in TA-ORF when ready.