broadinstitute / monorepo

Compendium of tools for the Imaging Platform
9 stars 1 forks source link

Inconsistency between Morphmap cosine similarities and JUMP_rr #49

Closed afermg closed 3 weeks ago

afermg commented 3 weeks ago

Distances from the morphmap paper (repo) do not match the ones provided by jump_rr.

This issue resumes an email discussion.

Hi Alán and Anne,

The profiles that you are using are the correct ones. I quickly checked the list of genes similar/anti-similar to PAFAH1B1 and here is what I got

PAFAH1B1 1.000000 NDEL1 0.911471 NDE1 0.860934 CCHCR1 0.556915 DIXDC1 0.541768 ...
FAM114A1 -0.435566 ATCAY -0.450775 HOOK1 -0.473261 CRACR2B -0.474904 HOOK2 -0.610536

This is what we describe in the paper. I also checked what the cosine similarity of PIWIL1 was to PAFAH1B1, and it turned out to be 0.05. This suggests that there is something entirely different about how we calculate cosine similarities. I take consensus sequences (median averaging the profiles) and then use copairs for calculating the cosine similarities (I, so far, have never found the values from copairs to be different from those calculated using sklearn). Let me know how you calculate them, and we can figure out what is causing this discrepancy. We can continue this discussion somewhere on GitHub if you would prefer that.

Niranj

How I process the data

I calculate the concensus (median) profile and then perform an all-vs-all cupy.spatial.cdist (cosine distance) operation, which yields the cosine distance. Finally, I convert the distance to simliarity by performing (1-X).

What I have checked


@niranjchandrasekaran Could you please point me towards the code that produces your cosine similarities? Mine are here

Also the hashsums for my files are, the CRISPR dataset has 293 columns and ORF 726.

5903af59605b2037190ff64c1f87c530  crispr.parquet
064759b3a850dc351b357116b3e7b32d  orf.parquet

Obtained using the manifest file.

Finally, as suggested by Shantanu, we can compare the rank correlation between the distance of PAFAH1B1 (as an example) against the other genes. Would you mind sending me/pointing me towards the matrix/vector of cosine distances? Thanks!

afermg commented 3 weeks ago

Side-note, and mostly to bring it up: Doing some research I found that FAM114A1 has two different JUMP IDs: See these babel entries.

afermg commented 3 weeks ago

I was able to reproduce the results I think, meaning that there is something off downstream that gets triggered in some cases. Here my minimal example that reproduces your numbers:

#!/usr/bin/env jupyter

from broad_babel.query import get_mapper
from jump_rr.datasets import get_dataset
import cupyx.scipy.spatial as spatial
import matplotlib.pyplot as plt
import polars as pl
import polars.selectors as cs

df = pl.read_parquet(get_dataset("orf"))
to_check = [
    "PAFAH1B1",
    "NDEL1",
    "NDE1",
    "CCHCR1",
    "DIXDC1",
    "FAM114A1",
    "ATCAY",
    "HOOK1",
    "CRACR2B",
    "HOOK2",
]
names = df.get_column("Metadata_JCP2022")
name_mapper = get_mapper(
    tuple(names), input_column="JCP2022", output_columns="JCP2022,standard_key"
)

# %% get concensus and filter selected
med = df.group_by("Metadata_JCP2022").median()

perts_of_interest = med.with_columns(
    std=pl.col("Metadata_JCP2022").replace(name_mapper)
).filter(pl.col("std").is_in(to_check))
vals = cp.array(perts_of_interest.select(cs.by_dtype(pl.Float32)).to_numpy())
# %% Calculate cosine distance
cosine_sim = 1 - spatial.distance.cdist(vals, vals, metric="cosine")

idx_to_sort = perts_of_interest.with_row_index().filter(pl.col("std")=="PAFAH1B1")[0,0]
names = perts_of_interest.get_column("std").to_numpy()[new_order.get()]
new_order = cosine_sim[idx_to_sort].argsort()

# %% Plotting
fig, ax = plt.subplots()
new_cosine_sim = cosine_sim.get()[new_order.get(),:]
new_cosine_sim = new_cosine_sim[:, new_order.get()]
im = ax.imshow(new_cosine_sim)

# We want to show all ticks...
rng = range(len(names))
ax.set_xticks(rng)
ax.set_yticks(rng)
ax.set_xticklabels(names)
ax.set_yticklabels(names)

# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(names)):
    for j in range(len(names)):
        text = ax.text(
            j,
            i,
            f"{new_cosine_sim[i, j]:,.2f}",
            ha="center",
            va="center",
            color="w",
            fontsize="x-small",
        )

ax.set_title("Cosine similarities of genes of interest")
fig.tight_layout()
plt.show()

image

afermg commented 3 weeks ago

It is fixed now! You can test it using this URL. There was some non-deterministic behaviour in one of the polars functions, which led to some undesired shuffling. I hope this hasn't caused too many troubles! I will update the broad.io links ASAP to reflect the changes.

There may still be minor inconsistencies (related to the 4th decimal number) because I am using Float32 instead of 64 for speed purposes, but most of the hits should remain the same. image

afermg commented 3 weeks ago

broad.io links are up-to-date now.

afermg commented 3 weeks ago

The relevant commits are ff20f5677ee82686831ef1d838b109c7463b7635 - c0d61cfa2e6cda1bba1d82a452ffa01fd414c151

niranjchandrasekaran commented 3 weeks ago

@afermg, Phew, I am glad that you were able to get the same numbers. For a minute, I was worried that I have to repeat all the Morphmap analyses :)

Side-note, and mostly to bring it up: Doing some research I found that FAM114A1 has two different JUMP IDs: See these babel entries.

You will find many such cases because in ORFs, each ORF reagent gets a JCP ID and there are many genes with multiple ORF reagents. The way I deal with these is by computing aggregated profiles by gene names (NCBI ID or Symbol). See https://github.com/jump-cellpainting/morphmap/issues/178 for more details.