Closed afermg closed 3 weeks ago
Side-note, and mostly to bring it up: Doing some research I found that FAM114A1 has two different JUMP IDs: See these babel entries.
I was able to reproduce the results I think, meaning that there is something off downstream that gets triggered in some cases. Here my minimal example that reproduces your numbers:
#!/usr/bin/env jupyter
from broad_babel.query import get_mapper
from jump_rr.datasets import get_dataset
import cupyx.scipy.spatial as spatial
import matplotlib.pyplot as plt
import polars as pl
import polars.selectors as cs
df = pl.read_parquet(get_dataset("orf"))
to_check = [
"PAFAH1B1",
"NDEL1",
"NDE1",
"CCHCR1",
"DIXDC1",
"FAM114A1",
"ATCAY",
"HOOK1",
"CRACR2B",
"HOOK2",
]
names = df.get_column("Metadata_JCP2022")
name_mapper = get_mapper(
tuple(names), input_column="JCP2022", output_columns="JCP2022,standard_key"
)
# %% get concensus and filter selected
med = df.group_by("Metadata_JCP2022").median()
perts_of_interest = med.with_columns(
std=pl.col("Metadata_JCP2022").replace(name_mapper)
).filter(pl.col("std").is_in(to_check))
vals = cp.array(perts_of_interest.select(cs.by_dtype(pl.Float32)).to_numpy())
# %% Calculate cosine distance
cosine_sim = 1 - spatial.distance.cdist(vals, vals, metric="cosine")
idx_to_sort = perts_of_interest.with_row_index().filter(pl.col("std")=="PAFAH1B1")[0,0]
names = perts_of_interest.get_column("std").to_numpy()[new_order.get()]
new_order = cosine_sim[idx_to_sort].argsort()
# %% Plotting
fig, ax = plt.subplots()
new_cosine_sim = cosine_sim.get()[new_order.get(),:]
new_cosine_sim = new_cosine_sim[:, new_order.get()]
im = ax.imshow(new_cosine_sim)
# We want to show all ticks...
rng = range(len(names))
ax.set_xticks(rng)
ax.set_yticks(rng)
ax.set_xticklabels(names)
ax.set_yticklabels(names)
# Rotate the tick labels and set their alignment.
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(names)):
for j in range(len(names)):
text = ax.text(
j,
i,
f"{new_cosine_sim[i, j]:,.2f}",
ha="center",
va="center",
color="w",
fontsize="x-small",
)
ax.set_title("Cosine similarities of genes of interest")
fig.tight_layout()
plt.show()
It is fixed now! You can test it using this URL. There was some non-deterministic behaviour in one of the polars functions, which led to some undesired shuffling. I hope this hasn't caused too many troubles! I will update the broad.io links ASAP to reflect the changes.
There may still be minor inconsistencies (related to the 4th decimal number) because I am using Float32 instead of 64 for speed purposes, but most of the hits should remain the same.
broad.io links are up-to-date now.
The relevant commits are ff20f5677ee82686831ef1d838b109c7463b7635 - c0d61cfa2e6cda1bba1d82a452ffa01fd414c151
@afermg, Phew, I am glad that you were able to get the same numbers. For a minute, I was worried that I have to repeat all the Morphmap analyses :)
Side-note, and mostly to bring it up: Doing some research I found that FAM114A1 has two different JUMP IDs: See these babel entries.
You will find many such cases because in ORFs, each ORF reagent gets a JCP ID and there are many genes with multiple ORF reagents. The way I deal with these is by computing aggregated profiles by gene names (NCBI ID
or Symbol
). See https://github.com/jump-cellpainting/morphmap/issues/178 for more details.
Distances from the morphmap paper (repo) do not match the ones provided by jump_rr.
This issue resumes an email discussion.
What I have checked
@niranjchandrasekaran Could you please point me towards the code that produces your cosine similarities? Mine are here
Also the hashsums for my files are, the CRISPR dataset has 293 columns and ORF 726.
Obtained using the manifest file.
Finally, as suggested by Shantanu, we can compare the rank correlation between the distance of PAFAH1B1 (as an example) against the other genes. Would you mind sending me/pointing me towards the matrix/vector of cosine distances? Thanks!