Add phenotypic activity (replicate reproducibility) to broad.io/orf and broad.io/crispr

AnneCarpenter commented 7 months ago

I keep looking up genes to find similar genes, and desperately wishing it were possible to be confident the gene itself has a phenotype (and therefore whether to take similar genes seriously). So we'd like to add a column that displays the phenotypic activity (replicate reproducibility) for the query sample.

AnneCarpenter commented 6 months ago

Looks like you've implemented this @afermg Woohoo! I see two new columns: (a) corrected_p_value (b) corrected_p_value Match

I think these are the phenotypic activity for the query gene itself (a) and for the matching gene (b) (and therefore NOT a p value about the confidence of the matching itself). Is that right?

And is it possible to filter a column? I see we can sort based on each column by clicking it but was wondering about filtering to put in a threshold for the matches.

AnneCarpenter commented 6 months ago

(also - did the phenotypic similarity numbers change? I usually use YAP1 as my example since it's from our paper and I thought there were super strong matches but see a max +/- 0.2 now in ORF. This may be my imagination!)

(also wondering what it means when there's no numerical value listed for a particular corrected_p_value Match)

afermg commented 6 months ago

> Looks like you've implemented this @afermg Woohoo! I see two new columns: (a) corrected_p_value (b) corrected_p_value Match
>
> I think these are the phenotypic activity for the query gene itself (a) and for the matching gene (b) (and therefore NOT a p value about the confidence of the matching itself). Is that right?

Indeed, those two columns are the additions, and each value is per gene. I still need to come up with better names; I only recycled the existing ones. "Replicability" is probably a better one.

> And is it possible to filter a column? I see we can sort based on each column by clicking it but was wondering about filtering to put in a threshold for the matches.

Yes, you can filter using the text boxes at the top.

> (also - did the phenotypic similarity numbers change? I usually use YAP1 as my example since it's from our paper and I thought there were super strong matches but see a max +/- 0.2 now in ORF. This may be my imagination!)

Potentially. This is exactly why I used Niranj's numbers: so that they stay consistent. See here for the data sources.

> (also wondering what it means when there's no numerical value listed for a particular corrected_p_value Match)

I hadn't come across those, but it means that those particular genes/compounds were filtered out in Niranj's analysis.
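For anyone applying the same kind of threshold outside the web table, here is a minimal sketch in plain Python. The column name `corrected_p_value Match` is the one discussed above; the rows, the gene names, the `similarity` key, the 0.05 cutoff, and the `filter_matches` helper are all hypothetical and only illustrate the idea. It assumes that a missing value should be treated as "no phenotype call" and dropped, which matches the explanation that such entries were filtered out upstream.

```python
import math

# Hypothetical rows mimicking the table; all values are made up for illustration.
rows = [
    {"Match": "GENE_A", "corrected_p_value Match": 0.01, "similarity": 0.62},
    {"Match": "GENE_B", "corrected_p_value Match": 0.30, "similarity": 0.58},
    {"Match": "GENE_C", "corrected_p_value Match": None, "similarity": 0.55},
]


def filter_matches(rows, max_p=0.05, column="corrected_p_value Match"):
    """Keep rows whose value in `column` is present and at most `max_p`."""
    kept = []
    for row in rows:
        p = row.get(column)
        # Missing values (None or NaN) are skipped: the match has no
        # phenotypic activity estimate, so it cannot pass the threshold.
        if p is None or (isinstance(p, float) and math.isnan(p)):
            continue
        if p <= max_p:
            kept.append(row)
    return kept


strong = filter_matches(rows)
print([row["Match"] for row in strong])
```

With the made-up rows above, only the match with a value below the cutoff is kept; raising `max_p` lets more matches through, but the row without a value is always excluded.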

AnneCarpenter commented 6 months ago

If the text fits, I'd suggest these column headings (I know they're longer but hoping it works, to provide more context for new users): Phenotypic activity of the query (corrected_p_value) Phenotypic activity of the match (corrected_p_value) Similarity between query and match (-1 to +1) Query JCP2022 Match JCP2022 Query gene/compound example Query gene/compound

For the text within the "Match resources" column, the displayed text could be "More about this match" instead of "External".

niranjchandrasekaran commented 6 months ago

@afermg, regarding gene similarity scores: on Shantanu's suggestion, I have been using a modified version of copairs to calculate the cosine similarity values for Morphmap. Shantanu wanted to ensure that we use the same approach (for example, how we deal with NaNs) in all our projects. I have the values precomputed only for the genes with a phenotype, which I can share with you. But if you want all-by-all gene similarity values, you have to recompute them.

AnneCarpenter commented 6 months ago

Thanks for the info. Yes, we do want all-by-all in this public-facing tool. Niranj, that would be good to include with the paper's data. I'm not sure who will do the calculation, but let's be sure it ends up in the MorphMap paper's materials. @shntnu is it ok that we are using a modified version of copairs?

afermg commented 6 months ago

> If the text fits, I'd suggest these column headings (I know they're longer but hoping it works, to provide more context for new users): Phenotypic activity of the query (corrected_p_value) Phenotypic activity of the match (corrected_p_value) Similarity between query and match (-1 to +1) Query JCP2022 Match JCP2022 Query gene/compound example Query gene/compound
>
> for the text within the column on Match resources, the actual text displayed could be "More about this match" instead of "External"

I can make those adjustments, but I'd prefer to keep the column names as short as we can so that more data is visible at a glance. Instead, we can add a legend at the top with a brief description of what each column means.

shntnu commented 6 months ago

> @shntnu is it ok we are using a modified version of copairs?

@niranjchandrasekaran can you remind me what this modification is?

niranjchandrasekaran commented 6 months ago

> can you remind me what is this modification?

Ah, I should clarify. I am only modifying the average_precision() function to output the cosine similarity matrix. The following are the major changes:

Rather than create a Matcher object that stores a list of gene pair tuples, I create this list directly, because for this matrix all genes are paired with all other genes.
I don't compute similarities for negative pairs, because there are none in this setting.
Once I compute the positive-pair similarities the same way copairs does, I arrange them into a matrix rather than use them to compute average precision.
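The end result of the changes listed above is an all-by-all cosine similarity matrix. As a rough illustration of that output only, here is a minimal NumPy sketch. It is not the copairs code or the modified average_precision() function; the `cosine_similarity_matrix` name and the toy profiles are hypothetical. It also does not handle NaNs, since the thread does not say how those are treated, so inputs are assumed to be NaN-free with no all-zero rows.

```python
import numpy as np


def cosine_similarity_matrix(profiles):
    """All-by-all cosine similarity between the rows of `profiles`.

    Assumes a 2D array (one row per gene) without NaNs or all-zero rows.
    """
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    if np.any(norms == 0):
        raise ValueError("Cosine similarity is undefined for all-zero rows.")
    # After scaling each row to unit length, the dot product of two rows
    # equals their cosine similarity.
    unit = profiles / norms
    return unit @ unit.T


# Toy profiles, made up for illustration: four "genes", two features each.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

The matrix is symmetric with ones on the diagonal, and values range from -1 to +1, which matches the "Similarity between query and match (-1 to +1)" column discussed earlier in the thread.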

> yes we do want all by all in this public facing tool. Niranj, that would be good to include with the paper's data. I'm not sure who will do the calculation but let's be sure it ends up in the MorphMap paper's materials.

@AnneCarpenter Yes, the tool will be included with the paper data. Also, I computed the all-by-all cosine similarity matrix myself. @afermg you can find it here: https://drive.google.com/drive/folders/14sOgJ1vYxbzmMoqDZhNqvo2DgpQ-KeJi?usp=drive_link

afermg commented 6 months ago

Thanks Niranj! Would it be sensible to put those files on CPG, somewhere by the profiles? Otherwise, I can upload them to Zenodo and fetch the information from there.

niranjchandrasekaran commented 6 months ago

It would make sense to put it on CPG. But I don't know what @shntnu's plans for CPG are. Any file associated with JUMP will be large, and it would be preferable to put it on CPG, but we also shouldn't complicate the folder structure on CPG. But if that's not a concern, then these files should go on CPG.

shntnu commented 5 months ago

> Ah, I should clarify. I am only modifying the average_precision() function to output the cosine similarity matrix. The following are the major changes
>
> Rather than create a Matcher object that stores a list of gene pair tuples, I directly create this list because in the case of creating this matrix, all genes are paired with all other genes.
>
> I don't compute negative similarities because they don't exist
>
> Once I compute the positive similarities the same way copairs does, rather than use it to compute average precision, I create a matrix.

@alxndrkalinin Please chat with Niranj when he is in next (next profiling checkin) to decide if this is something that should be possible to do directly in copairs and how. I've forgotten all details :D

shntnu commented 5 months ago

> It would make sense to put it on CPG. But I don't know what @shntnu's plans for CPG are. Any file associated with JUMP will be large, and it would be preferable to put it on CPG, but we also shouldn't complicate the folder structure on CPG. But if that's not a concern, then these files should go on CPG.

Erin proposed we use CPG for this. See https://github.com/broadinstitute/cellpainting-gallery/issues/31#issue-1574957054

Here's the structure:

└── workspace
    └── publication_data
        └── YEAR_FIRSTAUTHOR
            ├── large_file_example1.csv.gz
            ├── large_file_example2.csv.gz
            └── large_file_example3.csv.gz

Please see the discussion in https://github.com/broadinstitute/cellpainting-gallery/issues/31#issue-1574957054 for details.

afermg commented 5 months ago

@niranjchandrasekaran just for the record (given our previous meeting with John and Alex), my current GPU computation of cosine distances is here. It may change soon to batch the input data (for the Compounds dataset specifically). If we want to make sure that these are the same distances as in sklearn and copairs, we may want to put the function somewhere shared.

For the GPU code, the CuPy dependencies are essential. You can find them here. Note that the provider of the libraries is NVIDIA, not PyPI.
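To make the consistency concern concrete, here is a minimal sketch of how a batched cosine distance computation could be checked against a plain reference. It is not the GPU code linked above: NumPy stands in for CuPy so the example runs on a CPU, and the `cosine_distance_batched` and `cosine_distance_reference` names, the batch size, and the random test data are all hypothetical. The reference follows the textbook definition pair by pair; a library routine such as scikit-learn's `cosine_distances` could be swapped in for the comparison.

```python
import numpy as np


def cosine_distance_batched(profiles, batch_size=1024):
    """All-by-all cosine distance (1 - cosine similarity), computed in row batches.

    Assumes a 2D array without NaNs or all-zero rows.
    """
    profiles = np.asarray(profiles, dtype=float)
    unit = profiles / np.linalg.norm(profiles, axis=1, keepdims=True)
    n = unit.shape[0]
    out = np.empty((n, n), dtype=float)
    # Only `batch_size` rows are multiplied against the full matrix at a time,
    # which bounds the size of each intermediate product.
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        out[start:stop] = 1.0 - unit[start:stop] @ unit.T
    return out


def cosine_distance_reference(profiles):
    """Slow pair-by-pair reference, straight from the definition."""
    profiles = np.asarray(profiles, dtype=float)
    n = profiles.shape[0]
    out = np.empty((n, n), dtype=float)
    for i in range(n):
        for j in range(n):
            a, b = profiles[i], profiles[j]
            out[i, j] = 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return out


# Random data, made up for illustration; the small batch size forces several batches.
rng = np.random.default_rng(0)
x = rng.normal(size=(7, 5))
same = np.allclose(cosine_distance_batched(x, batch_size=3), cosine_distance_reference(x))
print(same)
```

A check of this kind compares results up to floating-point tolerance rather than exact equality, since batching and GPU execution can change the order of floating-point operations.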

broadinstitute / monorepo

Add phenotypic activity (replicate reproducibility) to broad.io/orf and broad.io/crispr #25