broadinstitute / pooled-cell-painting-profiling-recipe

:woman_cook: Recipe repository for image-based profiling of Pooled Cell Painting experiments
BSD 3-Clause "New" or "Revised" License

0./1.process-spots duplicate graphs #19

Closed ErinWeisbart closed 4 years ago

ErinWeisbart commented 4 years ago

When I run CP151A1 through 1.process-spots, the outputs gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are the same.

We need to track down why.

ErinWeisbart commented 4 years ago

Likewise, full_cell_category_scores.tsv is the same as full_cell_category_scores_by_guide.tsv (except the latter has one extra column).
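
A minimal sketch of how "the same except for one extra column" could be checked, using only the Python standard library. The function names (`load_tsv`, `compare_tsv`) and the toy rows are invented for illustration; they are not part of the recipe and the values are not taken from CP151A1.

```python
import csv
import io


def load_tsv(text):
    """Parse TSV text and return (column names, list of row dicts)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    return list(reader.fieldnames or []), rows


def compare_tsv(text_a, text_b):
    """Report column differences and whether rows agree on the shared columns."""
    cols_a, rows_a = load_tsv(text_a)
    cols_b, rows_b = load_tsv(text_b)
    shared = [c for c in cols_a if c in cols_b]

    def project(rows):
        # Keep only the shared columns, preserving row order.
        return [tuple(row[c] for c in shared) for row in rows]

    return {
        "only_in_a": [c for c in cols_a if c not in cols_b],
        "only_in_b": [c for c in cols_b if c not in cols_a],
        "shared_rows_equal": project(rows_a) == project(rows_b),
    }


# Toy data, invented for illustration only (a subset of the real columns).
SCORES = (
    "Parent_Cells\tBarcode_MatchedTo_GeneCode\tBarcode_MatchedTo_Score_mean\n"
    "1\tGENE_A\t0.95\n"
    "2\tGENE_B\t0.80\n"
)
SCORES_BY_GUIDE = (
    "Parent_Cells\tBarcode_MatchedTo_GeneCode\tBarcode_MatchedTo_Barcode\t"
    "Barcode_MatchedTo_Score_mean\n"
    "1\tGENE_A\tACGT\t0.95\n"
    "2\tGENE_B\tTTGA\t0.80\n"
)

report = compare_tsv(SCORES, SCORES_BY_GUIDE)
print(report)
```

If `only_in_a` is empty, `only_in_b` holds a single column, and `shared_rows_equal` is true, the second file is the first plus one column, which is the situation described above.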

gwaybio commented 4 years ago

This seems like a problem. @ErinWeisbart - I think that you're equipped to track this down; do you have the bandwidth? Please let me know if not, I can keep steamrolling through!

ErinWeisbart commented 4 years ago

In the original pipeline, gene_by_cell_category_summary_count.tsv contains: Barcode_MatchedTo_GeneCode | Cell_Category | Cell_Count_Per_Gene | Cell_Class | ImageNumber | site

guide_by_cell_category_summary_count.tsv contains: Barcode_MatchedTo_GeneCode | Barcode_MatchedTo_Barcode | Cell_Category | Cell_Count_Per_Gene | Cell_Class | ImageNumber | site

So gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are the same EXCEPT for an extra column in the latter (Barcode_MatchedTo_Barcode).

In the new pipeline, both gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are the same as the former (noting that Cell_Category is now named Cell_Quality). So if we want to match the original pipeline, guide_by_cell_category_summary_count.tsv needs to have Barcode_MatchedTo_Barcode added to it.
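
The column comparison above can be written down as a small sketch. The column lists are copied from this comment; the helper name `missing_columns` and the rename map are assumptions made for illustration, not code from the recipe.

```python
# Columns of guide_by_cell_category_summary_count.tsv in the original pipeline.
OLD_GUIDE_COLUMNS = [
    "Barcode_MatchedTo_GeneCode",
    "Barcode_MatchedTo_Barcode",
    "Cell_Category",
    "Cell_Count_Per_Gene",
    "Cell_Class",
    "ImageNumber",
    "site",
]

# Columns observed in the same file from the new pipeline
# (identical to the gene-level file, with Cell_Category renamed).
NEW_GUIDE_COLUMNS = [
    "Barcode_MatchedTo_GeneCode",
    "Cell_Quality",
    "Cell_Count_Per_Gene",
    "Cell_Class",
    "ImageNumber",
    "site",
]

# Known rename between the two pipelines.
RENAMED = {"Cell_Category": "Cell_Quality"}


def missing_columns(old, new, renamed):
    """Return old-pipeline columns (after renaming) that are absent from new."""
    expected = [renamed.get(col, col) for col in old]
    return [col for col in expected if col not in new]


missing = missing_columns(OLD_GUIDE_COLUMNS, NEW_GUIDE_COLUMNS, RENAMED)
print(missing)  # ['Barcode_MatchedTo_Barcode']
```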

In the original pipeline, full_cell_category_scores.tsv contains: Parent_Cells | Cell_Category | Barcode_MatchedTo_GeneCode | Barcode_MatchedTo_Score_mean | Barcode_MatchedTo_Score_count | ImageNumber | site

full_cell_category_scores_by_guide.tsv contains: Parent_Cells | Cell_Category | Barcode_MatchedTo_GeneCode | Barcode_MatchedTo_Barcode | Barcode_MatchedTo_Score_mean | Barcode_MatchedTo_Score_count | ImageNumber | site

So full_cell_category_scores.tsv and full_cell_category_scores_by_guide.tsv are the same EXCEPT for a single extra column in the latter (Barcode_MatchedTo_Barcode).

In the new pipeline, full_cell_category_scores.tsv and full_cell_category_scores_by_guide.tsv are the same as their respective .tsv files in the old pipeline, with the addition of one column (cell_quality_method) and noting that Cell_Category is now named Cell_Quality.

I think I'm missing the logic of having both gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv and both full_cell_category_scores.tsv and full_cell_category_scores_by_guide.tsv since the pairs are so similar. Additionally, I believe only full_cell_category_scores_by_guide.tsv is used downstream in our current pipeline.

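It seems that we could fix and simplify the new pipeline by:

- Remove creation of gene_by_cell_category_summary_count.tsv
- Add Barcode_MatchedTo_Barcode to guide_by_cell_category_summary_count.tsv
- Remove creation of full_cell_category_scores.tsv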

gwaybio commented 4 years ago

working on this now

gwaybio commented 4 years ago

> So if we want to match the original pipeline, guide_by_cell_category_summary_count.tsv needs to have Barcode_MatchedTo_Barcode added to it.

I am a bit concerned by this - I am seeing the column Barcode_MatchedTo_Barcode in that file in my current, most up-to-date pipeline. I am wondering if things didn't sync properly, or if some metadata config is wonky.
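
One quick way to rule out a sync problem would be to check the header of the file on each machine. This is a rough sketch using only the standard library; `tsv_header` and `has_column` are hypothetical helpers, not functions from the recipe, and the path in the usage comment depends on the local output directory.

```python
import csv
import gzip


def tsv_header(path):
    """Return the column names from the first line of a .tsv or .tsv.gz file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        # An empty file yields an empty header instead of raising StopIteration.
        return next(csv.reader(handle, delimiter="\t"), [])


def has_column(path, column):
    """True if the named column appears in the file's header row."""
    return column in tsv_header(path)


# Example usage (path is a placeholder for wherever the output was written):
# has_column("guide_by_cell_category_summary_count.tsv", "Barcode_MatchedTo_Barcode")
```

Running the same check on both checkouts would show whether the column is missing from the file itself or only from one copy of the pipeline.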

gwaybio commented 4 years ago

Documenting my approach:

Perturbation Summary Counts

> Remove creation of gene_by_cell_category_summary_count.tsv
> Add Barcode_MatchedTo_Barcode to guide_by_cell_category_summary_count.tsv

I did almost exactly this. I merged the two files together and renamed it cell_perturbation_category_summary_counts.tsv. I need to add a visualization of this file somewhere in 3.visualize-cell-summary.

Removal of full_cell_category_scores.tsv

> Remove creation of full_cell_category_scores.tsv

This is an egregious mistake on my part! Crazy that we were generating this file in the first place. I removed it in a near-future PR. I also renamed the "by guide" scores file to: cell_id_barcode_alignment_scores_by_guide.tsv.gz