Closed ErinWeisbart closed 4 years ago
Likewise, full_cell_category_scores.tsv is the same as full_cell_category_scores_by_guide.tsv (except the latter has one extra column).
This seems like a problem. @ErinWeisbart - I think that you're equipped to track this down. Do you have the bandwidth? Please let me know if not; I can keep steamrolling through!
In the original pipeline, gene_by_cell_category_summary_count.tsv contains:
Barcode_MatchedTo_GeneCode | Cell_Category | Cell_Count_Per_Gene | Cell_Class | ImageNumber | site
guide_by_cell_category_summary_count.tsv contains:
Barcode_MatchedTo_GeneCode | Barcode_MatchedTo_Barcode | Cell_Category | Cell_Count_Per_Gene | Cell_Class | ImageNumber | site
So gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are the same EXCEPT for an extra column in the latter (Barcode_MatchedTo_Barcode).
In the new pipeline, both gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv have the same columns as the original gene_by_cell_category_summary_count.tsv (noting that Cell_Category is now named Cell_Quality). So if we want to match the original pipeline, guide_by_cell_category_summary_count.tsv needs to have Barcode_MatchedTo_Barcode added to it.
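One way to confirm this kind of difference is to compare the header rows of the two files directly. This is a minimal sketch, assuming tab-separated files with a single header row; `read_header` and `header_diff` are hypothetical helper names, and the inline strings stand in for the real files, using the original-pipeline column names listed above:

```python
import csv
import io


def read_header(handle):
    """Return the first row of a tab-separated file as a list of column names."""
    return next(csv.reader(handle, delimiter="\t"))


def header_diff(left_cols, right_cols):
    """Return (columns only in left, columns only in right), preserving order."""
    only_left = [c for c in left_cols if c not in right_cols]
    only_right = [c for c in right_cols if c not in left_cols]
    return only_left, only_right


# Stand-ins for the two files (header rows only, hypothetical in-memory copies).
gene_tsv = io.StringIO(
    "Barcode_MatchedTo_GeneCode\tCell_Category\tCell_Count_Per_Gene\t"
    "Cell_Class\tImageNumber\tsite\n"
)
guide_tsv = io.StringIO(
    "Barcode_MatchedTo_GeneCode\tBarcode_MatchedTo_Barcode\tCell_Category\t"
    "Cell_Count_Per_Gene\tCell_Class\tImageNumber\tsite\n"
)

only_gene, only_guide = header_diff(read_header(gene_tsv), read_header(guide_tsv))
print("only in gene file:", only_gene)
print("only in guide file:", only_guide)
```

Running the same comparison on the new pipeline's outputs should show whether Barcode_MatchedTo_Barcode is actually missing from the guide-level file.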
In the original pipeline, full_cell_category_scores.tsv contains:
Parent_Cells | Cell_Category | Barcode_MatchedTo_GeneCode | Barcode_MatchedTo_Score_mean | Barcode_MatchedTo_Score_count | ImageNumber | site
full_cell_category_scores_by_guide.tsv contains:
Parent_Cells | Cell_Category | Barcode_MatchedTo_GeneCode | Barcode_MatchedTo_Barcode | Barcode_MatchedTo_Score_mean | Barcode_MatchedTo_Score_count | ImageNumber | site
So full_cell_category_scores.tsv and full_cell_category_scores_by_guide.tsv are the same EXCEPT for a single extra column in the latter (Barcode_MatchedTo_Barcode).
In the new pipeline, full_cell_category_scores.tsv and full_cell_category_scores_by_guide.tsv are the same as their respective counterparts in the old pipeline, with the addition of one column (cell_quality_method) and noting that Cell_Category is now named Cell_Quality.
I think I'm missing the logic of having both gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv, and both full_cell_category_scores.tsv and full_cell_category_scores_by_guide.tsv, since the pairs are so similar. Additionally, I believe only full_cell_category_scores_by_guide.tsv is used downstream in our current pipeline.
It seems that we could fix and simplify the new pipeline by:
Remove creation of gene_by_cell_category_summary_count.tsv
Add Barcode_MatchedTo_Barcode to guide_by_cell_category_summary_count.tsv
Remove creation of full_cell_category_scores.tsv
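If only the guide-level file is kept, gene-level counts could in principle be recovered from it by summing over guides. The sketch below rests on an assumption the thread does not confirm: that the count column in the guide-level file holds per-guide counts. `collapse_to_gene` is a hypothetical helper, and the rows are made-up values that show only a subset of the real columns:

```python
from collections import defaultdict


def collapse_to_gene(guide_rows,
                     drop="Barcode_MatchedTo_Barcode",
                     count="Cell_Count_Per_Gene"):
    """Sum the count column over guides, grouping on every other column."""
    totals = defaultdict(int)
    for row in guide_rows:
        # Group key: all columns except the guide barcode and the count itself.
        key = tuple((k, v) for k, v in row.items() if k not in (drop, count))
        totals[key] += int(row[count])
    return [{**dict(key), count: total} for key, total in totals.items()]


# Invented example rows (ASSUMPTION: counts are per guide).
guide_rows = [
    {"Barcode_MatchedTo_GeneCode": "GENE_A", "Barcode_MatchedTo_Barcode": "AAAA",
     "Cell_Category": "1", "Cell_Count_Per_Gene": "3"},
    {"Barcode_MatchedTo_GeneCode": "GENE_A", "Barcode_MatchedTo_Barcode": "CCCC",
     "Cell_Category": "1", "Cell_Count_Per_Gene": "2"},
    {"Barcode_MatchedTo_GeneCode": "GENE_B", "Barcode_MatchedTo_Barcode": "GGGG",
     "Cell_Category": "1", "Cell_Count_Per_Gene": "4"},
]

gene_rows = collapse_to_gene(guide_rows)
for row in gene_rows:
    print(row)
```

If the counts in the guide-level file are already per gene, this aggregation would overcount, so it is worth checking against the original pipeline's output before relying on it.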
working on this now
So if we want to match the original pipeline, guide_by_cell_category_summary_count.tsv needs to have Barcode_MatchedTo_Barcode added to it.
I am a bit concerned by this - I am seeing the column Barcode_MatchedTo_Barcode in that file in my current, most up-to-date pipeline. I am wondering if things didn't sync properly, or if some metadata config is wonky.
Documenting my approach:
Remove creation of gene_by_cell_category_summary_count.tsv
Add Barcode_MatchedTo_Barcode to guide_by_cell_category_summary_count.tsv
I did almost exactly this. I merged the two files together and renamed it cell_perturbation_category_summary_counts.tsv. I need to add a visualization of this file somewhere in 3.visualize-cell-summary.
Remove creation of full_cell_category_scores.tsv
This is an egregious mistake on my part! Crazy that we were generating this file in the first place. I removed it in a near-future PR. I also renamed the "by guide" scores file to: cell_id_barcode_alignment_scores_by_guide.tsv.gz
When I run CP151A1 through 1.process-spots, the outputs gene_by_cell_category_summary_count.tsv and guide_by_cell_category_summary_count.tsv are the same.
We need to track down why.
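A quick first check when tracking this down is to confirm whether the two outputs are byte-for-byte identical or only look alike. This is a minimal sketch using only the standard library; `file_digest` and `same_contents` are hypothetical helper names, and the temporary files with shortened headers stand in for the real outputs:

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_contents(path_a, path_b):
    """True when both files contain exactly the same bytes."""
    return file_digest(path_a) == file_digest(path_b)


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in files (hypothetical contents, headers shortened for brevity).
    gene = Path(tmp) / "gene_by_cell_category_summary_count.tsv"
    guide = Path(tmp) / "guide_by_cell_category_summary_count.tsv"

    gene.write_text("Barcode_MatchedTo_GeneCode\tsite\n")
    guide.write_text("Barcode_MatchedTo_GeneCode\tsite\n")
    identical = same_contents(gene, guide)

    # Once the guide file carries its extra column, the digests diverge.
    guide.write_text("Barcode_MatchedTo_GeneCode\tBarcode_MatchedTo_Barcode\tsite\n")
    differs = not same_contents(gene, guide)

print(identical, differs)
```

Identical digests would point at both files being written from the same table, whereas differing digests with matching headers would suggest the problem is in the row contents.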