WayScience / phenotypic_profiling

Machine learning for predicting 15 single-cell phenotypes from cell morphology profiles
Creative Commons Attribution 4.0 International

Single cell count discrepancy in LOIO analysis #54

Closed gwaybio closed 6 months ago

gwaybio commented 7 months ago

I see three different labeled cell counts, and I would like to confirm the correct total.

| Number | Source |
| --- | --- |
| 2,916 | Manuscript: "Once removing A1 wells, we were left with 2,916 cells from the remaining wells." |
| 2,862 | https://github.com/WayScience/phenotypic_profiling_model/blob/main/1.split_data/explore_data.ipynb |
| 5,702 | https://github.com/WayScience/phenotypic_profiling_model/blob/main/3.evaluate_model/evaluations/LOIO_probas/compiled_LOIO_probabilities.tsv |

Maybe there is something wonky going on with LOIO?

@roshankern do you know why we're seeing these cell count discrepancies?

gwaybio commented 7 months ago

The number of images we report in the manuscript and in LOIO is consistent (270)

gwaybio commented 7 months ago

Oooh, is it that the cell_UUIDs are different in IC vs. no-IC?

If so, is there a way to align cell_UUIDs?

roshankern commented 7 months ago

> Oooh, is it that the cell_UUIDs are different in IC vs. no-IC?

Yep, this is definitely the case.
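As a rough illustration of how this could inflate the total: if the compiled LOIO file holds predictions for both the IC and no-IC datasets, counting unique IDs across the whole file counts each physical cell under two different IDs. The sketch below counts unique IDs per dataset type instead. The column names `Cell_UUID` and `illumination_correction`, and the toy rows, are assumptions for illustration and may not match the real TSV.

```python
import csv
import io
from collections import defaultdict


def count_unique_cells(tsv_file, id_col="Cell_UUID", group_col="illumination_correction"):
    """Count distinct cell IDs overall and per dataset type.

    The default column names are hypothetical; adjust them to the real file.
    """
    ids_by_group = defaultdict(set)
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        ids_by_group[row[group_col]].add(row[id_col])

    per_group = {group: len(ids) for group, ids in ids_by_group.items()}
    overall = len(set().union(*ids_by_group.values())) if ids_by_group else 0
    return overall, per_group


# Toy data: two cells per dataset type, one of them predicted twice.
toy = io.StringIO(
    "Cell_UUID\tillumination_correction\tpredicted_phenotype\n"
    "a1\tic\tInterphase\n"
    "a2\tic\tProphase\n"
    "b1\tno_ic\tInterphase\n"
    "b2\tno_ic\tProphase\n"
    "b2\tno_ic\tInterphase\n"
)
overall, per_group = count_unique_cells(toy)
print(overall, per_group)
```

With data shaped like this, the overall count is the sum of the per-dataset counts, because no ID is shared between the two datasets.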

> If so, is there a way to align cell_UUIDs?

We could try to match cells from the IC dataset to the corresponding closest cells in the no-IC dataset by location (plate, well, frame, center_x, center_y). The only issue is that we have a different number of cells in these datasets so some cells are not represented in both datasets and thus would have no match.
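A minimal sketch of that matching idea, assuming each cell is a dict with `cell_uuid`, `plate`, `well`, `frame`, `center_x`, and `center_y` keys. The `max_distance` threshold, the greedy closest-first strategy, and the toy values are assumptions for illustration, not something already implemented in the repo. Cells with no counterpart within the threshold are returned as unmatched rather than forced into a pair.

```python
import math
from collections import defaultdict


def match_cells_by_location(ic_cells, no_ic_cells, max_distance=5.0):
    """Greedily pair IC cells with the closest unused no-IC cell in the same image.

    Returns (matches, unmatched_ic, unmatched_no_ic), where matches maps
    IC cell_uuid -> no-IC cell_uuid.
    """
    # Index no-IC cells by image so distances are only compared within an image.
    no_ic_by_image = defaultdict(list)
    for cell in no_ic_cells:
        no_ic_by_image[(cell["plate"], cell["well"], cell["frame"])].append(cell)

    # Collect every candidate pair that falls within the distance threshold.
    candidates = []
    for ic in ic_cells:
        key = (ic["plate"], ic["well"], ic["frame"])
        for other in no_ic_by_image.get(key, []):
            dist = math.hypot(
                ic["center_x"] - other["center_x"],
                ic["center_y"] - other["center_y"],
            )
            if dist <= max_distance:
                candidates.append((dist, ic["cell_uuid"], other["cell_uuid"]))

    # Accept the closest pairs first; each cell is used at most once.
    matches = {}
    used_no_ic = set()
    for _, ic_id, no_ic_id in sorted(candidates):
        if ic_id not in matches and no_ic_id not in used_no_ic:
            matches[ic_id] = no_ic_id
            used_no_ic.add(no_ic_id)

    unmatched_ic = [c["cell_uuid"] for c in ic_cells if c["cell_uuid"] not in matches]
    unmatched_no_ic = [
        c["cell_uuid"] for c in no_ic_cells if c["cell_uuid"] not in used_no_ic
    ]
    return matches, unmatched_ic, unmatched_no_ic


def make_cell(uuid, x, y):
    # Toy helper: every cell sits in the same hypothetical plate/well/frame.
    return {"cell_uuid": uuid, "plate": "P1", "well": "A2", "frame": 1,
            "center_x": x, "center_y": y}


ic_cells = [make_cell("ic-1", 10.0, 10.0), make_cell("ic-2", 50.0, 50.0),
            make_cell("ic-3", 90.0, 90.0)]
no_ic_cells = [make_cell("n-1", 10.5, 9.5), make_cell("n-2", 51.0, 50.0),
               make_cell("n-3", 200.0, 200.0)]

matches, unmatched_ic, unmatched_no_ic = match_cells_by_location(ic_cells, no_ic_cells)
print(matches, unmatched_ic, unmatched_no_ic)
```

The greedy pass is simple but not guaranteed to find the globally best assignment in crowded images; an optimal assignment method could be swapped in if that turns out to matter.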

roshankern commented 7 months ago

Unfortunately I completely missed https://github.com/WayScience/phenotypic_profiling_model/blob/main/1.split_data/explore_data.ipynb in #49. @gwaybio would it be easy for you to change this notebook to use the new datasets? If not I could try to do this in the near future as well.

It seems you are saving the correlation files from these notebooks in this repo but not pushing to GitHub:

    # Output to file
    output_file = f"{output_basename}_{feature_space}.tsv.gz"
    cp_tidy_corr_df.to_csv(output_file, sep="\t", index=False)

gwaybio commented 7 months ago

> We could try to match cells from the IC dataset to the corresponding closest cells in the no-IC dataset by location (plate, well, frame, center_x, center_y). The only issue is that we have a different number of cells in these datasets so some cells are not represented in both datasets and thus would have no match.

Sounds good - we can revisit this decision in the future if needed.

> would it be easy for you to change this notebook to use the new datasets?

I'm not sure what you mean - do you mean we would need to apply the non-ic dataset in this analysis as well?

roshankern commented 7 months ago

> Sounds good - we can revisit this decision in the future if needed.

Awesome, I'll use this methodology to associate cells across the IC and no-IC datasets.

> I'm not sure what you mean - do you mean we would need to apply the non-ic dataset in this analysis as well?

This depends on if we want to perform this analysis (notebook is for pairwise correlations between single-cells) on the no-ic dataset as well (I think this is your final call). If we just want to perform this analysis for ic data, we can simply change the labeled_data_path to pathlib.Path("../0.download_data/data/labeled_data__ic.csv.gz").

I am having trouble understanding how we use output from this notebook (pairwise correlations between single-cells). It seems that you saved the output tsvs to 1.split_data/data but these files did not get uploaded to GitHub. Maybe this is a deprecated analysis that can be deleted?

If we want this analysis on ic and no-ic datasets I can modify the notebook to iterate over both datasets, but I am unsure of how to save the output tsv files. I assume they would not belong on the GitHub as you did not push them to the repo before.
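One possible way to handle both datasets, sketched under assumptions: keep the existing `{output_basename}_{feature_space}.tsv.gz` naming and add the dataset name to it so the two runs do not overwrite each other. The no-IC file name, the `pairwise_correlations` basename, and the `CP`/`DP` feature-space values below are hypothetical placeholders.

```python
import pathlib

data_dir = pathlib.Path("../0.download_data/data")

# The IC path is the one mentioned above; the no-IC file name is a guess.
labeled_data_paths = {
    "ic": data_dir / "labeled_data__ic.csv.gz",
    "no_ic": data_dir / "labeled_data__no_ic.csv.gz",  # hypothetical name
}


def correlation_output_path(output_dir, output_basename, dataset, feature_space):
    """Build a per-dataset output file name following the existing pattern."""
    file_name = f"{output_basename}_{dataset}_{feature_space}.tsv.gz"
    return pathlib.Path(output_dir) / file_name


# Hypothetical basename and feature spaces, for illustration only.
outputs = [
    correlation_output_path("data", "pairwise_correlations", dataset, feature_space)
    for dataset in labeled_data_paths
    for feature_space in ("CP", "DP")
]
for path in outputs:
    print(path.as_posix())
```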

Let me know the objective of this analysis/notebook and I can modify the notebook to accomplish this.

gwaybio commented 7 months ago

> This depends on if we want to perform this analysis (notebook is for pairwise correlations between single-cells) on the no-ic dataset as well (I think this is your final call).

Gotcha! We do not need to do this. My view is that we use the IC model for everything except to confirm that IC is not impacting LOIO performance.

> I am having trouble understanding how we use output from this notebook (pairwise correlations between single-cells). It seems that you saved the output tsvs to 1.split_data/data but these files did not get uploaded to GitHub. Maybe this is a deprecated analysis that can be deleted?

Ah, good questions! This analysis is important, and documentation can be improved. See cell 5 in https://github.com/WayScience/phenotypic_profiling_model/blob/main/7.figures/Figure2_UMAP_and_Correlation.ipynb

> If we want this analysis on ic and no-ic datasets I can modify the notebook to iterate over both datasets, but I am unsure of how to save the output tsv files. I assume they would not belong on the GitHub as you did not push them to the repo before.

My instincts are that we don't need to align them. Supplementary Figure 6 shows only minimal impact between ic and no-ic. Do you expect that aligning single cells will show a different result?

> Let me know the objective of this analysis/notebook and I can modify the notebook to accomplish this.

The objective is to determine how IC impacts LOIO results. Based on the previous analysis, we are able to make the following statement:

> Poor LOIO performance was not a result of illumination correction, which we hypothesized could have introduced technical effects given our batched IDR_stream image processing, nor by our decision to balance models by uneven class distributions (Supplementary Figure 6B).

If matching single cells will give us a better answer and the analysis won't be too difficult, then I'd say go for it.

roshankern commented 6 months ago

> Gotcha! We do not need to do this. My view is that we use the IC model for everything except to confirm that IC is not impacting LOIO performance.

Sounds good! In this case I will file a small PR to simply change the labeled_data_path to pathlib.Path("../0.download_data/data/labeled_data__ic.csv.gz").

> My instincts are that we don't need to align them. Supplementary Figure 6 shows only minimal impact between ic and no-ic. Do you expect that aligning single cells will show a different result?

Nope, I wouldn't expect aligning single cells to show a different result, and I would expect it to be difficult to refactor the repository for this adjustment. Let's leave the cells unaligned 👍

gwaybio commented 6 months ago

@roshankern - is it safe to close this issue?

roshankern commented 6 months ago

Thank you for the ping on this! @gwaybio are you able to review and/or merge #65? Then we can close this issue.

gwaybio commented 6 months ago

#65 now merged