
Cell Painting Gallery
https://broadinstitute.github.io/cellpainting-gallery/
MIT License

make all load_data.csv match cpg? #69

Open ErinWeisbart opened 9 months ago

ErinWeisbart commented 9 months ago

For datasets that were processed in a location other than the cellpainting-gallery, the load_data.csv will have file paths that have the structure of the former location.

For folks downloading the files, it's a minor annoyance to update.

For folks using Distributed-CellProfiler and wanting to work straight off the cellpainting-gallery, it's more than a minor annoyance: it expects the load_data.csv to be in the same bucket as the files, making it nigh impossible to use the images without copying files to their own location. I don't expect that this happens often, since we're providing image-based profiles for all our datasets, but it is an impediment to making the images as reusable as possible.

Do we want to do any of the following (nonexclusive categories)?

1. Require that datasets (going forward) have correct file paths in load_data.csv upon deposition.
2. Update load_data.csv files with correct paths if/when IP folks generate them as part of their workflows.
3. Perform a comprehensive update of all load_data.csv files already in the gallery.

ErinWeisbart commented 7 months ago

For @leoank to build the validation, we should have an answer to this question. @shntnu thoughts?

shntnu commented 7 months ago

For JUMP, we did this by hand for each source using this script, and that turned out to be super useful not just for CellProfiler use but also as an index into the images (e.g. see the example at the end of https://github.com/jump-cellpainting/datasets/blob/main/sample_notebook.ipynb)

I'd love to be able to achieve the same for all datasets in CPG (going forward), but I don't know how. Realistically, it would require rewriting https://github.com/broadinstitute/pe2loaddata so that {load_data|load_data_with_illum}.{parquet|csv} are created from scratch by reading from S3, because fixing these by hand is too difficult (i.e., doing 1 above is difficult).

But even that would work only for PerkinElmer microscopes, because we don't have an equivalent of pe2loaddata that is generic.
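A generic, microscope-agnostic replacement would presumably list the image keys on S3 and build load_data rows from scratch. A minimal sketch of the grouping step, assuming a PerkinElmer-like `rXXcXXfXXpXX-chN` filename pattern and generic `ch1`/`ch2` channel names (the regex, column names, and layout here are illustrative assumptions, not pe2loaddata's actual behavior):

```python
import posixpath
import re
from collections import defaultdict

# Hypothetical filename pattern: r<row>c<col>f<field>p<plane>-ch<channel>
CHANNEL_RE = re.compile(r"r(?P<row>\d{2})c(?P<col>\d{2})f(?P<field>\d{2})p\d{2}-ch(?P<ch>\d)")

def keys_to_load_data(keys, bucket="cellpainting-gallery"):
    """Group image keys by (well, field) and emit one load_data row per site."""
    sites = defaultdict(dict)
    for key in keys:
        m = CHANNEL_RE.search(posixpath.basename(key))
        if not m:
            continue  # skip non-image keys
        well = f"{m['row']}{m['col']}"
        sites[(well, m["field"])][f"ch{m['ch']}"] = key
    rows = []
    for (well, field), channels in sorted(sites.items()):
        row = {"Metadata_Well": well, "Metadata_Site": field}
        for ch, key in sorted(channels.items()):
            # CellProfiler LoadData convention: paired FileName_/PathName_ columns
            row[f"FileName_Orig{ch}"] = posixpath.basename(key)
            row[f"PathName_Orig{ch}"] = f"s3://{bucket}/{posixpath.dirname(key)}"
        rows.append(row)
    return rows
```

The key point is that the PathName columns are derived directly from the S3 keys, so they are correct for the gallery by construction.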

So I think we should soften 1:

"where possible, datasets (going forward) should have correct file paths in load_data.csv upon deposition"

What do you think @ErinWeisbart?

ErinWeisbart commented 7 months ago

> "where possible, datasets (going forward) should have correct file paths in load_data.csv upon deposition"

I think that's fine.

I was thinking about using our Distributed- tools. Currently they are hampered by being forced to read the load_data.csv from the same bucket as the images (https://github.com/DistributedScience/Distributed-CellProfiler/issues/160), which I can update soon, but it made me think: maybe we want to distinguish in cpg between load_data that are updated to correct cpg paths and load_data that retain their original paths, for clarity and user-friendliness? e.g. load_data and load_data_cpg folders? or load_data_orig and load_data folders?

ErinWeisbart commented 3 months ago

@shntnu , helping get @PaulaLlanos oriented to this and wanting to confirm, all of the JUMP load_data.csvs are correct for their current CPG location, yes?

ErinWeisbart commented 3 months ago

To get started on this, we want to know the scope of the issue. So, @PaulaLlanos will inventory for each project in the gallery:


shntnu commented 2 months ago

> @shntnu , helping get @PaulaLlanos oriented to this and wanting to confirm, all of the JUMP load_data.csvs are correct for their current CPG location, yes?

Only cpg0016; none of the others

shntnu commented 2 months ago

> maybe we wanted to distinguish in cpg between load_data that are updated to correct cpg paths and load_data that are original paths for clarity and user-friendliness? e.g. load_data and load_data_cpg folders? or load_data_orig and load_data folders?

This sounds wise but I think it's even wiser to figure out what strategy would make writing the validator easier (and we should do so for any new structure we introduce going forward)

PaulaLlanos commented 2 months ago

@ErinWeisbart does the new pe2loaddata_cpg create the load_data.csv pointing to the path in cpg? If so, for the csv files where the paths are the mounted ones (home/ubuntu/bucket), should I regenerate the load_data.csv files using the new step function?

ErinWeisbart commented 2 months ago

@PaulaLlanos I think we want your inventory complete before we make a final decision about how/if we are going to change existing paths.

pe2loaddata writes file paths as mounted paths (home/ubuntu/bucket). We now have a step function set up to make it easy to use off of cellpainting-gallery; however, it only works on images taken on PerkinElmer microscopes, so it will not work on all datasets in the gallery.

shntnu commented 2 months ago

I'll note that it's now easy to find files using the cpg index e.g. https://gist.github.com/shntnu/a57ac6413ed41c653b566c885f29f95a so consider that instead of ls-ing

PaulaLlanos commented 2 months ago

Thank you @shntnu ! I used it. I did an inventory with 3 columns (attached):

- dataset_id: e.g. cpg0001
- contains string: Boolean indicating whether the path in the load_data.csv file contains the string "home/ubuntu/bucket"
- load_data_path: the full path from the load_data.csv file. If the load_data.csv file is not present in the project, the column shows 'Load_data not found in CPG'. If the file is present but empty, the column shows 'empty_load_data'.
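The per-file check described above could be sketched as follows (the category strings beyond the two quoted in the inventory are illustrative assumptions):

```python
def classify_load_data(pathnames):
    """Classify one load_data.csv's path entries for the inventory.

    pathnames is None when no load_data.csv was found, an empty list when
    the file exists but is empty, else the list of path strings it contains.
    """
    if pathnames is None:
        return "Load_data not found in CPG"
    if not pathnames:
        return "empty_load_data"
    if any("home/ubuntu/bucket" in p for p in pathnames):
        return "mounted (home/ubuntu/bucket)"
    if any(p.startswith("s3://cellpainting-gallery/") for p in pathnames):
        return "absolute (s3://cellpainting-gallery/)"
    return "other"
```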

However, I didn't find any path with the s3://cellpainting-gallery/ prefix. @ErinWeisbart I remember that you showed me a project containing that path. Could you point it out again so I can check why it is not detected by my code?

Moreover, do you think I should add more information to the inventory?

results.csv

ErinWeisbart commented 2 months ago

@PaulaLlanos Can you regenerate your results.csv with a column showing the batch and plate that it is querying? At a glance, it looks like most/all projects with TRUE rows also have a 'Load_data not found in CPG' row, and I'm wondering whether that's a truly missing load_data.csv or whether it's looking at an errant prefix?

cpg0016-jump is the project that has s3://cellpainting-gallery/ in the file paths. I think they are uploaded as .csv.gz so you will want your script to be able to catch either .csv or .csv.gz.

At least the pooled projects (e.g. cpg0021-periscope), and maybe others, have load_data files, but they are named differently - e.g. cpg0021-periscope has a load_data for each step - and your .csv reports that it doesn't have any. So perhaps there could be an additional query to flag "has files in workspace/load_data but with nonstandard naming".
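The matching Erin suggests might look like this, with a pattern that accepts both `.csv` and `.csv.gz` and flags nonstandard names (the exact "standard" name set, including `load_data_with_illum`, is an assumption based on names mentioned earlier in this thread):

```python
import re

# Accept load_data.csv, load_data.csv.gz, and load_data_with_illum variants.
STANDARD = re.compile(r"^load_data(_with_illum)?\.csv(\.gz)?$")

def load_data_status(filenames):
    """Summarize the filenames found under a project's workspace/load_data."""
    if not filenames:
        return "no load_data files"
    if any(STANDARD.match(f) for f in filenames):
        return "standard"
    return "has files in workspace/load_data but nonstandard naming"
```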

PaulaLlanos commented 2 months ago

oh yes, I will check to debug that. Thank you so much!

PaulaLlanos commented 2 weeks ago

Hello! Here I attach the updated inventory csv, as discussed with @ErinWeisbart.

This table contains:

- CPG_id
- Plate
- Batch
- path point out /home/ubuntu: True or False, whether the path contains the string '/home/ubuntu'
- load_Data_path: path in load_data.csv
- path contain CPG: True or False, whether the path contains the string 'cpg'

results_cpg.csv

Also some numbers:

- Total number of batches with 'path point out /home/ubuntu' True: 102
- Total number of batches with 'path point out /home/ubuntu' False: 141
- Total number of batches with 'path contain CPG' True: 159
- Total number of batches with 'path contain CPG' False: 84

Let me know if you have any questions.

ErinWeisbart commented 1 day ago

@shntnu I think it would be helpful for metadata standardization within the CPG to have all load_data.csv files use paths describing their in-CPG location (instead of their location before being transferred to cpg), and therefore to also require this for datasets going forward.

I don't think we need to enforce a particular prefix (mounted (/home/ubuntu/bucket) vs. absolute (s3://cellpainting-gallery/)), as different workflows will require different prefixes, and I think it's a minimal lift to expect a user to find and replace a prefix. (I've already put in a PR for Distributed-CellProfiler to granularly control which bucket it reads images from, reads metadata from, and outputs to, which also makes it easy to read load_data from CPG with the DCP-required prefix or from elsewhere with a different prefix. As I'm typing this out, I think I can also add a further upgrade to DCP to enable reading absolute paths as well...)
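The find-and-replace in question is mechanical. A sketch, assuming the CellProfiler LoadData convention that image directories live in `PathName_*` columns (the prefixes are the two discussed above; everything else in the file is left untouched):

```python
import csv
import io

MOUNTED = "/home/ubuntu/bucket"
ABSOLUTE = "s3://cellpainting-gallery"

def swap_prefix(load_data_text, old=MOUNTED, new=ABSOLUTE):
    """Rewrite the prefix of every PathName_* column in a load_data.csv."""
    reader = csv.DictReader(io.StringIO(load_data_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        for col in row:
            if col.startswith("PathName_") and row[col].startswith(old):
                row[col] = new + row[col][len(old):]
        writer.writerow(row)
    return out.getvalue()
```

Because only the prefix changes, the same function can map in either direction depending on which workflow a user is targeting.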

Paula is willing and able to fix them. There are 84 batches across ~9 CPG identifiers that would be corrected. Since the majority of CPG-eligible data that the IP processes now goes straight to CPG, I see this as tying up loose ends from transitioning data from our private bucket to the CPG but do not anticipate this issue accumulating again. Thoughts?