cytomining / profiling-handbook

Image-based Profiling Handbook
https://cytomining.github.io/profiling-handbook/
Creative Commons Zero v1.0 Universal
9 stars 8 forks source link

Handbook discrepancy: metadata/<batch> vs metadata/<batch>/platemaps #70

Open bethac07 opened 2 years ago

bethac07 commented 2 years ago

Having the metadata broken out into platemaps and external is still a relatively new thing, and the image analysis team to date has typically only ever dealt with the platemap level metadata. Thus, on AWS, metadata is stored as s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}, as evidenced in this sync command.

BUT, for the recipe, we eventually want it in s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}, as depicted in this tree structure. We currently handle that with just adding that extra platemaps in during the sync command above.

BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.

I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.

shntnu commented 2 years ago

BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.

I didn't quite follow (I bet it is me, not you; your note seems super clear so I'm probably just a bit of out the loop wrt the handbook)

That said, perhaps this would help:

Our cellpainting-gallery tree structure is consistent with the profiling-recipe https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure

Our instructions for data upload say this https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/upload.md#upload-metadata-folder

(i.e. sync metadata as a whole)

So to keep things consistent, I think it is better to change the sync command https://github.com/cytomining/profiling-handbook/blob/2c4dc1ba62ef5141ceb789494d450f6ba14fe05e/05-create-profiles.md?plain=1#L322

to

aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/ metadata/

(i.e. just download the whole thing)

and then modify the handbook as needed so that the platemaps are expected to be found at metadata/platemaps

bethac07 commented 2 years ago

Is it fair to say that your vote then is for the image analysts to stop storing the metadata as workspace/metadata/${BATCH_ID} on AWS, but instead do workspace/metadata/platemaps/${BATCH_ID} ?

Our cellpainting-gallery tree structure is consistent with the profiling-recipe https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure

Right, that's what you want the structure to be on the gallery, but I'm saying that unless someone was careful in doing the uploads (and notably did NOT do the sync you suggest there), they may not be, because historically, we've always on AWS stored metadata in workspace/metadata/${BATCH_ID} rather than workspace/metadata/platemaps/${BATCH_ID}.

Evidence - here is the suggested tree structure at commit 07e7691af9ac6cf3768b3d4529f8b1d84648830a (June 2021)

└── metadata
    └── 2016_04_01_a549_48hr_batch1
        ├── barcode_platemap.csv
        └── platemap
            └── C-7161-01-LM6-006.txt
bethac07 commented 2 years ago

Tracing historically - as of 283558893026c01f168d30fbc85cec4ef5c70222 (June 19 2021), we were still on the old structure. Sometimes in the giant batch of commits on June 22-23, by the end of it (8ffd6c9db4da2a69ddd0f0a4338209c95cc5a825), there are no metadata instructions at all. They get added back in as part of #63 August 2021, but with platemaps now in the path.

bethac07 commented 2 years ago

@shntnu and I chatted and here are the decisions we want to make going forward

Did I miss anything?

rsenft1 commented 2 years ago

Will the profiling recipe be modified to look in metadata/platemaps/batch/platemaps? (I'm assuming this is the only place where that metadata is used).

bethac07 commented 2 years ago

That is already where the recipe looks!

rsenft1 commented 2 years ago

🤯 Oh wait, there's not actually anywhere where it 'looks' for platemaps on s3 right? We just download them to the backends machine and then from there is what's used in generating profiles?

bethac07 commented 2 years ago

Yup