Handbook discrepancy: metadata/<batch> vs metadata/<batch>/platemaps

bethac07 commented 2 years ago

Having the metadata broken out into platemaps and external is still a relatively new thing, and the image analysis team to date has typically only ever dealt with the platemap level metadata. Thus, on AWS, metadata is stored as s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}, as evidenced in this sync command.

BUT, for the recipe, we eventually want it in s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}, as depicted in this tree structure. We currently handle that with just adding that extra platemaps in during the sync command above.

BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.

I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.

shntnu commented 2 years ago

BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.

I didn't quite follow (I bet it is me, not you; your note seems super clear so I'm probably just a bit of out the loop wrt the handbook)

That said, perhaps this would help:

Our cellpainting-gallery tree structure is consistent with the profiling-recipe https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure

Our instructions for data upload say this https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/upload.md#upload-metadata-folder

(i.e. sync metadata as a whole)

So to keep things consistent, I think it is better to change the sync command https://github.com/cytomining/profiling-handbook/blob/2c4dc1ba62ef5141ceb789494d450f6ba14fe05e/05-create-profiles.md?plain=1#L322

to

aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/ metadata/

(i.e. just download the whole thing)

and then modify the handbook as needed so that the platemaps are expected to be found at metadata/platemaps

bethac07 commented 2 years ago

Is it fair to say that your vote then is for the image analysts to stop storing the metadata as workspace/metadata/${BATCH_ID} on AWS, but instead do workspace/metadata/platemaps/${BATCH_ID} ?

Our cellpainting-gallery tree structure is consistent with the profiling-recipe https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure

Right, that's what you want the structure to be on the gallery, but I'm saying that unless someone was careful in doing the uploads (and notably did NOT do the sync you suggest there), they may not be, because historically, we've always on AWS stored metadata in workspace/metadata/${BATCH_ID} rather than workspace/metadata/platemaps/${BATCH_ID}.

Evidence - here is the suggested tree structure at commit 07e7691af9ac6cf3768b3d4529f8b1d84648830a (June 2021)

└── metadata
    └── 2016_04_01_a549_48hr_batch1
        ├── barcode_platemap.csv
        └── platemap
            └── C-7161-01-LM6-006.txt

bethac07 commented 2 years ago

Tracing historically - as of 283558893026c01f168d30fbc85cec4ef5c70222 (June 19 2021), we were still on the old structure. Sometimes in the giant batch of commits on June 22-23, by the end of it (8ffd6c9db4da2a69ddd0f0a4338209c95cc5a825), there are no metadata instructions at all. They get added back in as part of #63 August 2021, but with platemaps now in the path.

bethac07 commented 2 years ago

@shntnu and I chatted and here are the decisions we want to make going forward

New projects should use the new /platemaps containing structure. Old projects can continue to use the old structure, because it's likely better to be parallel (so that if we do want to move them to the new structure, we can do it en masse for the whole project rather than batch-by-batch).
Notes should be added to the handbook in the metadata section and at the metadata sync command [todo: me] and to the gallery sync instructions [todo: someone?] noting that either structure might exist; specifically for the gallery, it should instruct users to either fix it at the source to include platemaps or add it in as part of the sync command
At some point, somebody [todo: ?] should check the projects that already exist in the gallery to see whether they are old structure or new, and fix them if old. While we're still at the state where there are only 18, this is probably only a ~30 minute TODO. We think JUMP-Production is likely fine due to the validation script but is TBD for others.

Did I miss anything?

rsenft1 commented 2 years ago

Will the profiling recipe be modified to look in metadata/platemaps/batch/platemaps? (I'm assuming this is the only place where that metadata is used).

bethac07 commented 2 years ago

That is already where the recipe looks!

rsenft1 commented 2 years ago

🤯 Oh wait, there's not actually anywhere where it 'looks' for platemaps on s3 right? We just download them to the backends machine and then from there is what's used in generating profiles?

bethac07 commented 2 years ago

Yup

cytomining / profiling-handbook

Handbook discrepancy: metadata/<batch> vs metadata/<batch>/platemaps #70