Open bethac07 opened 2 years ago
BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.
I didn't quite follow (I bet it is me, not you; your note seems super clear so I'm probably just a bit of out the loop wrt the handbook)
That said, perhaps this would help:
Our cellpainting-gallery tree structure is consistent with the profiling-recipe https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure
Our instructions for data upload say this https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/upload.md#upload-metadata-folder
(i.e. sync metadata
as a whole)
So to keep things consistent, I think it is better to change the sync command https://github.com/cytomining/profiling-handbook/blob/2c4dc1ba62ef5141ceb789494d450f6ba14fe05e/05-create-profiles.md?plain=1#L322
to
aws s3 sync s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/ metadata/
(i.e. just download the whole thing)
and then modify the handbook as needed so that the platemaps are expected to be found at metadata/platemaps
Is it fair to say that your vote then is for the image analysts to stop storing the metadata as workspace/metadata/${BATCH_ID}
on AWS, but instead do workspace/metadata/platemaps/${BATCH_ID}
?
Our cellpainting-gallery tree structure is consistent with the profiling-recipe https://github.com/broadinstitute/cellpainting-gallery/blob/849abb85939dfb9172a24269cd0c612cb6221730/folder_structure.md#metadata-folder-structure
Right, that's what you want the structure to be on the gallery, but I'm saying that unless someone was careful in doing the uploads (and notably did NOT do the sync you suggest there), they may not be, because historically, we've always on AWS stored metadata in workspace/metadata/${BATCH_ID}
rather than workspace/metadata/platemaps/${BATCH_ID}
.
Evidence - here is the suggested tree structure at commit 07e7691af9ac6cf3768b3d4529f8b1d84648830a (June 2021)
└── metadata
└── 2016_04_01_a549_48hr_batch1
├── barcode_platemap.csv
└── platemap
└── C-7161-01-LM6-006.txt
Tracing historically - as of 283558893026c01f168d30fbc85cec4ef5c70222 (June 19 2021), we were still on the old structure. Sometimes in the giant batch of commits on June 22-23, by the end of it (8ffd6c9db4da2a69ddd0f0a4338209c95cc5a825), there are no metadata instructions at all. They get added back in as part of #63 August 2021, but with platemaps
now in the path.
@shntnu and I chatted and here are the decisions we want to make going forward
/platemaps
containing structure. Old projects can continue to use the old structure, because it's likely better to be parallel (so that if we do want to move them to the new structure, we can do it en masse for the whole project rather than batch-by-batch).platemaps
or add it in as part of the sync commandDid I miss anything?
Will the profiling recipe be modified to look in metadata/platemaps/batch/platemaps? (I'm assuming this is the only place where that metadata is used).
That is already where the recipe looks!
🤯 Oh wait, there's not actually anywhere where it 'looks' for platemaps on s3 right? We just download them to the backends machine and then from there is what's used in generating profiles?
Yup
Having the metadata broken out into platemaps and external is still a relatively new thing, and the image analysis team to date has typically only ever dealt with the platemap level metadata. Thus, on AWS, metadata is stored as
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/${BATCH_ID}
, as evidenced in this sync command.BUT, for the recipe, we eventually want it in
s3://${BUCKET}/projects/${PROJECT_NAME}/workspace/metadata/platemaps/${BATCH_ID}
, as depicted in this tree structure. We currently handle that with just adding that extraplatemaps
in during the sync command above.BUT, that's not what the handbook implies - it implies that it should be uploaded in the tree structure. So we basically have one of a couple of choices - 1) Explain everything I just wrote here in the handbook and show both tree structures there or 2) From now on, change how the analysts structure things and update the sync command, because presumably in places like the gallery in the future (but I don't know that anyone has checked for current bits of metadata over there), we want this new structure.
I'm guessing the preference here is 2, but we should make a decision and update the handbook accordingly.