broadinstitute / cellpainting-gallery

Cell Painting Gallery
https://broadinstitute.github.io/cellpainting-gallery/
MIT License

Create folder_structure.md #2

Closed: shntnu closed 2 years ago

shntnu commented 2 years ago

1

shntnu commented 2 years ago

See https://github.com/jump-cellpainting/aws/issues/70#issuecomment-1093006256 for nomenclature discussion

shntnu commented 2 years ago

Document that extra folders like assaydev, pipelines, or segment can also be included

See https://github.com/jump-cellpainting/aws/issues/62#issuecomment-1120002420

ErinWeisbart commented 2 years ago

@shntnu How are naming for and determined? Do we set it? (And if it's "we", should I say Imaging Platform? C-S lab? CellPainting-Gallery bucket maintainers?)

ErinWeisbart commented 2 years ago

@shntnu I did a pretty thorough overhaul/expansion, so please give it a read through and let me know if there's anything else you had in mind for this document.

A couple things I'd like to clarify:

shntnu commented 2 years ago

I did a pretty thorough overhaul/expansion, so please give it a read through and let me know if there's anything else you had in mind for this document.

Thanks @ErinWeisbart – this is really great!!

  • I'm not totally clear on the vision for what all goes in this bucket, ranging from "anything and everything that might possibly ever be helpful" to "the polished versions one might report in a publication". e.g. would we want to keep pipeline1_v0.cppipe, pipeline1_tryagain.cppipe, etc. or just the final pretty one used to generate the data?

Thanks for raising this Q!

This bucket will be listed at https://registry.opendata.aws/. Ideally, we'd want to make the data as FAIR as possible. But given our limited bandwidth, we will instead rely on our own experience to guess what's best and go with that.

So with that in mind – my thinking here is that we store just the final pretty one that was used to generate the data (increases findability and reusability by reducing clutter).

That said, for the datasets that we make public, we really do not want to maintain two copies of the data: one on s3://cellpainting-gallery and another in our own bucket. So I'd say that we store just the final pretty one that was used to generate the data, but in some (I think, rare) cases, it is ok for us to also store intermediate versions if we think they might be valuable for ourselves (even if they are not valuable for 99% of the users).

  • I wrote it mostly describing our standard arrayed CellPainting protocol, but with some extra vagueness about how it can differ, given that we intend for this bucket to also contain data from non-standard protocols (e.g. PERISCOPE). Do we instead want two documents - tighten this down to the super standard protocol and then write a separate one for when we have a different type of experiment coming in?

No, I think your level of generalization is great! Later, when it's time, you can make a decision on whether you'd rather create an entirely new document (in this repo) for non-standard protocols (e.g. PERISCOPE). My preference would be to have a single document, but you'd know more on how practical that is.

  • Is this mostly for
    1. our group's documentation purposes?
    2. other groups that may upload to the bucket?
    3. people wanting to download data from the bucket to orient them?

Great question 😄 Thank you for penning down the three use cases – that helps structure our thinking here.

My intent is that it is all of the above. Admittedly, one size does not fit all – e.g. the level of detail for 3. is much less than the others. But I think that's fine for now. The document should be able to serve all 3 needs, even if it is too much detail for some use cases. That is, we err on the side of over-documenting, and then simplify later as needed.

What do you think?

ErinWeisbart commented 2 years ago

Thanks for clarifying. I think this document sufficiently captures our current goals. I am 100% behind the approach of over-documenting :) I/we/someone can certainly expand it as we add non-standard projects (e.g. PERISCOPE) and if it isn't sufficient for other groups that want to upload (use case 2). I think it's helpful to have the comprehensive documentation even for people wanting to download from the bucket (use case 3) as they need to know what they're digging through even if they only want to pull out a bit of what's there.

shntnu commented 2 years ago

We’re on the same page 🎊🎉