HumanCellAtlas / dcp-feedback

Please file issues here for feedback, feature requests, bugs, and comments on the Data Coordination Platform
BSD 3-Clause "New" or "Revised" License

Confusion over zarr files in smartseq2 expression matrix download #13

Open adriennes opened 5 years ago

adriennes commented 5 years ago

A participant downloaded matrix files for the pancreas dataset using the CLI, but found that the result was many matrices with 1 cell in each. This flow was not clear to the user.

brianraymor commented 5 years ago

@adriennes - do you have more details about the steps in their process or their script? It sounds like they completely bypassed the data browser (which only shows .zattr files). (I should also note that each SS2 bundle has only one cell by design.)

kbergin commented 5 years ago

It's not surprising that this was found to be confusing. We have considered and spec'd out going to more than one cell per bundle in the past but have not been able to prioritize the change. We have reasons for having it as one cell. I suspect that if it were well documented, or if there were more UX around the CLI or a readme included in a bundle download that explained it, a user who is savvy enough to download directly from the data store may be fine with this structure if we don't decide to group them.

brianraymor commented 5 years ago

Documentation for matrix file formats is tracked in https://github.com/HumanCellAtlas/data-portal-content/issues/180 - but that would not explain the rationale for one cell per bundle.

NoopDog commented 5 years ago

Yes, it would be great to know if one cell per bundle for SS2 is here to stay or if it is going away. I have heard in the past that the one bundle per cell will be fixed soon but obviously we still have this structure.

In addition to being confusing to users, this impacts scaling through the system, from the data browser indexer through to the user. The CLI download, for example, will create a directory per cell on the user's file system (as each cell has a bundle), and the metadata will be duplicated in each directory.
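To see how much metadata is repeated across the per-cell directories, something like the following could be run on a download. This is only a sketch: it assumes nothing about the CLI's actual directory layout or file names, it just hashes every file under a root directory and reports content that appears more than once. The function name `find_duplicate_files` is made up for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicate_files(download_root):
    """Group files under download_root by content hash.

    Returns a dict mapping SHA-256 digest -> sorted list of paths,
    keeping only content that occurs in more than one place.
    The directory layout is not assumed; every file is scanned.
    """
    by_digest = defaultdict(list)
    for path in Path(download_root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    return {d: sorted(paths) for d, paths in by_digest.items() if len(paths) > 1}
```

Each entry in the result is one piece of content plus every location it was written to, so a project-level metadata file copied into every cell directory would show up as a single entry with one path per cell.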

A wrangler told me that SS2 data will be less popular and we will see more 10x instead, so maybe this problem goes away by itself?

Also, it is interesting that we are exposing the user to the concept of bundles. I have frequently heard that a bundle is a storage-system-only concept and should not leak to users.

I would like to see if we can shield the user from bundles or, if not, then fully commit to explaining what is necessary to know.

TimothyTickle commented 5 years ago

Smart-seq2 is used less than 3' assays (10X), but the two give different kinds of information. I think the landscape will continue to have both, just with full-transcript assays (Smart-seq2) being used less.

The reason for the 1 cell per bundle (from my knowledge): It is so the system is real-time (requested by stakeholders). In Smart-seq2 each cell can technically be processed separately, and a pair of fastq files contains only one cell (so they are also organized separately). Because of this, the smallest self-contained amount of data that can go through the system is a cell (for Smart-seq2). As we try to collect cells into larger groupings, the groupings become artificial. For example, a project may take years, and we would not want to hold up delivery of data until a full project is complete (even though the user may want it delivered in the UI this way, they will not be willing to wait years for the data).
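For a user who ends up with one single-cell matrix per bundle, recombining them is conceptually a small step once the values are loaded. The sketch below assumes each cell has already been read into a plain `gene -> value` dict (how to read the actual zarr output is not shown and not assumed), and that a gene missing for a cell should be filled with 0.0, which is an illustrative choice. `combine_cells` is a hypothetical helper name.

```python
def combine_cells(per_cell):
    """Stack per-cell expression values into one cell-by-gene table.

    per_cell: dict mapping cell_id -> dict mapping gene -> value.
    Returns (genes, cell_ids, rows), where rows[i][j] is the value for
    cell_ids[i] and genes[j]. Missing genes are filled with 0.0, which
    is an assumption made for this sketch.
    """
    cell_ids = sorted(per_cell)
    genes = sorted({gene for counts in per_cell.values() for gene in counts})
    rows = [[per_cell[cell].get(gene, 0.0) for gene in genes] for cell in cell_ids]
    return genes, cell_ids, rows
```

Sorting cells and genes keeps the output deterministic regardless of the order in which the per-cell downloads are read.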

A solution I believe we agreed on last year: As we explored this concept, a plate seemed to be a reasonable unit to group things into, given this tends to be how the Smart-seq2 data (which we support) is operationally grouped in the lab (a full plate is sequenced together). This will collect cells into groups of hundreds and help, but it is not a long-term scaling solution for a UI or any other component of the system. The Mint team did some work to support this in code we have on a branch (and I believe others have done some work towards this), but it has been deprioritized for GA.
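The plate grouping described above amounts to a group-by on a plate identifier. A minimal sketch, assuming each cell record carries `cell_id` and `plate_id` fields (both field names are hypothetical and are not taken from the DCP metadata schema):

```python
from collections import defaultdict


def group_by_plate(cells):
    """Group cell ids by the plate they were sequenced on.

    cells: iterable of dicts with hypothetical keys "cell_id" and "plate_id".
    Returns a dict mapping plate_id -> list of cell_ids in input order.
    """
    plates = defaultdict(list)
    for cell in cells:
        plates[cell["plate_id"]].append(cell["cell_id"])
    return dict(plates)
```

Each resulting group could then become one bundle instead of one bundle per cell, at the cost of waiting for the whole plate before anything is delivered.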

Expert users using the DSS should be ok: Anyone expert enough to access the underlying system/data programmatically should be able to manipulate the data to their needs, I believe. Also, I am not sure we want users defining our internal bundle structure.

The UI should not be so strongly coupled to the underlying bundle structure: I am with you on the concern of requiring users to understand bundles. Take the portal as an example. I am naive in this space (admittedly), but on first principles, I would caution against letting the underlying bundle structure dictate the way data is presented to a user. It should support what you need but not mimic your solution. What happens when the bundle structure changes? Does this force the UI to change? Conversely, does this mean the bundle structure can never change once the UI for the portal is developed? Both feel too tightly coupled. As I am saying this I am also hearing difficult trade-offs, complex problems, and a lot of work, so I offer this as first principles and a guiding star.

