HumanCellAtlas / dcp-community

HCA Data Coordination Platform community content
5 stars 18 forks source link

RFC: Bundle Types #86

Closed kislyuk closed 5 years ago

kislyuk commented 5 years ago

This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS.

Last call for community review: Sept 6

brianraymor commented 5 years ago

I'm confused by This RFC depends on #93 and further defines the details of how bundle types will be represented in the DSS. when #93 is not approved. That's like taking a dependency in one standard on another which is in draft. It also states Last call for community review: August 30 - was this announced on dcp slack?

kislyuk commented 5 years ago

The author of #93 has left the project, so I'm going to take over authorship of that RFC.

Announced on the #dcp channel and extended the community review by a week.

NoopDog commented 5 years ago

It seems to me that Mallory raises legitimate concerns of duplicating information that is in the metadata in the bundle type schema. I understand that it can be useful to denormalize/duplicate for efficiency and ease of querying but perhaps we should instead try to make the metadata in the bundles more queryable rather than duplicating metadata and the metadata schema in the this bundle type schema.

I feel we might have more flexibility for the future if we kept to a model based on the elegant process/protocol concepts that already exist in the metadata schema.

Rather than explicitly defining new bundle types we could, for example, have a single bundle type, "process" that is an instance of a protocol that is defined in the metadata standard with all the benefits of the metadata type system. Then everything becomes a DAG of processes with the outputs of one process being the inputs of another, downstream process.

We would have to create some additional processes to reflect some DCP specific activities such as project creation, and more accurately model the library prep/run relationship for example but in the end we would be able to query by process types specified by the metadata schema instead of querying by a 2nd type system (the bundle type system) and having to map between the two systems.

The main idea here is that an "primary sequence bundle" is the output of a sequencing process (an instance of a sequencing protocol).

A "secondary analysis bundle" is the output of a secondary analysis process (instance of a secondary analysis protocol with a specific subtype "Optimus Prime Version x.xx").

For example if you have a process defined (roughly ) as:

PROCESS BUNDLE

process_instance_fquid
protocol_fqid
input_files[] (references , if any)
output_files[] (the actual files)
process_parameters{}
biomaterial_metadata (donor, collection, handling, prep)

This is similar to what we have now and what is proposed, but avoids creating a bundle type schema and lets one subscribe to notifications on protocol_fqid. (or modify to include the protocol name for humans)

This also allows arbitrary "data groups" or "data sets" by defining a "grouping" protocol and creating a "Process Bundle" with a grouping protocol fqid and a list of input file references.

TLDR: Keep bundles semantics free, model analysis as a process/protocol, query by process type instead of bundle type.

kislyuk commented 5 years ago

@NoopDog in an abstract DSS operating in a vacuum, what you propose makes sense. The problem is that operating without bundle types makes it eternally unclear what types of bundles applications should care about. In your example, how does an application implementer figure out which bundles to get?

In my opinion, after designing and operating Query Service, which attempts to make raw metadata directly queryable, making process/protocol metadata the sole source of information without allowing services to de-normalize it is a detriment to the usability of the system. The process/protocol graph requires sophisticated graph querying techniques to extract information from. It is an over-complexified information representation method for our experimental data.

diekhans commented 5 years ago

It is unclear why there are two RFCs related to bundle (#86 and #93).

They are not clearly delineated into covering different aspects of bundle types. I believe we would be better off merging these two.

kislyuk commented 5 years ago

@diekhans @barkasn #93 addresses the design while #86 addresses implementation details. In my estimation, they can proceed separately, but I don't mind them being merged/unified.

I don't have the bandwidth to unify the two RFCs right now. Can someone else help me with the unification process?

diekhans commented 5 years ago

The authors of the two RFCs are the same, so if none of the authors have the bandwidth to work on them and they are interdependent. If #86 is the implementation of #93, then it #86 will be blocked by #93.

Really, I think the fastest short term approach also has the best long term value, which is to move the relevant part of #93 into #86 and then drop #93.

kislyuk commented 5 years ago

Great, thanks @diekhans. When you have some time, please go ahead and merge #93 into this RFC and close #93.

diekhans commented 5 years ago

I am not suggesting I have the bandwidth. Only that the two RFCs have the same authors and they could save themselves and the reviews time by combining them.

barkasn commented 5 years ago

@kislyuk I won't be able to help with the unification process either. I don't want to block you from proceeding with your work on this RFC so I'll mark as approved, however I see a number of problems here that I think should be resolved and I would suggest you approach the respective POs and TL:

Finally, I would like to point out that the rfc approval level tag needs to be updated

kislyuk commented 5 years ago

It feels like this RFC simply formalizes de facto uses of bundles as formal bundles. If that is the intent, the summary should be updated.

Updated the summary to include "de facto".

Currently, the goal from the summary is "Formalizes data bundle types, definitions, target users/consumers, and specific use cases to improve clarity and confidence among DCP developers when working with the DCP data model.", and it feels like there is more information needed to hit that mark. What is missing for me is why are the bundles in the RFC correct? What criteria define a bundle type, how do we know when we need a new bundle, and when that occurs how is it defined? I do not understand from this document, for instance, why we would have a secondary analysis bundle and an expression matrix bundle (not to argue they should not exist but "why" is missing). If these are all correct, what specific need do each address? As I read the summary, it alludes to potentially target users and use cases may be the key to understanding bundles but that reasoning is not shown. If this is the belief of the author, those driving factors would hopefully be distilled into rules that help better define bundles and help us understand when we need more or when we reuse what we have. Without this, I do not believe I would have confidence in applying them to new problems (beyond of course just doing status quota, which does not seem to be the intent).

Added sections on this under Detailed Design, please take a look.

I must say I don't know how to specifically/precisely answer your question about why the bundle types are "correct". To my estimation and as you mentioned, the contents of this RFC reflect the de facto data type hierarchy that we're handling today. I tried to convey an operational justification (applications need to pattern match data for their input) and a heuristic for when new types are needed.

kislyuk commented 5 years ago

This RFC should start by defining what a bundle is, both functionally and as an architectural concept. This is something that is confusing to both the DCP developers and consumers. We have approved RFC 0010 that states "... the meaning of bundle is not stable or clearly defined at the moment." This RFC is the perfect opportunity to remedy this ongoing issue.

I added a section under Detailed Design, What is a bundle, which attempts to provide an operational definition. Please take a look.

Several entries in initial contents of the registry are speculative. We should only include in the actual registry when they are defined by other RFC. These are good as examples.

To my knowledge, all of the entries in the table either reflect data types in real bundles or reflect data types in outputs of applications whose operators would like to deposit data into the DCP, but currently cannot because the DCP hasn't given them specific guidance on how to do so. If you have specific objections or corrections, can you please point them out and suggest alternatives?

The actual implementation of the registry should either be defined or indicated as an unresolved issue. The process for updating the registry should be defined. Does it require an RFC or simply a pull request adding a new type?

Added a paragraph under Registry of bundle types, does that level of detail look sufficient? I went with the simpler option that you propose.

This RFC is dependent on a document that appears to be a proposed RFC: "Request for Comments: Media Type usage in the Data Coordination Platform" and seems to be stalled. This needs to be made into an RFC. It seems well written and this should be straight-forward. This can be made non-blocking to this RFC by making it an unresolved issue and creating a ticket.

That document predates the formal DCP RFC process and as such is not subject to the requirements of our RFC process (just like other external documents cited by our other RFCs). It is not "stalled" - instead, it was agreed upon and implemented.

However, for completeness and to avoid confusion, I imported it into Markdown and opened a PR here: https://github.com/HumanCellAtlas/dcp-community/pull/113.

The relationship of the "Unresolved Questions" items to this RFC is not clear.

Thank you, that was an editing oversight. Removed those questions and added a couple to reflect the discussion above.

kislyuk commented 5 years ago

Everyone, thank you for the productive comments here. Due to numerous requests, I carved out some time and merged the contents of #86 into #93. I'm closing this PR, let's pick up the discussion in #93. Please quote relevant snippets of comments when discussing. Thank you!