SynBioDex / SEPs

SBOL Enhancement Proposals

SEP 054 -- Managing genetic design packages #112

Closed jakebeal closed 1 year ago

jakebeal commented 3 years ago

This SEP proposes a set of practices for managing collections of genetic designs (or other information encoded in SBOL). These practices approach the sharing of SBOL information in a similar way to software package managers. These practices are intended to support similar management of shared SBOL information across multiple platforms and repositories.

Full draft in: https://github.com/SynBioDex/SEPs/blob/master/sep_054.md

Gonza10V commented 3 years ago

I really wanted a discussion like this.

First, in 2.1 Terminology, under Dissociated Package: "Every Dissociated Package SHOULD be a Conversion Package." I think the opposite also makes sense: every Conversion Package SHOULD be a Dissociated Package. Can a Native Package be so big (or meet some other condition) that it needs to be used as a Dissociated Package? This is explained later in 2.2 Representation, under Package: "A package SHOULD have dissociated set only if conversion is also true". The logic could then be more explicit: if Conversion Package, then (should be) Dissociated Package.

Second, I think that this is along the same lines as the new iGEM distro. Is it designed to be compatible with the new iGEM distro? Is it suggested that parts should not contain flanking assembly scars, since those will be managed and added during Build Artifact construction? This should also allow the use of parts with different assemblies, reducing the number of sibling parts derived from domestication in different formats.

Third, what is the actual way to do this? Is this an alternative to using PartShop in pySBOL2 (which pySBOL3 lacks) or a SynBioHub API that provides designs?

Finally, thanks for the SEP. It points to a really needed discussion: how can we reuse designs in an effective way? I like the software engineering abstraction for SynBio, where our designs are code managed by tools similar to GitHub.

jamesscottbrown commented 3 years ago

I need to read through this again more carefully, but have a few initial comments.

Terminology

Module

I'm not sure about re-using "module", since this term was used in relation to SBOL with a very different meaning until the Component/Module merge in SBOL 3.

Also, both names "module" and "package" suggest a grouping of things, but there is no obvious ordering between the two, and their relationship is different depending on the language used: in Python, modules are grouped into packages; in Go, a module is a collection of packages that are versioned and released together; in npm, packages and modules are both directories, but by definition a package is a directory that contains a package.json file and a module is a directory that can be loaded by Node.js.

As an alternative to "module", how about "SBOL definition document" (or just definitionDocument)?

Build artifact

I'm also not a fan of "Build Artifact" being defined as "a document derived by assembling a set of SBOL TopLevel objects...", as this redefines a general term with a much more restricted meaning. How about "built SBOL document"?

Versioning

Versions SHOULD use SemVer format

What does this mean? For example, removing or renaming something obviously breaks backwards compatibility, but what about a change to sequence that might affect the functioning of some circuits using a part?

Shared package catalog

Restricting the contents of sip to exclude "arbitrary collections of designs without a clear engineering function" sounds reasonable, but no mechanism is described for achieving this. Who assesses this, and how would disagreements be resolved?

A public catalog SHOULD be maintained in a version control system such as git.

In existing package managers, developers typically use version control systems and pull requests during development, but then publish using a separate cli tool that uploads a snapshot of the current state of the code to the package manager. The package manager is unaware of the full history of a package: it essentially records a bunch of directory snapshots, each accompanied by a version number and a change-log message, and allows the download of any uploaded version. Packages have owners, and a new version of a package can only be uploaded by the appropriate people.

Using Git to manage the catalog might be simpler than writing software to manage the collection, but it needs clear social/administrative conventions about who is permitted to make changes to particular modules, and who is permitted to merge PRs and under what circumstances.

It might also be difficult to retrieve specific versions of specific packages, unless there is some kind of index recording which git commit corresponds to each version of each package. Git tags (e.g., [package_name] v[version_number]) could be used, but with many packages in the same repository this might become messy.

A local sip installation can then clone the catalog in order to begin working with it, and can update the catalog by pulling updates at the beginning of each sip command execution.

This would work if the repository is small, but may be unworkable if the catalog becomes large. Having an actual catalog server program would allow the download of a specific version of specific packages as zip files, which would be more efficient.

jakebeal commented 3 years ago

@Gonza10V

I think the opposite also makes sense: every Conversion Package SHOULD be a Dissociated Package.

I disagree. For example, the FreeGenes OpenYeast Collection is one of my motivating examples for a conversion package. It is available in two non-SBOL formats, as a collection of GenBank files and as a single FASTA-like custom CSV. Either of these could potentially be published as a build artifact for an atomic conversion package. We also plan to translate the whole package into SBOL3 format, which will produce a pre-translated build artifact.

Can a Native Package be so big (or meet some other condition) that it needs to be used as a Dissociated Package?

Since we can define what is appropriate for native packages, we can assert that they SHOULD NOT be made so large. I think that this is appropriate: if your code base is getting huge, then you need to split it into more reasonably sized modules.

If we could have no dissociated packages at all, that would be best, but I think we need something like this to support the import of materials from non-SBOL mega-databases like NCBI into more "code-like" design environments.

Second, I think that this is along the same lines as the new iGEM distro. [snip]

One of the key motivators for me is to generalize what we've done with the iGEM distribution for use in other repositories, both in the distribution project (e.g., Open Yeast) and in other unrelated projects that I'm working on.

Is it suggested that parts should not contain flanking assembly scars, since those will be managed and added during Build Artifact construction?

At the moment, the document is intentionally silent on the question of assembly, in order to support many potential different approaches to the question. I would suggest, however, that this aspect of design cannot be hidden entirely in build artifacts. In the iGEM distribution we may indeed end up automating this, but I think the automation will appear during the translation from Excel sheet to module, rather than during the collation of module materials into packages.

Third, what is the actual way to do this? Is this an alternative to using PartShop in pySBOL2 (which pySBOL3 lacks) or a SynBioHub API that provides designs?

I believe that the suggestions in Section 2.3 should be applicable to both git repositories and SynBioHub.

jakebeal commented 3 years ago

@jamesscottbrown

[Terminology: Module, Package, Build artifact]

My current "module" and "package" terminology is lifted directly from Python, and "build artifact" isn't something I'm invested in either. I'm entirely receptive to changing terminology as long as we end up with a coherent and easy-to-explain set of terms.

Versions SHOULD use SemVer format

What does this mean? For example, removing or renaming something obviously breaks backwards compatibility, but what about a change to sequence that might affect the functioning of some circuits using a part?

Changing the sequence of a part would generally be a major version change, since there is no guarantee that the part will continue to have the same behavior. There are certain circumstances where it would not be, such as correcting a bug in which a sequence was not listed correctly to begin with. If you want to add a sequence that is expected to have better performance, add a new part and deprecate the old one.
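As a rough illustration of the versioning guidance above, here is a minimal Python sketch. The `Change` categories and the `bump` function are hypothetical names, not part of SEP 054, and treating "a new part was added" as a minor bump follows general SemVer convention rather than anything stated in this thread.

```python
from enum import Enum


class Change(Enum):
    """Hypothetical change categories for a package release."""
    SEQUENCE_CHANGED = "sequence_changed"    # behavior is no longer guaranteed
    PART_REMOVED_OR_RENAMED = "part_removed" # breaks existing references
    PART_ADDED = "part_added"                # e.g. a new part next to a deprecated one
    ERROR_CORRECTED = "error_corrected"      # e.g. a sequence was listed incorrectly


def bump(version: str, change: Change) -> str:
    """Return the next SemVer string for the given kind of change."""
    major, minor, patch = (int(x) for x in version.split("."))
    if change in (Change.SEQUENCE_CHANGED, Change.PART_REMOVED_OR_RENAMED):
        return f"{major + 1}.0.0"
    if change is Change.PART_ADDED:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Under these assumptions, `bump("1.2.3", Change.SEQUENCE_CHANGED)` gives a new major version, while correcting a mis-listed sequence only increments the patch number.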

Restricting the contents of sip to exclude "arbitrary collections of designs without a clear engineering function" sounds reasonable, but no mechanism is described for achieving this. Who assesses this, and how would disagreements be resolved?

This is left intentionally undefined and is a SHOULD rather than a MUST. The idea is to provide people guidance that will help them make useful and durable packages, not to act as a gatekeeper.

I do not believe that existing package managers attempt to assess the quality of the packages submitted to them either (other than to prevent abuse). You can publish a piece of complete rubbish on PyPI or the Maven Central Repository, and it will simply be ignored. But there are suggestions for what is likely to make a package more valuable to others.

A public catalog SHOULD be maintained in a version control system such as git. [snip] Using Git to manage the catalog might be simpler than writing software to manage the collection, but it needs clear social/administrative conventions about who is permitted to make changes to particular modules, and who is permitted to merge PRs and under what circumstances.

I agree with everything that you have said, which is why this is written with little constraint on actual implementation. My aim is to bootstrap with an ad hoc and informal administration using git in the experimental phase, then adjust the recommendation and implementation over time as we see how things progress in practice.

It might also be difficult to retrieve specific versions of specific packages, unless there is some kind of index recording which git commit corresponds to each version of each package. Git tags (e.g., [package_name] v[version_number]) could be used, but with many packages in the same repository this might become messy.

As written, versions of the catalog do not correspond to versions of the packages. The current release of the catalog should always contain package information for all accessible package releases. So I should never need to look into the history of the catalog in order to find information about a package. Remember, the catalog doesn't actually store the package contents, it just stores the Package objects that give information about what those contents are and where to retrieve them from.

[cloning the catalog] would work if the repository is small, but may be unworkable if the catalog becomes large. Having an actual catalog server program would allow the download of a specific version of specific packages as zip files, which would be more efficient.

Again, remember that the catalog doesn't actually store the package contents, just the Package objects that describe them. Right now, the catalog will end up storing things like pointers to GitHub build artifacts from other repositories, or like SynBioHub links to existing collections.

Making a server or other similar sort of archive to store the binaries is not explicitly specified, but is permitted by the specification: it would just be another set of Attachment values that are heuristically preferred.
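To make the "catalog stores descriptions, not contents" point concrete, here is a minimal Python sketch. `PackageEntry` and `find_release` are hypothetical names for illustration only, and the attachment URLs are placeholders; the SEP's actual Package object has more structure than this.

```python
from dataclasses import dataclass, field


@dataclass
class PackageEntry:
    """Hypothetical catalog record: describes a release, does not hold its contents."""
    name: str
    version: str
    attachments: list[str] = field(default_factory=list)  # where the contents can be fetched


def find_release(catalog: list[PackageEntry], name: str, version: str) -> PackageEntry:
    """Look up a release in the current catalog; no catalog history is consulted."""
    for entry in catalog:
        if entry.name == name and entry.version == version:
            return entry
    raise KeyError(f"{name} {version} is not in the catalog")


# The current catalog lists every accessible release side by side.
catalog = [
    PackageEntry("promoters", "1.0", ["https://example.org/promoters-1.0.nt"]),
    PackageEntry("promoters", "2.0", ["https://example.org/promoters-2.0.nt"]),
]
```

Because every accessible release appears in the latest catalog, finding an old version is a plain lookup rather than a search through git history.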

jamesscottbrown commented 3 years ago

Changing the sequence of a part would generally be a major version change, since there is no guarantee that the part will continue to have the same behavior...

I agree; guidance along these lines should probably be included in the spec.

This is left intentionally undefined and is a SHOULD rather than a MUST. The idea is to provide people guidance that will help them make useful and durable packages, not to act as a gatekeeper.

I had interpreted the initial statement as applying to the catalog creators/operators (telling them what they SHOULD keep out by some unspecified gatekeeping process), rather than applying to submitters (telling them what they SHOULD NOT submit); this is apparently not what you meant.

It might be clearer to re-phrase as "Arbitrary collections of designs without a clear engineering function SHOULD NOT be submitted to the catalog".

Remember, the catalog doesn't actually store the package contents, it just stores the Package objects that give information about what those contents are and where to retrieve them from.

Sorry - I had overlooked this in my initial reading, so some of my comments are confused.

However, it would be sufficient for the sip tool to work with the latest version of the catalog document, so it could just download it (rather than using git clone and git pull, as suggested by "A local sip installation can then clone the catalog in order to begin working with it, and can update the catalog by pulling updates at the beginning of each sip command execution", which would also get a local copy of the history).

jakebeal commented 3 years ago

I've updated the SEP based on these suggestions, plus some minor wording issues I caught.

cjmyers commented 3 years ago

We've been doing similar things using SBOL Collections, but I agree that it would be good to better articulate how this should be done as best practice. There is a lot to absorb here, so it may be good to use a future SBOL3 meeting to go through this in more detail. A presentation of the idea would help. One concern I have is reusing Module so soon after it was removed from SBOL. It may cause some confusion.

jakebeal commented 3 years ago

@cjmyers This SEP is indeed explicitly intended to be compatible with what you have been doing with Collections in SynBioHub: it would end up identifying a subset of such collections as atomic packages that are good to systematize in a catalog for building upon.

As noted above to @jamesscottbrown regarding module vs. Module: my names were lifted directly from Python and I'm open to alternatives.

jakebeal commented 3 years ago

I've pushed some updates adding notes as I've worked towards an implementation.

jakebeal commented 2 years ago

Per discussion on the SBOL3 call today, I've set up a pull request that removes Module, folding its usage into Package: https://github.com/SynBioDex/SEPs/pull/115

I did not remove the distinction between member and hasModule (now hasPackage), however: as I was editing, I realized that there is an important distinction in the validation rules that need to be applied to them. Every member must have the same namespace as the Package, but every hasModule must have the namespace of the Package as a strict prefix to its own namespace. I believe that it is appropriate to manage this distinction using two different properties, rather than making a validation rule that applies differently depending on the type of the member.
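The two validation rules described above could be sketched as follows. This is only an illustration: the function names are hypothetical, and normalizing trailing slashes and comparing on whole path segments are assumptions not stated in the thread.

```python
def member_ok(package_ns: str, member_ns: str) -> bool:
    """A member must have exactly the same namespace as the Package."""
    return member_ns.rstrip("/") == package_ns.rstrip("/")


def subpackage_ok(package_ns: str, sub_ns: str) -> bool:
    """A sub-package's namespace must have the Package namespace as a strict prefix."""
    parent = package_ns.rstrip("/") + "/"
    child = sub_ns.rstrip("/") + "/"
    # Strict prefix: the child starts with the parent but is not equal to it.
    return child.startswith(parent) and child != parent
```

Keeping the two checks separate mirrors the argument for two properties: each property gets one rule, rather than one rule that branches on the type of the referenced object.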

cjmyers commented 2 years ago

This seems reasonable.

PrashantVaidyanathan commented 2 years ago

Thanks everyone. Is this ready for a vote now?

jakebeal commented 2 years ago

@PrashantVaidyanathan I think this needs one more week, since there are additional iGEM folks who plan to look at this.

cjmyers commented 2 years ago

Before moving this forward, I would really like to hear from some of the community RDF experts such as @goksel or @udp.

After discussing this with Jake recently, it is clear that this SEP represents a fairly significant shift in how objects are found using SBOL. My understanding of the proposal is that what is returned when a URI is dereferenced will depend upon the Package objects within a document. This means that these new objects would not be an optional new feature, but they would become required to ensure that one gets a consistent object returned when dereferencing a URI. If it is possible for some software to not use this feature, it would be useful to see how this would work.

I'm also concerned that this is moving us away from standard RDF practice. This would make SBOL more document centric rather than simply being collections of triples. In particular, when considering a top-level object, it would no longer be possible to consider it in isolation. Rather, you would need to explore other parts of the document to find the Package objects to know how to dereference URIs in this object. This means that TopLevel objects are no longer truly independent standalone objects. This is a significant change, so I would really like to understand if this is consistent with RDF.

Finally, this change makes SBOL more like code that is stored in GitHub repositories rather than knowledge stored in RDF triplestores. While a view of a version can be stored in a triplestore, having multiple different versions of an object concurrently stored in a triplestore does not seem possible. This would mean that the information would be stored in GitHub (or some other version controlled system), and views or versions would need to be copied into triplestores for RDF tooling to have access to them. Right now, I cannot think of a way that the triplestore could present anything other than one (likely latest) version.

The concerns I'm raising are not to say that this is not a good idea that may solve other serious issues that some are having in working with SBOL. I would, though, prefer not to vote on this until there is clarity about all the ramifications that this will have on software development and support for SBOL3, and more voices are heard on the pros and cons of this approach.

jakebeal commented 2 years ago

I will concur with what @cjmyers has written here --- I think this generally gets the implications right.

In answer to this question:

If it is possible for some software to not use this feature, it would be useful to see how this would work.

There are at least two ways to operate without any awareness of packages.

1) Any document that doesn't embrace the package system can still refer directly to any URL. You only need to be package-aware if you're going to refer to data that's being distributed in packages.

2) If desired, a version-controlled SBOL document with URLs can be automatically reified into a "timeless" view in which snapshot images of the RDF at each version are turned into triples by injecting a version number into the URL at the junction of the namespace and local segments. Thus "http://example.com/promoters/J23101" could be split out into "http://example.com/promoters/1.0/J23101" and "http://example.com/promoters/1.1/J23101" and "http://example.com/promoters/2.0/J23101", etc. These can all then be loaded into a triple-store to provide a version-enabled view of the document. This is very similar to how versions work in SBOL2, but applies to the whole snapshot rather than link by link. Note also that this isn't dependent on this proposal, but can be done already.
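The URL rewriting in option 2 can be sketched in a few lines of Python. The function name `versioned_uri` is hypothetical, and the sketch assumes the caller already knows the namespace of the object.

```python
def versioned_uri(uri: str, namespace: str, version: str) -> str:
    """Inject a version between the namespace and the local part of a URI."""
    ns = namespace.rstrip("/")
    if not uri.startswith(ns + "/"):
        raise ValueError(f"{uri} is not in namespace {namespace}")
    local = uri[len(ns) + 1:]  # everything after the namespace and its slash
    return f"{ns}/{version}/{local}"


# One dynamic URI fans out into one static URI per released version.
snapshots = [
    versioned_uri("http://example.com/promoters/J23101", "http://example.com/promoters", v)
    for v in ("1.0", "1.1", "2.0")
]
```

Each resulting URI names a distinct snapshot, so all of them can coexist in a single triple-store.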

It may also be desirable and possible to have the package store offer services similar to identifiers.org or purl.org, where in addition to serving packages, it could also support direct retrieval of objects by "package-naive" systems via URL redirection.

cjmyers commented 2 years ago

@jakebeal thanks for the explanation. It appears to me that this proposal is actually attempting to make the idea of persistent identities from SBOL2 work the way they were intended. In fact, I would argue that we should consider adding back the fields of persistentIdentity and version, but couple them with this proposal. I think this would get us the best of both worlds.

My understanding of the motivation of this proposal is that we would like the ability to create packages that provide a specific version of a set of objects. This would allow you to have references to objects by a "Persistent Identity" that returns a specific version of an object as specified within a package. The version though is for a group of objects and not object-by-object like in SBOL2. For example,

Package http://example.com/myPackages/promoters/1.0/promoters_collection
    version: 1.0
    persistentIdentity: http://example.com/myPackages/promoters/promoters_collection
    hasNamespace: http://example.com/myPackages/promoters/1.0/
    member: http://example.com/myPackages/promoters/1.0/prom1
    member: http://example.com/myPackage/promoters/1.0/prom2

Package http://example.com/myPackages/promoters/2.0/promoters_collection
    version: 2.0
    persistentIdentity: http://example.com/myPackages/promoters/promoters_collection
    hasNamespace: http://example.com/myPackages/promoters/2.0/
    member: http://example.com/myPackage/promoters/2.0/prom1
    member: http://example.com/myPackage/promoters/2.0/prom2

Component http://example.com/myPackage/promoters/1.0/prom1
    version: 1.0
    hasNamespace: http://example.com/myPackages/promoters/1.0/
    persistentIdentity: http://example.com/myPackage/promoters/prom1

Component http://example.com/myPackage/promoters/2.0/prom1
    version: 2.0
    hasNamespace: http://example.com/myPackages/promoters/2.0/
    persistentIdentity: http://example.com/myPackages/promoters/prom1

...

Component http://example.com/myDevices/1.0/device1
    subComponent
        instanceOf: http://example.com/myPackage/promoters/prom1

In order to fetch the subComponent, you would need to know which package to fetch it from. So, when you use these persistent identity references, you must provide a package to use (or perhaps default to latest version).

I think the idea of packages can make persistent identities more coherent and easier to use. They would group objects all with the same version. They would enable persistent identity references to be well grounded within a package. Finally, the example I presented above would not require the use of packages, so long as all your references were to specific versions of objects.

All of this would work with existing triplestores, including SynBioHub. In fact, this functionality is pretty close to how submissions work in SynBioHub. Persistent identity fetching was also already supported though not as elegantly as packages would allow.

jakebeal commented 2 years ago

I disagree with bringing persistent identities back. The problem that I have experienced when trying to use persistent identities is that the object was always the wrong granularity. Instead, I always needed the granularity to be at the level of an interdependent system of objects, i.e., a package.

cjmyers commented 2 years ago

I don't think you quite see the purpose of persistent identity. Its use is twofold. First, it allows you to group objects that are simply different versions of the same object. Yes, if you assume that the version is in a particular place in the URI, then you can get to the same place. However, it makes things less explicit, and it makes it harder to implement in a search. Second, it allows you to reference an object without specifying what version you want. In SBOL2, the semantics were to return the latest version. However, with the advent of packages, it can be used to say "get me the version in the specified package".

Do you agree with my example above other than the addition of persistent identities and version? If so, this addresses the main aspects of my concern. Using assumptions on the URIs to avoid adding back persistent identity and version is possible, but there have been concerns raised by others in the past about assuming too much about our URIs.

jakebeal commented 2 years ago

I'm finding your example a bit hard to interpret because there are no namespaces declared and the URLs aren't following the recommendations in the SEP. Namespaces are critical to the approach precisely because they provide structure to the URI. Would you be able to rework your example with namespaces added?

jakebeal commented 2 years ago

I have added a pull request regarding the versioning of packages: https://github.com/SynBioDex/SEPs/pull/118

I will be returning to @cjmyers's comments as well soon.

cjmyers commented 2 years ago

@jakebeal I've edited my example to include namespaces, and I think to meet the requirements on package URIs.

jamesamcl commented 2 years ago

Hi, sorry for the late reply. Has a vote been scheduled? I would like to read this over but there is a lot to get through.

jakebeal commented 2 years ago

@udp No vote is yet scheduled, particularly since there are two pull requests pending based on recent discussions.

To that end: @cjmyers, @goksel I have added a pull request addressing the questions that came up about how to perform direct URI retrievals like in @cjmyers's example above. The key idea here is that when a versioned snapshot is generated, it needs to rewrite all of its outgoing dependency links to point to versioned snapshots as well.

cjmyers commented 2 years ago

The recent PR does appear to help address some of my concerns. However, it is not clear if you are agreeing completely with my example. In particular, for this part of my example:

Component http://example.com/myPackage/promoters/1.0/prom1
    version: 1.0
    hasNamespace: http://example.com/myPackages/promoters/1.0/
    persistentIdentity: http://example.com/myPackage/promoters/prom1

I reintroduce version and persistentIdentity. If we do not have version and persistentIdentity fields, how do we know that this is a version of prom1? My guess is that you are making assumptions about the format of the URI, in particular that the field preceding the displayId is a version. However, the version is optional. How do we know whether it is a version or part of the namespace? Hmm, I may have just answered my own question. Should this be:

Component http://example.com/myPackage/promoters/1.0/prom1
    displayId: prom1
    hasNamespace: http://example.com/myPackages/promoters/

In this case, anything between the namespace and displayId is the version?

If this is correct, is there anything else about my example that would not be consistent with this SEP?

jakebeal commented 2 years ago

In my view, if you aren't package-aware, you don't get to reason about version relationships, because you are opting out of the place that we are actually storing all of this information. Instead, you need to deal with the static exports and shouldn't attempt to decompose the URLs to pry out version information. If you want version information, you talk to the package rather than trying to kludge around the package.

So if you're not planning to be version-aware, then you fetch the object using the static URI http://example.com/myPackage/promoters/1.0/prom1, with a hasNamespace of http://example.com/myPackages/promoters/1.0.

If you are version-aware, then you fetch the package http://example.com/myPackage/promoters/package or http://example.com/myPackage/promoters/1.0/package, in either case finding a version of 1.0 and a hasNamespace of http://example.com/myPackages/promoters.

Continuing the version-aware example, inside the package you find a copy of http://example.com/myPackage/promoters/prom1, with a hasNamespace of http://example.com/myPackage/promoters. Given this object and the package information, you can, if you wish, prepare a static export of http://example.com/myPackage/promoters/1.0/prom1, with a hasNamespace of http://example.com/myPackages/promoters/1.0.

cjmyers commented 2 years ago

This remains problematic since a non-package aware system may receive a Component that points to objects by their "persistent identity". In my example:

Component http://example.com/myDevices/1.0/device1
    subComponent
        instanceOf: http://example.com/myPackage/promoters/prom1

How is this dereferenced? If you are package aware, then you get the version from the package you are using. If you are not package aware, then what? My suggestion is that it should fall back to returning the most recent version (i.e., the old persistent identity approach). However, this requires the system being queried to be able to determine what the most recent version is. This brings us back to either assumptions about URI structure OR restoration of persistentIdentity.

jakebeal commented 2 years ago

This would indeed be a problem, which is why the example that you give cannot happen following the proposal in the pull request.

If you retrieve a Component by the static link http://example.com/myDevices/1.0/device1 then the subComponent cannot have its instanceOf be the dynamic http://example.com/myPackages/promoters/prom1. Instead, the export process MUST also rewrite these outgoing links to be static, e.g., http://example.com/myPackages/promoters/1.3/prom1. The specific export selected is determined by dependency resolution at the time of the static export.
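A minimal Python sketch of that link rewriting follows. It assumes dependency resolution has already produced a map from package namespace to the selected version; `to_static` and `resolved` are hypothetical names, and the longest-namespace-first matching is an assumption made here to handle nested namespaces.

```python
def to_static(uri: str, resolved: dict[str, str]) -> str:
    """Rewrite a dynamic reference into the static URI chosen by dependency resolution.

    `resolved` maps a package namespace to the version selected for this export.
    """
    # Try longer namespaces first so a nested package wins over its parent.
    for ns in sorted(resolved, key=len, reverse=True):
        prefix = ns.rstrip("/") + "/"
        if uri.startswith(prefix):
            return f"{prefix}{resolved[ns]}/{uri[len(prefix):]}"
    raise ValueError(f"no resolved package for {uri}")


resolved = {"http://example.com/myPackages/promoters": "1.3"}
static_ref = to_static("http://example.com/myPackages/promoters/prom1", resolved)
```

With this resolution map, the dynamic reference to prom1 becomes the 1.3 snapshot, matching the example in the comment above.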

cjmyers commented 2 years ago

I'm not sure how this cannot happen. If you download an SBOL document with a Package entity in it and then import that file into a tool like SBOLCanvas that is not aware of packages, the tool will not be able to dereference "dynamic" URIs. If files exist in the wild with dynamic URIs, I don't see how you can prevent software from trying to load them.

jakebeal commented 2 years ago

I don't think that you are understanding the idea here: in the proposed mechanism, both static and dynamic URIs dereference, but they do not mix in a package. If you download materials from a package, you either get all static URIs or all dynamic URIs - not a mix like you proposed above.

Once the materials are in the hands of a package-unaware tool, they can be edited and mixed and matched in any arbitrary manner, of course. But that's fine, they aren't part of a published package, so it's no more of a problem than when somebody adds invalid URIs when working in a tool right now.

Just like right now issues with URIs get checked and cleaned up during publication to SynBioHub, if somebody tries to publish a package that mixes static and dynamic URIs, that's the point where they have to encounter package-aware software and their prospective package would fail validation and need to get cleaned up.

Bottom line: materials generated in package-unaware software will be valid SBOL, but may not be valid packages.
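As an illustration of the kind of check a package-aware validator might run at publication time, here is a simplified Python sketch. The function names are hypothetical, and it assumes a single namespace whose released versions are known from Package information.

```python
def is_static(uri: str, namespace: str, versions: set[str]) -> bool:
    """A reference is static if a known version follows the namespace."""
    prefix = namespace.rstrip("/") + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"{uri} is not in namespace {namespace}")
    first_segment = uri[len(prefix):].split("/", 1)[0]
    return first_segment in versions


def mixes_static_and_dynamic(uris: list[str], namespace: str, versions: set[str]) -> bool:
    """Return True if the references contain both static and dynamic URIs."""
    kinds = {is_static(u, namespace, versions) for u in uris}
    return len(kinds) > 1
```

A prospective package for which `mixes_static_and_dynamic` returns True would fail validation under this sketch and need cleaning up before publication.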

cjmyers commented 2 years ago

Let's assume I have an SBOL file with dynamic URIs in it. This is not a complete SBOL file, in that the content referenced by these dynamic URIs is not included, but rather assumed to be fetched on demand. Namely, a package-aware tool can figure out how to fetch these if/when needed, so it is not necessary to carry them along in the file. However, this file is now opened by a package-unaware tool. These dynamic URIs cannot be dereferenced, since it does not understand packages. Let's now assume that this package is stored in SBH, and the tool requests this object from SBH by its dynamic URI. Since SBH has not been told from which package to fetch it, it will not know which version to provide.

The solution I propose is that, when no package is specified, SBH provides the latest version. This is what SBH does now with persistentIdentities, so it is backwards-compatible functionality. The requirement, though, is that SBH needs to know what the latest version is. This gets us back to my suggestion that we have a way to know where the version is in the URI, and also a way to know which set of objects are different versions of the same object. In SBOL2, this was done with the PersistentIdentity and Version fields. If you do not want to re-introduce these, then another solution is to ensure that our restrictions on URIs make it clear when objects are different versions of the same entity and where the version is in the URI, so the latest one can be determined.

jakebeal commented 2 years ago

> These dynamic URIs cannot be dereferenced, since it does not understand packages

This is the statement that is untrue. The proposal says that a package server that can return http://example.com/myDevices/1.0/device1 will also return the latest version when one requests http://example.com/myDevices/device1.

So if you implement what you propose for SBH, it can be in compliance with SEP 054 (as of #119). You can implement this by making SBH use Package information, or you can avoid using Package information and parse the URIs using the URI structure laid out in SEP 054. If you do not use Package information, however, you will be unable to distinguish between a situation in which there are versions and a situation in which somebody is not using packages and has just made things with URLs that look like versions.
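To illustrate the second option (parsing URIs without Package information), here is a minimal Python sketch. It assumes, based only on the example URIs above, that the version is a dotted-numeric path segment immediately before the local name; the authoritative URI structure is the one laid out in SEP 054. The names `split_version`, `build_index`, and `resolve` are hypothetical helpers, not part of SBH or any SBOL library.

```python
import re

# Assumption: a version is a dotted-numeric path segment such as "1.0" or "2.10.3".
VERSION_SEGMENT = re.compile(r"^\d+(\.\d+)*$")


def split_version(uri):
    """Return (versionless_uri, version_tuple), or (uri, None) if no version is found."""
    parts = uri.split("/")
    # Only the segment just before the local name is treated as a version.
    if len(parts) >= 2 and VERSION_SEGMENT.match(parts[-2]):
        version = tuple(int(n) for n in parts[-2].split("."))
        return "/".join(parts[:-2] + parts[-1:]), version
    return uri, None


def build_index(uris):
    """Group versioned URIs by their versionless form."""
    index = {}
    for uri in uris:
        base, version = split_version(uri)
        if version is not None:
            index.setdefault(base, {})[version] = uri
    return index


def resolve(index, requested):
    """Return the stored URI for a request; a versionless request gets the latest version."""
    base, version = split_version(requested)
    versions = index.get(base)
    if not versions:
        return None
    if version is not None:
        return versions.get(version)
    # Compare versions numerically, so 1.10 is later than 1.2.
    return versions[max(versions)]


index = build_index([
    "http://example.com/myDevices/1.0/device1",
    "http://example.com/myDevices/1.2/device1",
    "http://example.com/myDevices/1.10/device1",
])
latest = resolve(index, "http://example.com/myDevices/device1")
# latest == "http://example.com/myDevices/1.10/device1"
```

Note that this sketch has exactly the limitation described above: a URI whose second-to-last segment merely looks like a version is treated as versioned, because nothing here consults Package information.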

cjmyers commented 2 years ago

OK, I think this may work. If I'm understanding correctly, SBH can make assumptions about URI structure, since it is a minter of URIs. I still need to think more about how this might be implemented. The PersistentIdentities were useful for SPARQL queries, since they gave an easy way to fetch all objects that were different versions of the same object. If I cannot think of another way to easily do this, I guess SBH could create these triples to make fetching easier without them needing to be officially reintroduced into SBOL.

I will need to think through how this would be implemented in SBH, since I think that, for this SEP to be successful, providers of SBOL content will need to support it.

jamesamcl commented 2 years ago

Hi, sorry for the late reply on this. First thoughts - more to come:

  1. I am concerned about what are clearly (to me) subclasses of Package being defined using boolean properties of Package, rather than being their own types. This is given away by the use of "is a" in the language where the conversion and dissociated properties are described: "indicates if this **is a** native package (false) or conversion package (true)". As far as I'm concerned, as soon as you start saying "is a", it's a type rather than a property!

     I think it would be cleaner to define ConversionPackage as an explicit subclass of Package. I also note that the proposal says "every dissociated package SHOULD be a conversion package". Is there any reason this can't be MUST? If we change it to MUST, we can merge the two, as dissociated packages don't have any additional properties. This would reduce the number of classes and eliminate any use of the word "dissociated", which I find a bit confusing, unlike "conversion", which describes explicitly what the package is used for. The final hierarchy would therefore simply be Collection - Package - ConversionPackage.

     Some of the sections like "Defining a Dissociated Package" would honestly make a lot more sense if they were "Defining a Conversion Package" ... because the whole section is about conversion anyway.

  2. I am really not sure about the reintroduction of the idea of an SBOL "document" in this SEP. We worked hard to eliminate this because it makes very little sense in an RDF world when you think of SBOL as a graph. If there is any grouping of SBOL entities, I think it should be formalised by the data model explicitly, not defined implicitly by their incidental presence in the same triplestore, or RDF file, or whatever. I would strongly advise changing this SEP such that, for example, the "Defining a Package from a Document" section would become "Defining a Package from a set of SBOL TopLevels".

     I don't think this would significantly change any of the proposal, mostly just the wording, but it is important IMO to make sure we don't return to assigning semantic meaning to where the triples are located. Of course it's fine to use the location of the triples to /establish/ a collection or a package, but at that point you always have better and much more specific words than "document": you have "file", or "triplestore".

  3. I am not sure about the inclusion of "sip" in the core SBOL specification. We currently don't even require anything about serialization formats, but this is above and beyond, right down to directory structure. I broadly agree with the sip proposal, but I wonder if it should be auxiliary to the SBOL data model specification? Unlike the SBOL spec, which is very abstract, sip is low level and practical, and I think it will change the moment we start to implement it. I imagine sip at first as more of a working interoperability spec in a GitHub repo which evolves with its implementations. I would be much happier to vote on packages alone for inclusion into SBOL 3.x.

jakebeal commented 2 years ago

@udp Thank you for the feedback - this is very helpful.

  1. All dissociated packages are conversion packages, but not all conversion packages are dissociated packages. NCBI is a dissociated package, because one cannot possibly drink it down in one gulp. The FreeGenes Open Reporters collection, on the other hand, is a great candidate for a conversion package: it's a nice discrete, well curated collection, but it's not in SBOL3 format. I could see making ConversionPackage a sub-class of Package, and giving it the dissociated property.

  2. The key idea I'm trying to get at is the granularity on which a blob of RDF gets version controlled. We need boundaries, or else we cannot version control, and there is often sub-structure within those boundaries, e.g., namespaces within a triple store, a collection of RDF files in a directory. I'm happy to change the language as long as it can apply to both the files of a directory and a set of collections in SynBioHub. Any suggestions?

  3. I'd be fine with declaring all of the sip-related material as supplementary information about a proposed implementation, such that it doesn't get included in the standard per se. I think it's important to have in the SEP as an example of intended package use so we can work the details all the way through, however.
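As a rough illustration of the option in point 1 (ConversionPackage as a sub-class of Package that carries the dissociated property), here is a minimal Python sketch. It is not the SBOL data model or any library's API: the field names `uri` and `members` and the example URIs are placeholders assumed for illustration, and only the class relationship reflects the discussion.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Collection:
    uri: str  # placeholder identifier, assumed for illustration
    members: List[str] = field(default_factory=list)


@dataclass
class Package(Collection):
    """A native package: no conversion-related properties at all."""


@dataclass
class ConversionPackage(Package):
    """A package converted from a non-SBOL3 source."""
    # True when the source is too large to retrieve as a whole.
    dissociated: bool = False


# Hypothetical URIs; the two sources are the examples from point 1.
open_reporters = ConversionPackage("http://example.com/freegenes/open_reporters")
ncbi = ConversionPackage("http://example.com/ncbi", dissociated=True)
native = Package("http://example.com/myDevices")
```

With this shape, "all dissociated packages are conversion packages" holds by construction, since a plain `Package` has no `dissociated` attribute to set, while a conversion package can still be either dissociated or not.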

cjmyers commented 2 years ago

What does sip stand for?

jakebeal commented 2 years ago

SIP is defined in the section "Publishing Packages to a Shared Package Catalog" as "SBOL index of packages".

cjmyers commented 2 years ago

Agree with @udp that it should not be in the spec, but seems ok in the SEP

cjmyers commented 1 year ago

Closing because this has been published as a best practice here:

https://github.com/SynBioDex/SBOL-examples/tree/main/SBOL/best-practices/BP011