Encoding specified after the encoded data

SynBioDex / SBOL-examples

A repository to share/discuss/ask/propose how to represent examples using SBOL and SBOLVisual.

Apache License 2.0

Encoding specified after the encoded data #9

Open Juul opened 7 years ago

Juul commented 7 years ago

In the currently available SBOL examples, the encoding tag within the sequence tag is specified after the end of the elements tag. This is problematic for streaming parsers, since they have to buffer the entire contents of each elements tag before they can decode it.

If the elements tag contains a lot of data, e.g. if a user of SBOL-compliant software decides to save a whole unannotated genome in SBOL format, then such a parser would have to load the entire genome into memory.

Possibly something to improve for future SBOL versions?

cjmyers commented 7 years ago

Thanks for the suggestion. However, it is not actually prescribed that it comes before or after. In XML, there is no order to the tags. I don’t think we can prescribe one. We could make our libraries serialize in one order, but we can never be sure some other implementation did not swap them, so we would need to be equipped for both. Keep in mind for DNA encoding is nearly always going to be the simple IUPAC code, and it really does not require any “decoding”.

Juul commented 7 years ago

Yes, I am proposing that it should not be a tag at all but rather an attribute of either the sequence or elements tag. The fact that you can currently encounter the encoding tag after the elements tag is causing issues with my streaming processor.

The reason I need to know the encoding is that I don't even know whether it's DNA, amino acid or SMILES data before I get to the encoding tag. I could look at the data itself, but you can have amino acid or SMILES data that consists only of characters that are legal in either format.

My streaming processor is building a BLAST database from a large number of user-uploaded files, and it needs to discard the SMILES data (and sometimes DNA or amino acid sequence data, depending on parameters) or the BLAST database command will exit with an error. I cannot even easily pre-categorize the SBOL files on user upload, since a single SBOL file could contain sequences with different encodings, so I'm left with no option but to buffer an unknown and potentially very large amount of sequence data.

Regardless, it's always good practice to keep metadata before the actual data, rather than leaving that decision to the implementors.
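To make the buffering problem concrete, here is a minimal sketch in Python using the standard library's SAX parser. It is not the processor described above: the sample document, the class name BufferingFilter and the function filter_by_encoding are made up for illustration, and the namespace and encoding URIs are the ones SBOL 2 is assumed to use. Because the encoding arrives after the elements text, every sequence has to be held in memory until its closing Sequence tag is reached.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces

# Assumed SBOL 2 namespace and encoding URIs.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SBOL_NS = "http://sbols.org/v2#"
DNA = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"
SMILES = "http://www.opensmiles.org/opensmiles.html"

# Made-up sample: in both sequences the encoding follows the elements.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:sbol="{SBOL_NS}">
  <sbol:Sequence rdf:about="http://example.org/seq/dna">
    <sbol:elements>attaaagaggagaaa</sbol:elements>
    <sbol:encoding rdf:resource="{DNA}"/>
  </sbol:Sequence>
  <sbol:Sequence rdf:about="http://example.org/seq/ethanol">
    <sbol:elements>CCO</sbol:elements>
    <sbol:encoding rdf:resource="{SMILES}"/>
  </sbol:Sequence>
</rdf:RDF>"""


class BufferingFilter(ContentHandler):
    """Keeps only the sequences whose encoding is in `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.kept = []           # sequences that passed the filter
        self.peak_buffered = 0   # most characters held for one sequence
        self._in_elements = False
        self._chunks = []
        self._buffered = 0
        self._encoding = None

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "Sequence"):
            self._chunks, self._buffered, self._encoding = [], 0, None
        elif name == (SBOL_NS, "elements"):
            self._in_elements = True
        elif name == (SBOL_NS, "encoding"):
            self._encoding = attrs.get((RDF_NS, "resource"))

    def characters(self, content):
        # The encoding is not known yet, so the text must be buffered.
        if self._in_elements:
            self._chunks.append(content)
            self._buffered += len(content)
            self.peak_buffered = max(self.peak_buffered, self._buffered)

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            self._in_elements = False
        elif name == (SBOL_NS, "Sequence"):
            # Only now can the keep-or-discard decision be made.
            if self._encoding in self.wanted:
                self.kept.append("".join(self._chunks))
            self._chunks, self._buffered = [], 0


def filter_by_encoding(xml_text, wanted):
    handler = BufferingFilter(wanted)
    parser = make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler


if __name__ == "__main__":
    result = filter_by_encoding(SAMPLE, {DNA})
    print(result.kept, result.peak_buffered)
```

In this sketch peak_buffered grows with the length of the longest sequence in the file, which is the memory cost being objected to.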

graik commented 7 years ago

I think this is a pretty typical example of why we need more tightly specified formats for everyday applications (i.e. "fully specified sequence"). It also shows that the sbol:type field should really be an rdf:type field defining specific sub-classes for DNA, protein, RNA and chemicals. The fact that one needs to parse all the fields of a ComponentDefinition before knowing whether it is the expected DNA or RNA or protein or even an abstract ray of light really complicates everyday use. A DNAComponentDefinition then could be tightly specified to guarantee a certain encoding.

Raik Grünberg http://www.raiks.de/contact.html

Juul commented 7 years ago

Hm, yes @graik that would definitely solve the problem. I don't know enough about SBOL to say if that might prevent some legitimate use-cases that mix DNA, protein and RNA.

cjmyers commented 7 years ago

Ah, I understand you now. You would like something like this:

<sbol:elements rdf:datatype="…">attaaagaggagaaa</sbol:elements>

I just tested this with libSBOLj, and it does not cause any problems to include the datatype in this way. Currently, libSBOLj will ignore this datatype field, meaning it gets dropped. However, I believe it should be preserving it, and we should fix it to do so. Not sure how libSBOL/pySBOL handle it. Would be worth a test.

In any case, I believe that even with SBOL today, you should be allowed to do this in your files, and it is, in my opinion, still legal SBOL serialization. I will log an issue to libSBOLj’s tracker to preserve this information. Hopefully, this will address your issue.
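As a rough illustration of why an attribute on the elements tag helps a streaming consumer, the sketch below reads an rdf:datatype attribute in the SAX start-element callback and then passes the character data straight through to an output stream without buffering it. The sample document, the class name StreamingFilter and the function stream_sequences are hypothetical, and using the SBOL 2 encoding URIs as datatype values is an assumption made for this example only.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces

# Assumed SBOL 2 namespace and encoding URIs.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SBOL_NS = "http://sbols.org/v2#"
DNA = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"
SMILES = "http://www.opensmiles.org/opensmiles.html"

# Made-up sample: the datatype is an attribute of the elements tag itself.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:sbol="{SBOL_NS}">
  <sbol:Sequence rdf:about="http://example.org/seq/dna">
    <sbol:elements rdf:datatype="{DNA}">attaaagaggagaaa</sbol:elements>
    <sbol:encoding rdf:resource="{DNA}"/>
  </sbol:Sequence>
  <sbol:Sequence rdf:about="http://example.org/seq/ethanol">
    <sbol:elements rdf:datatype="{SMILES}">CCO</sbol:elements>
    <sbol:encoding rdf:resource="{SMILES}"/>
  </sbol:Sequence>
</rdf:RDF>"""


class StreamingFilter(ContentHandler):
    """Writes wanted sequences to `sink` chunk by chunk, one per line."""

    def __init__(self, wanted, sink):
        super().__init__()
        self.wanted = wanted
        self.sink = sink
        self._emit = False

    def startElementNS(self, name, qname, attrs):
        if name == (SBOL_NS, "elements"):
            # The decision is made before any sequence text arrives.
            self._emit = attrs.get((RDF_NS, "datatype")) in self.wanted

    def characters(self, content):
        if self._emit:
            self.sink.write(content)  # pass through, nothing is buffered

    def endElementNS(self, name, qname):
        if name == (SBOL_NS, "elements"):
            if self._emit:
                self.sink.write("\n")
            self._emit = False


def stream_sequences(xml_text, wanted, sink):
    parser = make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(StreamingFilter(wanted, sink))
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))


if __name__ == "__main__":
    out = io.StringIO()
    stream_sequences(SAMPLE, {DNA}, out)
    print(out.getvalue(), end="")
```

A real consumer would still need a fallback for files that lack the attribute, since other serializers are free to omit it.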

graik commented 7 years ago

This would indeed be a pretty straightforward solution but is it also valid RDF? From the OWL reference:

NOTE: It is not illegal, although not recommended, for applications to define their own datatypes by defining an instance of rdfs:Datatype. Such datatypes are "unrecognized", but are treated in a similar fashion as "unsupported datatypes" (see Sec. 6.3 https://www.w3.org/TR/owl-ref/#DatatypeSupport for details about how these should be treated by OWL tools).

I don't know whether this applies only to ontology definitions but I doubt it. It could create a problem if "elements" receives a data type that is unknown to normal RDF tools instead of the "String" that it really is. Some parsers may decide to skip the entry at the low level. Instead we could use the standard rdf:type field on the level of "Sequence" to point to something like DNA sequence, Protein sequence, etc. This would essentially mean we define sub-classes of Sequence in the SBOL data model, which is still a pretty minimal solution. Sub-classing ComponentDefinition would be much better, IMO, but is a larger change.

cjmyers commented 7 years ago

I think the datatype should be fine since, in my experience, when it is ignored the value is treated as a string. Certainly it creates no issues with SBOL tools. Will check how Virtuoso handles it.

Chris

graik commented 7 years ago

"rdf:datatype" pointing to an .html address looks very wrong... sub-classing Sequence would be the much cleaner solution. Probably with additional benefits if we, further down the road, also implement it in the library data model.

cjmyers commented 7 years ago

The problem with sub-typing is that all current SBOL tools using our libraries will not treat the object as a Sequence but rather as a GenericTopLevel, so the tools will no longer work. The sub-typing solution requires a change to SBOL, all libraries and ultimately all software. So, this is a really heavy solution.

Here is a better one:

<sbol:Sequence rdf:about="…">
  <sbol:displayId>BBa_B0030_sequence</sbol:displayId>
  <sbol:version>1</sbol:version>
  <sbol:elements …>attaaagaggagaaa</sbol:elements>
  <sbol:encoding rdf:resource="…"/>
</sbol:Sequence>

I just checked this with SBOL Validator. The encoding is duplicated, so tools using our libraries will find it. It is also included in the elements field as was requested. I see no problem with Marc taking this approach. If this is useful, we can modify the library serialization to include this attribute. All existing tools will work, and new tools using a streaming parser should also be able to take advantage of this.
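For readers who want to try the duplicated-encoding idea, the following is a small sketch of a serializer written with Python's xml.etree.ElementTree. It is not how libSBOLj serializes anything: the function build_sequence, the example URI and, in particular, the attribute name sbol:encoding on the elements tag are assumptions chosen for illustration. The sketch writes the encoding property before the elements and repeats it as an attribute, so either kind of reader can find it.

```python
import xml.etree.ElementTree as ET

# Assumed SBOL 2 namespace and IUPAC DNA encoding URI.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SBOL_NS = "http://sbols.org/v2#"
DNA = "http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html"

# Use readable prefixes instead of ns0, ns1 in the output.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("sbol", SBOL_NS)


def q(namespace, local):
    """Builds an ElementTree qualified name such as {ns}local."""
    return f"{{{namespace}}}{local}"


def build_sequence(uri, display_id, version, encoding, elements):
    seq = ET.Element(q(SBOL_NS, "Sequence"), {q(RDF_NS, "about"): uri})
    ET.SubElement(seq, q(SBOL_NS, "displayId")).text = display_id
    ET.SubElement(seq, q(SBOL_NS, "version")).text = version
    # Metadata first: the regular encoding property for existing tools.
    ET.SubElement(seq, q(SBOL_NS, "encoding"), {q(RDF_NS, "resource"): encoding})
    # Hypothetical attribute that repeats the encoding on the elements tag.
    body = ET.SubElement(seq, q(SBOL_NS, "elements"), {q(SBOL_NS, "encoding"): encoding})
    body.text = elements
    return seq


if __name__ == "__main__":
    sequence = build_sequence(
        "http://example.org/BBa_B0030_sequence/1",  # made-up URI
        "BBa_B0030_sequence",
        "1",
        DNA,
        "attaaagaggagaaa",
    )
    print(ET.tostring(sequence, encoding="unicode"))
```

Note that child order is only a courtesy to streaming readers here; as discussed earlier in the thread, other implementations may emit the properties in any order.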

graik commented 7 years ago

I see. Keeping things within the sbol name space. Yes, this looks like a good fix. Perhaps Sequence sub-types can be raised again for sbol 3.

Greetings Raik

palchicz commented 5 years ago

@cjmyers, has the change been incorporated into the library already? And @Juul, does this address your concern?

jakebeal commented 4 years ago

I believe this is now moot for SBOL 3, which uses RDF as a serialization format (such that we don't have control of ordering) and which also allows genome-scale sequences to be stored as ExternalReference objects instead.

cjmyers commented 4 years ago

Not sure about this one. I think he wants the encoding to be a data type attribute. Might be worth further thought.

cjmyers commented 4 years ago

Should be dealt with by creating some genome editing use cases to ensure we do not need to store and exchange very large sequences.