biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0

Add sequence-level annotations #113

Closed (ahwagner closed this issue 1 year ago)

ahwagner commented 1 year ago

It would be useful for supporting downstream methods (e.g. circular sequence support #70) to store some basic characteristics about a sequence at the sequence level. This MAY be accomplished by adding these annotations to the FASTA key fields.

I think we would minimally like to have:

and in the event it is nucleic acid:

To accomplish this, @ccaitlingo and I discussed extending the store and fetch methods of FastaDir to add these annotations to FASTA keys, in the following format:

>{digest}|{aa / na}|{linear / circular}|{single / double}

or a compressed version of the above (i.e. bitflags). Making this issue for discussion and progress.
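To make the two options concrete, here is a minimal sketch of how such a key could be written and read back, in both the verbose and the bitflag form. Everything named here (`SeqAnnotation`, `format_key`, `parse_key`, the flag constants) is hypothetical and for illustration only; none of it is part of FastaDir or seqrepo. It also assumes the digest never contains a `|` character, which holds for base64url-style digests.

```python
from dataclasses import dataclass

# Hypothetical allowed values, taken from the key format proposed above.
SEQ_TYPES = ("aa", "na")
TOPOLOGIES = ("linear", "circular")
STRANDS = ("single", "double")

# Hypothetical bitflag layout for the compressed form.
FLAG_NA = 0b001        # set: nucleic acid, unset: amino acid
FLAG_CIRCULAR = 0b010  # set: circular, unset: linear
FLAG_DOUBLE = 0b100    # set: double-stranded, unset: single-stranded


@dataclass(frozen=True)
class SeqAnnotation:
    seq_type: str   # "aa" or "na"
    topology: str   # "linear" or "circular"
    strands: str    # "single" or "double"

    def __post_init__(self):
        # Reject values outside the proposed vocabulary.
        if self.seq_type not in SEQ_TYPES:
            raise ValueError(f"bad sequence type: {self.seq_type!r}")
        if self.topology not in TOPOLOGIES:
            raise ValueError(f"bad topology: {self.topology!r}")
        if self.strands not in STRANDS:
            raise ValueError(f"bad strandedness: {self.strands!r}")


def format_key(digest: str, ann: SeqAnnotation) -> str:
    """Build a FASTA header line in the verbose proposed format."""
    return f">{digest}|{ann.seq_type}|{ann.topology}|{ann.strands}"


def parse_key(line: str) -> tuple[str, SeqAnnotation]:
    """Split a verbose header line back into digest and annotation."""
    line = line.strip()
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    digest, seq_type, topology, strands = line[1:].split("|")
    return digest, SeqAnnotation(seq_type, topology, strands)


def to_flags(ann: SeqAnnotation) -> int:
    """Pack the annotation into a small integer (the compressed form)."""
    flags = 0
    if ann.seq_type == "na":
        flags |= FLAG_NA
    if ann.topology == "circular":
        flags |= FLAG_CIRCULAR
    if ann.strands == "double":
        flags |= FLAG_DOUBLE
    return flags


def from_flags(flags: int) -> SeqAnnotation:
    """Unpack the compressed form back into an annotation."""
    return SeqAnnotation(
        "na" if flags & FLAG_NA else "aa",
        "circular" if flags & FLAG_CIRCULAR else "linear",
        "double" if flags & FLAG_DOUBLE else "single",
    )
```

With this layout the three properties fit in a single digit (0 to 7), so the compressed key could look like `>{digest}|{flags}`. The verbose form is easier to read when inspecting FASTA files by hand.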

ahwagner commented 1 year ago

Related issue for refget is still open (samtools/hts-specs#626), but conversation with @andrewyatz confirmed that this will not be addressed in upcoming RefGet v2 release, and it is not clear if there are plans for a RefGet v3 in the near term.

andrewyatz commented 1 year ago

Question for me is if seqcol would solve the issue for you or not. If not then we need to consider a next step.

reece commented 1 year ago

Based on a discussion with @andreasprlic and @ahwagner, we have decided to shelve this project. The rationale follows.

A core assumption of seqrepo is that sequences are referenced by computed identifiers and nothing else. It is impossible to preserve this feature while also making sequence identifiers aware of other properties like sequence type, topology/circularity, taxonomy, or anything else. Sequences need to remain as verbatim strings.
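As an illustration of that assumption, the sketch below computes an identifier purely from the sequence string, using the GA4GH-style truncated digest (SHA-512, first 24 bytes, base64url-encoded). The function names are chosen for this example and are not seqrepo's API; treat it as a sketch of the idea rather than the actual implementation.

```python
import base64
import hashlib


def sha512t24u(data: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(data).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def seq_id(sequence: str) -> str:
    """Hypothetical helper: identifier derived from the sequence alone."""
    return sha512t24u(sequence.encode("ascii"))


# The identifier depends only on the characters of the sequence. There is
# no input through which type, topology, or strandedness could enter, so
# the same string always maps to the same identifier, whatever molecule
# it came from.
plasmid_id = seq_id("ACGT")      # imagine this string came from a circular plasmid
chromosome_id = seq_id("ACGT")   # and this one from a linear chromosome
print(plasmid_id == chromosome_id)
```

Making the identifier differ by topology or type would mean hashing something other than the verbatim string, which is exactly what the assumption rules out.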

In principle, properties could be added to the sequence alias records. For example, the alias record could track the sequence type, circularity, strandedness, or anything else. This raises a slew of challenges:

For all of these reasons, we will not be adding sequence properties to seqrepo. Instead, if consumers need to know the sequence type, circularity, or strandedness, they will have to find another source for that info.