biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0

Add support for derived sequences #92

Closed reece closed 2 years ago

reece commented 3 years ago

Problem

It's often necessary to derive sequences from others, especially genomic sequences used for alignment (see Heng Li's blog post).

As a specific example, the official GRCh38 sequences use lowercase to indicate masked regions and include IUPAC ambiguity characters. Derived versions (for example, uppercased or with ambiguity characters replaced) have the same length as the originals, but different content. Nonetheless, people want to refer to them using names like "GRCh38:1" or perhaps even "NC_000001.11".

Example transformations are 1) uppercase, 2) replace ambiguity with N, 3) reverse complement, and combinations of these.
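
To make the three transformations concrete, here is a minimal sketch in plain Python. The function names (`uppercase`, `mask_ambiguity`, `reverse_complement`, `compose`) are hypothetical and not part of SeqRepo; the sketch also assumes that "replace ambiguity with N" should preserve case (so `y` becomes `n`).

```python
from functools import reduce

# Complement table covering IUPAC nucleotide codes in both cases.
_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVNacgtryswkmbdhvn",
    "TGCAYRSWMKVHDBNtgcayrswmkvhdbn",
)

# Map every IUPAC ambiguity code (other than N itself) to N, preserving case.
_AMBIGUITY_TO_N = str.maketrans(
    "RYSWKMBDHVryswkmbdhv",
    "N" * 10 + "n" * 10,
)


def uppercase(seq: str) -> str:
    """Drop soft-masking by uppercasing the sequence."""
    return seq.upper()


def mask_ambiguity(seq: str) -> str:
    """Replace IUPAC ambiguity codes with N (or n)."""
    return seq.translate(_AMBIGUITY_TO_N)


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def compose(*transforms):
    """Return a function that applies the transforms left to right."""
    return lambda seq: reduce(lambda s, f: f(s), transforms, seq)


normalize_then_rc = compose(uppercase, mask_ambiguity, reverse_complement)
print(normalize_then_rc("aacR"))  # NGTT
```

None of these transforms changes the sequence length, which is why a derived sequence can share coordinates with its source.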

Possible solutions

1. Precompute all transformed sequences and treat them like all other sequences. This will be very expensive in space.

2. Enable transformations on read, configured at connection time (e.g., SeqRepo(root=..., uppercase=True)). This works well for uppercasing and ambiguity replacement (which are likely constant for the session), but is impractical for reverse complement. By changing Python list semantics, we could co-opt negative coordinates for this (e.g., sr["NM_01234.5"][-1000:-900] would provide the reverse complement of that range).

3. Create namespaces that imply certain transformations. For example, GRCh38uc (or GRCh38/uc) might indicate an uppercase transform of GRCh38.
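
The second and third options above could be sketched roughly as follows. This is a toy, dict-backed illustration, not SeqRepo's API: `ToyStore`, `resolve`, the `TRANSFORMS` table, and the `rc` code are all hypothetical, and the `/`-separated suffix syntax is just one reading of the "GRCh38/uc" idea.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical transform codes; the issue itself only proposes "uc".
TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "uc": str.upper,
    "rc": lambda s: s.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1],
}


def resolve(identifier: str) -> Tuple[str, List[str]]:
    """Split 'GRCh38/uc:1' into the stored key 'GRCh38:1' and codes ['uc']."""
    namespace, sep, alias = identifier.partition(":")
    if not sep:  # no namespace given, so no implied transforms
        return identifier, []
    base, *codes = namespace.split("/")
    return f"{base}:{alias}", codes


class ToyStore:
    """Stores each sequence once and derives variants when reading."""

    def __init__(self, sequences: Dict[str, str], uppercase: bool = False):
        self._sequences = sequences
        self._uppercase = uppercase  # session-wide transform (option 2)

    def fetch(self, identifier: str,
              start: Optional[int] = None, end: Optional[int] = None) -> str:
        key, codes = resolve(identifier)
        seq = self._sequences[key][start:end]  # slice first, then transform
        if self._uppercase:
            seq = seq.upper()
        for code in codes:  # namespace-implied transforms (option 3)
            seq = TRANSFORMS[code](seq)
        return seq


store = ToyStore({"GRCh38:1": "acgtNNacgt"})
print(store.fetch("GRCh38:1", 0, 4))        # acgt
print(store.fetch("GRCh38/uc:1", 0, 4))     # ACGT
print(store.fetch("GRCh38/uc/rc:1", 0, 3))  # CGT
```

Because the transforms here preserve length, slicing before transforming keeps the cost proportional to the requested range rather than to the whole chromosome, and nothing extra is stored on disk.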

Notes / Challenges / Constraints

reece commented 2 years ago

See #94