biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0

RFC: How to handle sequence casing and other transformations #94

Open reece opened 3 years ago

reece commented 3 years ago

Problem Summary

There are several flavors of GRCh38. All are coordinate compatible but have distinct sequences. "Official" GRCh38 sequences are uppercase and contain ambiguity characters. Ensembl replaces ambiguity characters with N. hg38 from UCSC represents repeat regions with lower case.

SeqRepo needs a way to preserve the original sequence verbatim, but also to support commonly used transformations, and to make this choice apparent to users.
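One way to picture this requirement is sketched below. It is only an illustration, not SeqRepo's API: the `_STORE` dict, the `fetch` function, and the `transform` parameter are all hypothetical names. The stored sequence is never modified, and any transformation has to be requested by name, so the choice is visible at the call site.

```python
from typing import Callable, Dict, Optional

# Hypothetical in-memory stand-in for a sequence store (not SeqRepo itself).
# The value is kept exactly as it was loaded.
_STORE: Dict[str, str] = {"example:1": "ACGTnnacgtRYACGT"}

# Named transformations; "verbatim" returns the stored sequence unchanged.
_TRANSFORMS: Dict[str, Callable[[str], str]] = {
    "verbatim": lambda s: s,
    "upper": str.upper,
}


def fetch(identifier: str, start: int = 0, end: Optional[int] = None,
          transform: str = "verbatim") -> str:
    """Return a slice of the stored sequence, optionally transformed.

    The transformation is applied to the slice only; the stored
    sequence is never altered.
    """
    try:
        fn = _TRANSFORMS[transform]
    except KeyError:
        raise ValueError(f"unknown transform: {transform!r}") from None
    return fn(_STORE[identifier][start:end])


print(fetch("example:1", 4, 10))                     # stored form of the slice
print(fetch("example:1", 4, 10, transform="upper"))  # explicit case-squashing
```

Defaulting to `"verbatim"` means an identifier always resolves to the original sequence unless the caller opts in to something else.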

Background

The GRC defines official genomic references, which include the assembly name, member accessions, nucleotide sequences, alternate assemblies, etc. For an example, see the GCF_000001405.26 assembly report.

According to GRCh38, the sequence referred to by both GRCh38:1 and refseq:NC_000001.11 is a (masked) sequence with ambiguity characters. It is unacceptable to hijack these identifiers to mean another sequence. However, these sequences are not very usable as-is because, for example, no one expects lower case in a genomic sequence. (Embedding annotations like masking into sequences is a mistake.)

Because the GRC sequences are inconvenient to use as-is, UCSC and Ensembl transform the sequences to be more useful. The transformations preserve coordinates but change the sequence content, for example by upper-casing or by replacing ambiguity characters with N. Thus, we have more than one version of each sequence for a given assembly.
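A minimal sketch of such coordinate-preserving transformations follows. It assumes that "ambiguity characters" means the IUPAC nucleotide codes other than A, C, G, T, and N; the function names are hypothetical, and the sketch makes no claim about exactly what UCSC or Ensembl do. Each transformation is a character-for-character substitution, so the length and every coordinate stay the same.

```python
# IUPAC nucleotide ambiguity codes other than N (assumed definition of
# "ambiguity characters" for this sketch).
_AMBIGUITY = "RYSWKMBDHV"

# Map each ambiguity code to N, preserving the case of the original character.
_TO_N = str.maketrans(
    _AMBIGUITY + _AMBIGUITY.lower(),
    "N" * len(_AMBIGUITY) + "n" * len(_AMBIGUITY),
)


def replace_ambiguity(seq: str) -> str:
    """Replace ambiguity codes with N (or n), one character for one character."""
    return seq.translate(_TO_N)


def squash_case(seq: str) -> str:
    """Upper-case the sequence, discarding any soft-masking."""
    return seq.upper()


original = "ACGTRYacgtnnKM"
flavors = {
    "verbatim": original,
    "no_ambiguity": replace_ambiguity(original),
    "upper": squash_case(original),
    "upper_no_ambiguity": squash_case(replace_ambiguity(original)),
}

# Every flavor has the same length, so a coordinate means the same
# position in all of them even though the characters may differ.
assert len({len(s) for s in flavors.values()}) == 1
for name, seq in flavors.items():
    print(f"{name:20s} {seq}")
```

Because the transformations are pure functions of the stored sequence, only the verbatim sequence needs to be stored; the other flavors can be derived on demand.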

While supporting case-squashing and disambiguating sequences, it should also be possible to support reverse complement and circular sequences and coordinates.
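For illustration, reverse complement and circular coordinates could be expressed in the same style. This is a sketch under stated assumptions, not a proposed implementation: the function names are hypothetical, coordinates are taken to be 0-based and half-open, and the complement table covers only A, C, G, and T (other characters, including N and ambiguity codes, pass through unchanged).

```python
# Complement table for unambiguous bases only; anything else is left as-is.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement, preserving case."""
    return seq.translate(_COMPLEMENT)[::-1]


def circular_slice(seq: str, start: int, end: int) -> str:
    """Slice a circular sequence using 0-based, half-open coordinates.

    If start <= end this is an ordinary slice. If start > end the slice
    wraps around the origin: seq[start:] followed by seq[:end].
    """
    n = len(seq)
    if not (0 <= start <= n and 0 <= end <= n):
        raise ValueError("coordinates must lie within the sequence")
    if start <= end:
        return seq[start:end]
    return seq[start:] + seq[:end]


print(reverse_complement("AACCGt"))
print(circular_slice("AACCGGTT", 2, 6))  # ordinary slice
print(circular_slice("AACCGGTT", 6, 2))  # wraps around the origin
```

Unlike case-squashing, reverse complement changes which base sits at each index, so callers would also need a convention for mapping coordinates between strands.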

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 7 days with no activity.

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 7 days.