biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0
39 stars 35 forks source link

SeqRepo Query Produces KeyError (namespace not unique) #67

Closed akeeeshi closed 5 years ago

akeeeshi commented 5 years ago

Hello,

I was playing around with extracting sequences from SeqRepo for ensembl transcripts and noticed an issue that is thrown when there are multiple sequences returned throwing a KeyError.

Successful example with BRAF ensembl transcript:

In [18]: sr['ENST00000288602'] INFO:biocommons.seqrepo.fastadir.fastadir:Opening for reading: /usr/local/share/seqrepo/2017-07-04/sequences/2016/0824/050656/1472015216.5390692.fa.bgz Out[18]: 'CGCCTCCCTTCCCCCTCCCCGCCCGACAGCGGCCGCTCGGGCCCCGGCTCTCGGTTATAAGATGGCGGCGCTGAGCGGTGGCGGTGGTGGCGGCGCGGAGCCGGGCCAGGCTCTGTTCAACGGGGACATGGAGCC .... (shortened for readibility)'

Unsuccessful example with IGHM transcript

In [19]: sr['ENST00000390559']

KeyError                                  Traceback (most recent call last)
/Users/admin/.pyenv/versions/3.6.5/lib/python3.6/site-packages/hgvs/shell.py in <module>()
----> 1 sr['ENST00000390559']

/Users/admin/.pyenv/versions/3.6.5/lib/python3.6/site-packages/biocommons/seqrepo/seqrepo.py in __getitem__(self, nsa)
     65         # lookup aliases, optionally namespaced, like NM_01234.5 or NCBI:NM_01234.5
     66         ns, a = nsa.split(nsa_sep) if nsa_sep in nsa else (None, nsa)
---> 67         return self.fetch(alias=a, namespace=ns)
     68
     69     def __iter__(self):

/Users/admin/.pyenv/versions/3.6.5/lib/python3.6/site-packages/biocommons/seqrepo/seqrepo.py in fetch(self, alias, start, end, namespace)
     87         if len(seq_ids) > 1:
     88             # This should only happen when namespace is None
---> 89             raise KeyError("Alias {} (namespace: {}): not unique".format(alias, namespace))
     90
     91         return self.sequences.fetch(seq_ids.pop(), start, end)

KeyError: 'Alias ENST00000390559 (namespace: None): not unique'

It appears to not actually be a KeyError but rather that the same alias (e.g. ENST00000390559) exists for namespaces (i.e. ensemble versions). This is important because when trying to annotate using the HGVS package, the namespace is not specified. Is there a specific reason for this behavior?

reece commented 5 years ago

This behavior is expected and desired. As the error says, that accession does not point to a unique sequence.

Ensembl accessions without version numbers do not refer to a single sequence. You either need to specify the version number or the ensembl release in which the sequence occurred.

In [8]: sorted( (r["namespace"],r["alias"],r["seq_id"]) for r in sr.aliases.find_aliases(alias="ENST00000390559") )                                                                             
Out[8]: 
[('Ensembl-75', 'ENST00000390559', 'VtZEHJ1ekdQJeXuNmZ2eQ9vgsDFPSuUu'),
 ('Ensembl-76', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ'),
 ('Ensembl-77', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ'),
 ('Ensembl-78', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ'),
 ('Ensembl-79', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ'),
 ('Ensembl-80', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ'),
 ('Ensembl-81', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ'),
 ('Ensembl-82', 'ENST00000390559', 'MGA27-VbNXqbP2StRGixxhtcyhfynGYJ')]
akeeeshi commented 5 years ago

Thank you for the help. This clears things up.