biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0

Incomplete sequence data with the 2019-09-19 release #73

Closed: izcram closed this issue 4 years ago

izcram commented 4 years ago

I'm encountering issues with the 2019-09-19 seqrepo data release.

Following the example in https://github.com/biocommons/biocommons.seqrepo#quick-start working with seqrepo 0.5.2:

seqrepo -r /Users/marc/nobackup/seqrepo pull -i 2019-09-19

Then, in ipython the following fails:

In [2]: from biocommons.seqrepo import SeqRepo
   ...: sr = SeqRepo("/Users/marc/nobackup/seqrepo/2019-09-19/")
   ...: sr["NC_000001.11"][780000:780020]
   ...:
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-2-f45145297c99> in <module>
      1 from biocommons.seqrepo import SeqRepo
      2 sr = SeqRepo("/Users/marc/nobackup/seqrepo/2019-09-19/")
----> 3 sr["NC_000001.11"][780000:780020]

~/software/anaconda/lib/python3.7/site-packages/biocommons/seqrepo/seqrepo.py in __getitem__(self, nsa)
     69         # lookup aliases, optionally namespaced, like NM_01234.5 or NCBI:NM_01234.5
     70         ns, a = nsa.split(nsa_sep) if nsa_sep in nsa else (None, nsa)
---> 71         return self.fetch(alias=a, namespace=ns)
     72
     73     def __iter__(self):

~/software/anaconda/lib/python3.7/site-packages/biocommons/seqrepo/seqrepo.py in fetch(self, alias, start, end, namespace)
     95
     96     def fetch(self, alias, start=None, end=None, namespace=None):
---> 97         seq_id = self._get_unique_seqid(alias=alias, namespace=namespace)
     98         return self.sequences.fetch(seq_id, start, end)
     99

~/software/anaconda/lib/python3.7/site-packages/biocommons/seqrepo/seqrepo.py in _get_unique_seqid(self, alias, namespace)
    214         seq_ids = set(r["seq_id"] for r in recs)
    215         if len(seq_ids) == 0:
--> 216             raise KeyError("Alias {} (namespace: {})".format(alias, namespace))
    217         if len(seq_ids) > 1:
    218             # This should only happen when namespace is None

KeyError: 'Alias NC_000001.11 (namespace: None)'
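The same lookup can be tried for several accessions at once with a small helper. This is only a sketch: `missing_aliases` is a hypothetical name, not part of seqrepo, and it relies solely on the behavior visible in the traceback above, namely that indexing by an unknown alias raises `KeyError`. A plain dict stands in for the repository so the snippet runs without any downloaded data.

```python
def missing_aliases(repo, aliases):
    """Return the aliases for which repo[alias] raises KeyError.

    `repo` is anything indexable by alias, for example a SeqRepo instance.
    """
    missing = []
    for alias in aliases:
        try:
            repo[alias]
        except KeyError:
            missing.append(alias)
    return missing


# Stand-in for a snapshot that contains only one of the expected records.
fake_repo = {"NC_000002.12": "ACGT"}
expected = ["NC_000001.11", "NC_000002.12"]
print(missing_aliases(fake_repo, expected))
```

Passing a real `SeqRepo` object in place of `fake_repo` should work the same way, assuming the snapshot path is valid.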

Interestingly, the directory size of /Users/marc/nobackup/seqrepo/2019-09-19/ is 690M, which is way smaller than previous releases.

Am I missing anything or is there an issue with the data repository at http://dl.biocommons.org/seqrepo/2019-09-19/ ?

Thanks in advance for looking into this!

acoffman commented 4 years ago

I am encountering the same issue, for what it's worth. I rolled back to the 2019-06-20 release.

reece commented 4 years ago

Apologies folks. It appears that 2019-09-19 was corrupted. We're investigating the cause now. I've just yanked it from downloads.

reece commented 4 years ago

Summary

The 2019-09-19 release (which was actually released recently) was incomplete. A new release, 2020-04-02, follows the 2019-06-20 release and adds all NCBI sequences released since then: 10565 sequences and 358472 aliases. seqrepo 0.5.3 was recently released and is required to pull any release dated after 2019 (see #74).

Cause

The loading workflow is to load sequences into the master instance (directory), then snapshot it as a yyyy-mm-dd directory and move that to the download area. During a clean-up, someone (probably me) dropped or moved the master directory. For the next loading iteration, master was silently recreated by seqrepo and therefore contained only the sequences that were part of that update.
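One way to guard against this kind of silent recreation is to check that the master directory already exists before loading. The following is a minimal sketch, not something seqrepo provides: `require_existing_instance` is a hypothetical helper, and the layout (a `master` directory directly under the root) is assumed from the paths shown in this thread.

```python
import tempfile
from pathlib import Path


def require_existing_instance(root_dir, instance_name="master"):
    """Return the instance path, or raise if it does not already exist."""
    instance_dir = Path(root_dir) / instance_name
    if not instance_dir.is_dir():
        raise FileNotFoundError(
            f"{instance_dir} is missing; refusing to load into a fresh, empty instance"
        )
    return instance_dir


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as root:
    try:
        require_existing_instance(root)
        guard_tripped = False
    except FileNotFoundError:
        guard_tripped = True  # master is absent, so the guard stops the load

    (Path(root) / "master").mkdir()
    found = require_existing_instance(root).name  # now the check passes

print(guard_tripped, found)
```

Running a check like this ahead of the load step would turn a missing master into a hard error rather than an incomplete snapshot.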

Note: seqrepo 0.5.2 matched directory names with a regexp that started with ^201\d. This filtered out directories from 2020. This was fixed in #74 and will be released imminently as 0.5.3.
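The effect of that prefix is easy to see in isolation. In the sketch below, only the leading `^201\d` comes from the note above; the remainder of `OLD_PATTERN` and the whole of `NEW_PATTERN` are assumptions made for illustration and are not taken from the seqrepo source or from #74.

```python
import re

# Assumption: only the leading "^201\d" is from the note above;
# the rest of each pattern is made up for illustration.
OLD_PATTERN = re.compile(r"^201\d-\d\d-\d\d$")
NEW_PATTERN = re.compile(r"^\d{4}-\d\d-\d\d$")

names = ["2019-06-20", "2019-09-19", "2020-04-02", "master"]

old_matches = [n for n in names if OLD_PATTERN.match(n)]
new_matches = [n for n in names if NEW_PATTERN.match(n)]

print(old_matches)  # the 2020 snapshot is filtered out
print(new_matches)  # all dated snapshots are kept
```

A pattern anchored on `201` can only match the years 2010 through 2019, which is why a 2020 snapshot is invisible to it.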

Fix

N.B. I actually built in /tmp, which is much faster than EFS on AWS. Hence the -r /tmp below.

Before

(default-3.7) biocommons@ip-10-30-1-68:~/seqrepo$ seqrepo -r /tmp show-status -i master
seqrepo 0.5.2
instance directory: /tmp/master, 11.5 GB
backends: fastadir (schema 1), seqaliasdb (schema 1) 
sequences: 878228 sequences, 103007484724 residues, 275 files
aliases: 10121032 aliases, 9767422 current, 48 namespaces, 878228 sequences

Loading

(default-3.7) biocommons@ip-10-30-1-68:~$ seqrepo -r /tmp load -n NCBI $(cat /tmp/sources)

After

(default-3.7) biocommons@ip-10-30-1-68:~$ seqrepo -r /tmp show-status -i master
seqrepo 0.5.2
instance directory: /tmp/master, 11.6 GB
backends: fastadir (schema 1), seqaliasdb (schema 1) 
sequences: 888793 sequences, 103057783776 residues, 278 files
aliases: 10479504 aliases, 10125894 current, 48 namespaces, 888793 sequences
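The figures quoted in the summary can be checked against the two status outputs above. This is a small sketch that parses the `sequences:` and `aliases:` lines as they appear in this thread; `parse_counts` is a hypothetical helper, and the line format is assumed to match what `show-status` printed here.

```python
import re

BEFORE = """\
sequences: 878228 sequences, 103007484724 residues, 275 files
aliases: 10121032 aliases, 9767422 current, 48 namespaces, 878228 sequences
"""

AFTER = """\
sequences: 888793 sequences, 103057783776 residues, 278 files
aliases: 10479504 aliases, 10125894 current, 48 namespaces, 888793 sequences
"""


def parse_counts(status_text):
    """Extract the total sequence and alias counts from show-status output."""
    seqs = re.search(r"^sequences: (\d+) sequences", status_text, re.MULTILINE)
    aliases = re.search(r"^aliases: (\d+) aliases", status_text, re.MULTILINE)
    return {"sequences": int(seqs.group(1)), "aliases": int(aliases.group(1))}


before = parse_counts(BEFORE)
after = parse_counts(AFTER)
delta = {key: after[key] - before[key] for key in before}
print(delta)  # {'sequences': 10565, 'aliases': 358472}
```

The differences match the 10565 sequences and 358472 aliases reported in the summary.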

@andreasprlic

reece commented 4 years ago

Follow-up: A new release, 2020-04-13, is now available. It fixes a number of smaller issues with recent releases.