I am encountering the same issue, for what it's worth. I rolled back to the 2019-06-20 release.
Apologies folks. It appears that 2019-09-19 was corrupted. We're investigating the cause now. I've just yanked it from downloads.
The 2019-09-19 release (which was actually released recently) was incomplete. A new release, 2020-04-02, follows the 2019-06-20 release and adds all NCBI sequences released since, adding 10565 sequences and 358472 aliases. seqrepo 0.5.3 was recently released and is required to pull any release after 2019 (see #74).
The loading workflow is to load sequences into the master instance (directory), then snapshot it as a yyyy-mm-dd release and move that to the download area. During a clean-up, someone (probably me) dropped or moved the master directory. For the next loading iteration, master was silently recreated by seqrepo and therefore contained only sequences that were part of that update.
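One way to avoid this failure mode is to check that master exists and is non-empty before loading. The following is only a minimal sketch: `require_existing_instance` is a hypothetical helper, not part of seqrepo, and the only assumption about the layout is that the instance is a directory named master under the root.

```python
import tempfile
from pathlib import Path


def require_existing_instance(root, instance="master"):
    """Return the instance directory, or raise if it is missing or empty.

    Hypothetical pre-load guard; seqrepo itself does not provide this.
    """
    path = Path(root) / instance
    if not path.is_dir() or not any(path.iterdir()):
        raise FileNotFoundError(
            f"{path} is missing or empty; refusing to load into a fresh instance"
        )
    return path


# Demonstration against an empty temporary root: the guard refuses to proceed.
with tempfile.TemporaryDirectory() as root:
    try:
        require_existing_instance(root)
    except FileNotFoundError as exc:
        print(f"refusing to load: {exc}")
```

Run before the load step, a check like this turns a silent recreation into a visible error.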
Note: seqrepo 0.5.2 matched directory names with a regexp that started ^201\d. This filtered out directories from 2020. This was fixed in #74 and will be released imminently as 0.5.3.
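To illustrate why 2020 directories disappeared, here is a small sketch using only the prefix quoted above. The full pattern in seqrepo 0.5.2 is not shown in this thread, and `HYPOTHETICAL_FIX` is an illustrative date pattern, not necessarily what #74 implemented.

```python
import re

# Prefix quoted in this thread; the rest of the original pattern is not shown.
OLD_PREFIX = re.compile(r"^201\d")

# Assumption for illustration only: accept any yyyy-mm-dd directory name.
HYPOTHETICAL_FIX = re.compile(r"^\d{4}-\d{2}-\d{2}$")

names = ["2019-06-20", "2019-09-19", "2020-04-02", "master"]

old_matches = [n for n in names if OLD_PREFIX.match(n)]
new_matches = [n for n in names if HYPOTHETICAL_FIX.match(n)]

print(old_matches)  # ['2019-06-20', '2019-09-19']
print(new_matches)  # ['2019-06-20', '2019-09-19', '2020-04-02']
```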
master was recreated from the 2019-06-20 release, which was the most recent complete sequence set.
2020-04-02 was created and pushed into the download area.
N.B. I actually built in /tmp, which is much faster than EFS on AWS. Hence the -r /tmp below.
(default-3.7) biocommons@ip-10-30-1-68:~/seqrepo$ seqrepo -r /tmp show-status -i master
seqrepo 0.5.2
instance directory: /tmp/master, 11.5 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 878228 sequences, 103007484724 residues, 275 files
aliases: 10121032 aliases, 9767422 current, 48 namespaces, 878228 sequences
(default-3.7) biocommons@ip-10-30-1-68:~$ seqrepo -r /tmp load -n NCBI $(cat /tmp/sources)
(default-3.7) biocommons@ip-10-30-1-68:~$ seqrepo -r /tmp show-status -i master
seqrepo 0.5.2
instance directory: /tmp/master, 11.6 GB
backends: fastadir (schema 1), seqaliasdb (schema 1)
sequences: 888793 sequences, 103057783776 residues, 278 files
aliases: 10479504 aliases, 10125894 current, 48 namespaces, 888793 sequences
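The added counts quoted for 2020-04-02 can be checked against the two show-status outputs above. This is just a sketch that parses the two relevant lines as printed in this thread; the regular expressions assume that exact line format.

```python
import re

# Lines copied from the show-status output before and after the load.
BEFORE = """\
sequences: 878228 sequences, 103007484724 residues, 275 files
aliases: 10121032 aliases, 9767422 current, 48 namespaces, 878228 sequences
"""
AFTER = """\
sequences: 888793 sequences, 103057783776 residues, 278 files
aliases: 10479504 aliases, 10125894 current, 48 namespaces, 888793 sequences
"""


def counts(status):
    """Extract (sequence count, alias count) from show-status text."""
    seqs = int(re.search(r"^sequences: (\d+) sequences", status, re.M).group(1))
    aliases = int(re.search(r"^aliases: (\d+) aliases", status, re.M).group(1))
    return seqs, aliases


before = counts(BEFORE)
after = counts(AFTER)
delta = (after[0] - before[0], after[1] - before[1])
print(delta)  # (10565, 358472)
```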
@andreasprlic
Follow-up: A new release, 2020-04-13, was just released. It fixes a number of smaller issues with recent releases.
I'm encountering issues with the 2019-09-19 seqrepo data release.
Following the example in https://github.com/biocommons/biocommons.seqrepo#quick-start, working with seqrepo 0.5.2:
seqrepo -r /Users/marc/nobackup/seqrepo pull -i 2019-09-19
Then, in ipython the following fails:
Interestingly, the directory size of /Users/marc/nobackup/seqrepo/2019-09-19/ is 690M, which is way smaller than previous releases.
Am I missing anything, or is there an issue with the data repository at http://dl.biocommons.org/seqrepo/2019-09-19/ ?
Thanks in advance for looking into this!