georgeamccarthy / protein_search

The neural search engine for proteins.
GNU Affero General Public License v3.0

Shuffle protein data on first load and then store in memmap (on disk) instead of in memory. #54

Open georgeamccarthy opened 3 years ago

georgeamccarthy commented 3 years ago

PR type

Purpose

Why?

Feedback required over

Mentions

Future work

References

Legal

georgeamccarthy commented 3 years ago

Added a feature to log the number of culled proteins.

fissoreg commented 3 years ago

Future work

  • Currently using pandas to shuffle the data. One could use the jina built-in .shuffle (see cookbook). However, I couldn't get this working properly.
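For readers following along, the pandas shuffle mentioned above could look roughly like the sketch below. This is not the code from this PR: the helper name `shuffle_proteins`, the `id` and `sequence` columns, and the fixed seed are all assumptions made for illustration.

```python
import pandas as pd


def shuffle_proteins(df: pd.DataFrame, seed: int = 42) -> pd.DataFrame:
    """Return a row-shuffled copy of df with a fresh 0..n-1 index.

    Hypothetical helper, not taken from the protein_search code base.
    """
    # sample(frac=1) draws every row exactly once, which is a permutation.
    # random_state makes the order reproducible between runs.
    return df.sample(frac=1, random_state=seed).reset_index(drop=True)


# Made-up stand-in for the protein table (column names are assumptions).
proteins = pd.DataFrame(
    {
        "id": ["P1", "P2", "P3", "P4", "P5"],
        "sequence": ["MKT", "GAV", "LLE", "QRS", "TWY"],
    }
)

shuffled = shuffle_proteins(proteins)
print(shuffled)
```

Shuffling whole rows keeps each id paired with its sequence, and the fixed seed means a later load produces the same order.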

Apparently the shuffle method is a recent addition: https://github.com/jina-ai/jina/commit/2302e456165810c9d9f8d6df1505a0aabd2edc76

It will work if you upgrade:

pip install --upgrade jina

georgeamccarthy commented 3 years ago

Great find! TODO :)