clulab / reach

Reach Biomedical Information Extraction

bioresources should be standard text files, version controlled, directly readable, etc. #743

Open · kwalcock opened this issue 3 years ago

kwalcock commented 3 years ago

There are probably good reasons that the bioresources are stored as gzip files, but maybe it's time to revisit that decision. It is incredibly difficult (for people spoiled by large hard drives, fast network connections, etc.) to do useful things with them, like seeing how they have changed over time or even just reading them. Only one of the files, uniprot-proteins.tsv, expands to a size larger than the 100 MB per-file limit that GitHub imposes. Although there are probably other repercussions, it's just a text file and could easily be split into two parts. If need be, there are ways to recreate the gzip files for deployment during the packaging process, as sketched below. The files we have in kb/ner aren't very large, so it seems like that shouldn't even be necessary. It would be so great if they were just there like all the other files.
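Compressing a plain-text resource back to `.gz` at packaging time needs nothing beyond the JDK. A minimal sketch (the `gzipFile` helper is hypothetical, and wiring it into the sbt packaging step is left out):

```scala
import java.io.{FileInputStream, FileOutputStream}
import java.util.zip.GZIPOutputStream

// Sketch: gzip a plain-text resource at build time, so the repo can
// keep the uncompressed, diff-able version. Path handling is illustrative.
def gzipFile(inPath: String): Unit = {
  val in = new FileInputStream(inPath)
  val out = new GZIPOutputStream(new FileOutputStream(inPath + ".gz"))
  try {
    val buffer = new Array[Byte](8192)
    var read = in.read(buffer)
    while (read >= 0) {
      out.write(buffer, 0, read)
      read = in.read(buffer)
    }
  } finally {
    in.close()
    out.close() // close() also finishes the gzip stream
  }
}
```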

bgyori commented 3 years ago

I've been dreaming about this for a long time! Viewing the files with vi has worked for me without decompressing them, but comparing versions for diffs is a real pain. I think the only issues are the file size limit and the fact that, at the level of interacting with the repo itself, things would get bulkier and a bit slower (if large diffs are carried around in the git history).

MihaiSurdeanu commented 3 years ago

This was my call at the time because of GitHub's file size limits. If we can uncompress the files and still push them, I am all in favor!

enoriega commented 3 years ago

I did a quick test: the repo's size is 778 MB with gzipped files and 992 MB without compression, which is barely under the 1 GB limit of the free tier. However, I am not sure whether pushing the unzipped files will accumulate sizes due to versioning or whether the quota is computed from the size of HEAD.

I will fork the repo and test it on my personal account.

Another option is to use GitHub's Large File Storage (LFS) and pay $5 a month; that gives us 50 GB of storage for the repo.

kwalcock commented 3 years ago

Ah, I wasn't aware of a per-repo limit. Are you sure? https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota and https://stackoverflow.com/questions/38768454/repository-size-limits-for-github-com mention other numbers. When I looked into the LFS possibility before, there was a troublesome data transfer limit to worry about. Even if there is enough space, moving the data back and forth might still be a problem.

enoriega commented 3 years ago

You're right, @kwalcock: it is a recommended size, not a hard limit. I did the test in my personal fork and uncompressed all the .gz files. One of them, uniprot-proteins.tsv, exceeds the 100 MB hard limit on per-file size. However, it can be split into multiple files and pushed that way; it worked well (a sketch of the split follows).
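A minimal sketch of such a split, cutting only on line boundaries so every part remains a valid TSV (the `splitTsv` helper, the 95 MB safety margin, and the numbered-suffix naming are all illustrative, not what bioresources actually does):

```scala
import java.io.PrintWriter
import scala.io.Source

// Sketch: split a large TSV into numbered parts, each kept safely
// under GitHub's 100 MB per-file limit, cutting only at line boundaries.
def splitTsv(inPath: String, maxBytes: Long = 95L * 1024 * 1024): Unit = {
  val source = Source.fromFile(inPath, "UTF-8")
  var part = 0
  var written = 0L
  var writer = new PrintWriter(s"$inPath.$part", "UTF-8")
  try {
    for (line <- source.getLines()) {
      val size = line.getBytes("UTF-8").length + 1 // +1 for the newline
      if (written + size > maxBytes) {
        writer.close()
        part += 1
        written = 0L
        writer = new PrintWriter(s"$inPath.$part", "UTF-8")
      }
      writer.write(line + "\n")
      written += size
    }
  } finally {
    writer.close()
    source.close()
  }
}
```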

Of course, this will require some refactoring of bioresources to account for the split (something like the loader sketch below), which shouldn't be too complicated ...
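On the read side, the parts can be stitched back into a single stream with the JDK's SequenceInputStream. A minimal sketch, assuming the numbered-suffix convention above (`openSplitResource` is a hypothetical helper, not existing bioresources API):

```scala
import java.io.{BufferedReader, InputStreamReader, SequenceInputStream}
import scala.collection.JavaConverters._

// Sketch: read the numbered parts of a split resource as one stream,
// so downstream code still sees a single logical TSV file.
def openSplitResource(base: String, parts: Int): BufferedReader = {
  val streams = (0 until parts).map { i =>
    getClass.getResourceAsStream(s"$base.$i")
  }
  val combined = new SequenceInputStream(streams.iterator.asJavaEnumeration)
  new BufferedReader(new InputStreamReader(combined, "UTF-8"))
}
```

Code that currently opens uniprot-proteins.tsv.gz through a GZIPInputStream would instead call something like this with the part count, and everything downstream stays the same.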