cajal / microns-nda-access

Provides instructions and Docker configurations for accessing the microns_phase3_nda schema and/or an isolated database container.
GNU General Public License v3.0

Docker image installation requires 300 GB disk space #14

Open PTRRupprecht opened 3 years ago

PTRRupprecht commented 3 years ago

Hi,

First of all, thanks for providing this interesting resource. I tried to load the downloaded tar file into Docker but was initially unable to. The problem was that, after downloading the 100 GB file, I had only 190 GB of disk space left on my system disk (Ubuntu). However, `docker load --input functional_data_database_container_image_v7.tar` transiently used an additional ~200 GB on my system disk, which surprised me. Maybe you can check this on your own system while the command is executing.
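If you want to reproduce the observation, a simple way to watch the transient usage while the load runs is something like the following (this assumes the default Docker data directory at /var/lib/docker):

```bash
# Report free space on the filesystem that holds Docker's data directory every 10 s
watch -n 10 df -h /var/lib/docker

# In another terminal, start the load and watch the usage climb and then drop back
docker load --input functional_data_database_container_image_v7.tar
```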

Since I did not have enough space on that drive, I relocated the Docker data directories to another hard disk with more free space, following these instructions: https://stackoverflow.com/questions/24309526/how-to-change-the-docker-image-installation-directory/49743270#49743270 With those changes in place, I loaded the Docker image successfully and was able to use it. I thought others might find this workaround useful, so perhaps you could add this information to your documentation once you can confirm it on your system: if the Docker data directory sits on the system disk, that disk apparently needs roughly 200 GB of free space. After loading, the image occupies only about 100 GB, but the loading process transiently used about 200 GB.
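For reference, a rough sketch of the relocation step as I understand it (assuming a systemd-based Ubuntu install; /mnt/bigdisk/docker is just a placeholder for any directory on the larger drive):

```bash
# Stop the daemon before touching its data directory
sudo systemctl stop docker

# Point Docker's storage at the larger disk
# (merge by hand if /etc/docker/daemon.json already exists with other settings)
sudo mkdir -p /mnt/bigdisk/docker
echo '{ "data-root": "/mnt/bigdisk/docker" }' | sudo tee /etc/docker/daemon.json

# Optionally carry over existing images/containers, then restart the daemon
sudo rsync -a /var/lib/docker/ /mnt/bigdisk/docker/
sudo systemctl start docker

# The ~100 GB image (plus the transient overhead) now lands on the larger disk
docker load --input functional_data_database_container_image_v7.tar
```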

Best wishes, Peter

Cpapa97 commented 3 years ago

That's also quite surprising to me. I have found that accessing the data in the container seemingly uses quite a bit of disk space (likely because the original image layers are read-only and Docker creates a writable container layer on top of them that holds all of the changes made in the running container), but I haven't noticed this issue when first loading the Docker image from the tar file.
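A quick way to see how much of the usage comes from the writable container layer versus the image itself, for anyone who wants to check on their own system:

```bash
# SIZE column = writable container layer; "virtual" = that layer plus the read-only image
docker ps --all --size

# Broader breakdown across images, containers, local volumes, and build cache
docker system df -v
```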

I will definitely need to test this out again in my Ubuntu environment to see if I run into this behavior as well, and then I'll update the Known Issues section of the README as you suggested.

The reason the data is stored directly in the image is to make it as simple as possible for users to load the image and have a running, working database with access to the data. However, given the potentially exorbitant disk space costs, maybe that approach isn't very compatible with how Docker handles container layers, and now, it seems, not even with the process of loading the image in the first place. Storing the data in the image also felt necessary because the original SQL dump has to be ingested before it is usable, and as finicky as that can be (with the large data packet sizes in this dataset), it's easier to distribute the image with the schema pre-ingested (though the SQL file is still distributed separately for those who want to do it themselves).
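For anyone attempting the self-ingest route, a rough sketch of what it involves (the container name, credentials, and filenames here are placeholders rather than what this repo documents, and the packet size is only an example of the kind of tuning needed; the server's max_allowed_packet has to be at least as large as the client's):

```bash
# Stream the separately distributed SQL dump into a running database container.
# All names and the password below are illustrative placeholders.
docker exec -i microns-db \
  mysql --user=root --password=example-password \
        --max-allowed-packet=1073741824 \
        microns_phase3_nda < microns_phase3_nda_dump.sql
```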

An alternative might be for me to ingest the SQL schema into the image but keep the SQL data mounted outside of the container, and then package it all together in a way that can be shipped, loaded, and started reasonably consistently by users. Portability might still be a concern with that solution, though.
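One way that alternative could look, as a sketch only (the image name, host path, and credentials below are placeholders, not what this repo actually distributes):

```bash
# Keep the MySQL data directory on the host so the writable container layer stays
# small and the data survives container removal
docker run -d \
  --name microns-db \
  -v /mnt/bigdisk/microns-mysql-data:/var/lib/mysql \
  -e MYSQL_ROOT_PASSWORD=example-password \
  -p 3306:3306 \
  mysql:8.0
```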

PTRRupprecht commented 3 years ago

Thanks for your reply. Providing such a complex data set in a user-friendly way is certainly a difficult problem. In any case, it worked for me in the end, so I cannot really complain :-)

Another option would be to provide a heavily reduced data set (e.g., only one scan and only 2,000 instead of 50,000 frames) in addition to the full data set, so that people can take a brief look. Many people, myself included, don't want to systematically analyze all the data but just want a quick impression of what it looks like. For that purpose, a very small subset of the data would be sufficient and would remove some of the hurdles; a filtered dump along the lines sketched below could be one way to produce it. But this is just a suggestion - it depends on how much time you can invest into popularizing the data set.
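For what it's worth, a filtered dump could look roughly like this (the table and column names are made up for illustration; the real microns_phase3_nda schema uses different ones):

```bash
# Export a small sample: one scan and its first 2,000 frames.
# "Frame", "scan_idx", and "frame_id" are hypothetical names, as is the password.
mysqldump --user=root --password=example-password \
  --where="scan_idx = 1 AND frame_id < 2000" \
  microns_phase3_nda Frame > microns_sample.sql
```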