Pre-load some spatial datasets

adkinsrs commented 2 days ago

Sure, we could do this after the uploader step is created (#892), but I feel that it would be better to just pre-load some spatial datasets our own way. One reason is that we can test the other tools being developed (#890) without having the uploader as a blocker. Another reason is that we would already have established the ready-to-go format and stored file structure that the uploader should write to.

adkinsrs commented 2 days ago

Spoke with @jorvis about using Google Filestore as a test space, since we previously discussed having to move our datasets off the VM for performance reasons. Google Buckets would also work but Filestore would be easier to integrate with our current code bases that use filepaths.

Google Filestore overview -> https://cloud.google.com/filestore/docs/overview

adkinsrs commented 2 days ago

Haven't provisioned anything yet, but here's what I'm thinking for the Filestore instance. Prices for us-east1 are actually about $50 higher than what I listed, because the examples cite us-central1.

Zonal (us-east1-b) with an initial 1 TB capacity. It is restricted to that one zone, but our gEAR VM is in that zone. Zonal pricing is much cheaper than regional ($256/month min per TB vs. $461).
There is a "basic" option ($164/month per TB for HDD and $768 for SSD), but I don't think either is a good option. The HDD option is static (no room for growth), and the SSD option allows for capacity growth but at a much higher cost than the zonal/regional options. Therefore I would not recommend "basic".
Both zonal and regional auto-grow or auto-shrink in 0.25 TB increments with a minimum of 1 TB. However, you have to choose between a 1 TB-9.75 TB capacity range or a 10 TB-100 TB range (same rate but 2.5 TB scaling increments). So if we chose the lower-end range and later reached 10 TB of data, we would have to create a new Filestore instance, as the choice is permanent. Again, zonal is cheaper and would align well with our zonal gEAR VMs.
The other options, like mount point name and network connections, are things we can control and should not have much bearing on costs.

So in a way, I don't think this would be terribly cost-efficient until we actually have data on the Filestore to use. It would be a pretty inefficient use of resources to pay roughly $260/month for me to test one spatial dataset until things were working and we could add more.
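A rough sketch of the capacity math described above, in case it helps with estimating. It assumes the figures quoted in this comment (1 TB minimum, 0.25 TB increments, 9.75 TB ceiling for the lower range, $256 per TB per month from the us-central1 examples). The function name and defaults are made up for illustration and are not part of any Google API.

```python
import math

def filestore_zonal_estimate(data_tb, price_per_tb=256.0,
                             min_tb=1.0, step_tb=0.25, max_tb=9.75):
    """Estimate provisioned capacity (TB) and monthly cost (USD).

    All defaults are assumptions taken from the figures quoted in this
    thread (us-central1 example pricing, lower capacity range).
    """
    if data_tb > max_tb:
        # The lower range tops out here; going beyond it would mean
        # creating a new instance in the 10 TB-100 TB range.
        raise ValueError(f"{data_tb} TB exceeds the {max_tb} TB ceiling")
    # Round up to the next scaling increment, then apply the minimum.
    rounded = math.ceil(data_tb / step_tb) * step_tb
    provisioned = max(min_tb, rounded)
    return provisioned, provisioned * price_per_tb

# One small test dataset still pays for the 1 TB minimum.
print(filestore_zonal_estimate(0.1))
# 1.3 TB of data rounds up to the next 0.25 TB increment.
print(filestore_zonal_estimate(1.3))
```

The first call illustrates the concern above: a single test dataset is billed the same as a full terabyte.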

More info -> https://cloud.google.com/filestore/docs/service-tiers

adkinsrs commented 2 days ago

The other option would be to use one of Google's block-storage systems instead of their file-storage system (which I described above). I read up on the differences, and it seems the biggest difference is that with file storage the management occurs on Google's side, whereas with block storage you receive the disk block and then configure and manage the filesystem yourself (on the server).

Block storage (like Hyperdisk) also seems much cheaper than the file-storage options I quoted above. 1 TB of Hyperdisk Balanced provisioned space is $90/month. I believe you mount the disk to the VM just like in the other cases, but I just need to read up more to get a feel for the flow of things.

https://cloud.google.com/compute/disks-image-pricing#disk

Found this cool flowchart that may answer some questions as well -> https://cloud.google.com/static/architecture/images/storage-advisor.svg

Based on the flowchart, it seems like Filestore-zonal would be the best candidate, but I wouldn't rule out Persistent Disk-Zonal or Hyperdisk Balanced due to potentially better costs, if the integration and flow are right.

adkinsrs commented 20 hours ago

Using a traditional Google bucket (object storage) is also an option and is even cheaper (~$20/TB/month). We would have to use FUSE to mount the bucket to our VMs though -> https://cloud.google.com/storage/docs/gcsfuse-mount
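To put the numbers quoted in this thread side by side, here is a quick comparison sketch. The prices are only the per-TB monthly figures mentioned in the comments above (the Filestore ones come from us-central1 examples), and the calculation deliberately ignores capacity minimums, scaling increments, and any request or network charges, so treat it as a ballpark.

```python
# Per-TB monthly prices (USD) as quoted in this thread; these are
# assumptions for illustration, not authoritative pricing.
PRICE_PER_TB_MONTH = {
    "Filestore basic HDD": 164,
    "Filestore zonal": 256,
    "Filestore regional": 461,
    "Filestore basic SSD": 768,
    "Hyperdisk Balanced": 90,
    "Cloud Storage bucket": 20,
}

def monthly_costs(capacity_tb):
    """Return {option: monthly cost} sorted from cheapest to priciest."""
    costs = {name: price * capacity_tb
             for name, price in PRICE_PER_TB_MONTH.items()}
    return dict(sorted(costs.items(), key=lambda item: item[1]))

for name, cost in monthly_costs(1).items():
    print(f"{name:<22} ${cost:>8.2f}/month")
```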

I think from a strict requirements perspective, we do not NEED file-based access with respect to datasets. Generally, with the exception of saved analyses, all h5ads are stored in a flat location. But performance would take a hit, so we would need to enable caching to ensure that reading data stays fast.
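One way the caching idea could look at the application level, as a minimal sketch: copy an h5ad from the mounted bucket to local disk the first time it is requested, and read the local copy afterwards. The function name, the `<dataset_id>.h5ad` naming, and both directory arguments are hypothetical and not taken from the gEAR code base; cache eviction and invalidation are left out entirely.

```python
import shutil
from pathlib import Path

def cached_h5ad_path(dataset_id, mount_root, cache_root):
    """Return a local path for a dataset, copying it on first access.

    Assumes a flat layout of <dataset_id>.h5ad files under mount_root
    (hypothetical naming), where mount_root is the mounted bucket.
    """
    source = Path(mount_root) / f"{dataset_id}.h5ad"
    cached = Path(cache_root) / source.name
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        # Copy to a temporary name first so a partially copied file
        # is never mistaken for a complete cache entry.
        tmp = cached.with_name(cached.name + ".part")
        shutil.copy2(source, tmp)
        tmp.replace(cached)
    return cached
```

The first read of each dataset still pays the full cost of pulling it through the mount; only repeat reads benefit.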

IGS / gEAR

Pre-load some spatial datasets #894