IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

Pre-load some spatial datasets #894

Open adkinsrs opened 2 days ago

adkinsrs commented 2 days ago

Sure, we could do this after the uploader step is created (#892), but I feel that it would be better to just pre-load some spatial datasets our own way. One reason is that we can test the other tools being developed (#890) without having the uploader as a blocker. Another reason is that we would already have established the ready-to-go format and stored file structure that the uploader should write into.

adkinsrs commented 2 days ago

Spoke with @jorvis about using Google Filestore as a test space, since we previously discussed having to move our datasets off the VM for performance reasons. Google Buckets would also work but Filestore would be easier to integrate with our current code bases that use filepaths.
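
To illustrate why a mounted filesystem fits code that works with filepaths, here is a minimal sketch. It assumes datasets are h5ad files in a single directory; the mount points and the `dataset_path` helper are hypothetical and not taken from the gEAR codebase.

```python
from pathlib import Path

# Hypothetical mount points; neither path is taken from the gEAR codebase.
VM_DATASET_ROOT = Path("/var/www/datasets")
FILESTORE_DATASET_ROOT = Path("/mnt/filestore/datasets")


def dataset_path(dataset_id: str, root: Path) -> Path:
    """Build the path of a dataset's h5ad file under a given storage root."""
    return root / f"{dataset_id}.h5ad"


# Same calling code, different storage backend: only the root changes.
local = dataset_path("abc123", VM_DATASET_ROOT)
remote = dataset_path("abc123", FILESTORE_DATASET_ROOT)
print(local)
print(remote)
```

With a bucket accessed through its API instead of a mount, the calling code would presumably need to change rather than just the root.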

Google Filestore overview -> https://cloud.google.com/filestore/docs/overview

adkinsrs commented 2 days ago

Haven't provisioned anything yet, but here's what I'm thinking for the Filestore instance. Prices for us-east1 are actually about $50 more than what I listed, because the examples cite us-central1.

So in a way, I don't think this would be terribly cost-efficient until we actually put data onto the Filestore instance. It would be a pretty inefficient use of resources to pay $260/month for me to test one spatial dataset until things were working and we could add more.

More info -> https://cloud.google.com/filestore/docs/service-tiers

adkinsrs commented 2 days ago

The other option would be to use one of Google's block-storage systems instead of their file-storage system (which I described above). I read up on the differences, and it seems the biggest one is that with file storage, the filesystem management occurs on Google's side, whereas with block storage you receive the raw disk and then configure and manage the filesystem yourself (on the server).

Block storage (like Hyperdisk) also seems much cheaper than the file-storage options I quoted above. 1 TB of provisioned Hyperdisk Balanced space is $90/month. I believe you mount the disk to the VM just like in the other cases, but I need to read up more to get a feel for the flow of things.

https://cloud.google.com/compute/disks-image-pricing#disk

Found this cool flowchart that may also answer some questions as well -> https://cloud.google.com/static/architecture/images/storage-advisor.svg

Based on the flowchart, it seems like Filestore zonal would be the best candidate, but I wouldn't rule out zonal Persistent Disk or Hyperdisk Balanced due to potentially better costs, if the integration and flow are right.

adkinsrs commented 20 hours ago

Using traditional Google bucket (object) storage is also an option, and even cheaper (~$20 per TB per month). We would have to use Cloud Storage FUSE to mount the bucket to our VMs, though -> https://cloud.google.com/storage/docs/gcsfuse-mount
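
For a rough side-by-side, here is a small sketch using only the prices quoted in this thread. The linear per-TB scaling is an assumption; real bills also depend on things like provisioned performance, operations, and network charges. The capacity behind the $260/month Filestore figure is not stated here, so it is kept as a flat amount.

```python
# Monthly prices quoted in this thread (USD).
# The Filestore figure is a flat monthly amount; its capacity is not stated here.
FILESTORE_MONTHLY = 260
HYPERDISK_BALANCED_PER_TB = 90
BUCKET_PER_TB = 20  # approximate


def monthly_cost(per_tb_price: float, terabytes: float) -> float:
    """Linear estimate (an assumption): price per TB times number of TB."""
    return per_tb_price * terabytes


for tb in (1, 2):
    hyperdisk = monthly_cost(HYPERDISK_BALANCED_PER_TB, tb)
    bucket = monthly_cost(BUCKET_PER_TB, tb)
    print(f"{tb} TB: Hyperdisk ${hyperdisk:.0f}/month, bucket ${bucket:.0f}/month")

print(f"Filestore as quoted: ${FILESTORE_MONTHLY}/month")
```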

I think from a strict requirements perspective, we do not NEED file-based access with respect to datasets. Generally, with the exception of saved analyses, all h5ads are stored in a flat location. But performance would take a hit, so we would need to enable caching to ensure that reading data stays fast.
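
As a rough idea of what caching could look like at the application level, here is a minimal sketch that copies an h5ad from the (slower) mounted bucket to local disk on first access and reuses the local copy afterwards. The `cached_h5ad` helper and both directory arguments are hypothetical, and there is no invalidation or eviction. Cloud Storage FUSE documents its own caching options, which may make something like this unnecessary.

```python
import shutil
from pathlib import Path


def cached_h5ad(dataset_id: str, remote_root: Path, cache_root: Path) -> Path:
    """Return a local copy of <dataset_id>.h5ad, copying it on first access."""
    remote_file = remote_root / f"{dataset_id}.h5ad"
    local_file = cache_root / f"{dataset_id}.h5ad"

    if not local_file.exists():
        cache_root.mkdir(parents=True, exist_ok=True)
        # Copy to a temporary name first so a partial copy is never read.
        tmp_file = local_file.with_name(local_file.name + ".tmp")
        shutil.copyfile(remote_file, tmp_file)
        tmp_file.replace(local_file)

    return local_file
```

Callers would open the returned path as usual; only the first request for a dataset pays the cost of reading from the mounted bucket.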