berkeley-dsep-infra / datahub

JupyterHubs for use by Berkeley enrolled students
https://docs.datahub.berkeley.edu
BSD 3-Clause "New" or "Revised" License

Explore Google File Store as a replacement for NFS! #3898

Closed balajialg closed 1 year ago

balajialg commented 1 year ago

We have been facing NFS issues due to a race condition in the Linux kernel that is hard to troubleshoot and has resulted in a few outages in the past month. During our last team meeting, some of us were interested in exploring Google Filestore (GFS) as an alternative to self-managed NFS. @ericvd-ucb kindly agreed to do the outreach and used his contacts to start a conversation with the Google Filestore team about using GFS instead of NFS. GFS's point of contact outlined the following points in their response:

What you described is common when customers run NFS themselves. These consistency issues are hard to troubleshoot. We do have many customers running Filestore with multiple directories that in turn serve multiple users. The benefit of using a managed NFS solution like Filestore is that you don't have to manage NFS and simply get it out of the box. Filestore has multiple tiers (we recommended Filestore Enterprise to give you an HA solution by default), but you can also choose Basic if you like. You only pay for the storage you consume (as opposed to running NFS yourself, where you are probably consuming compute from VMs and storage from PDs). The one thing to watch out for with Filestore Enterprise (based on what we hear from other customers) is the entry point of 1 TiB. You can of course consume the space by placing the directories of multiple users in the same Filestore instance, driving up utilization. In case you want isolation between users, you can also use multishares that share the underlying Filestore instance and drive up utilization. Outside this specific concern of the minimum entry size (which you can work around with the solutions shared above), you get regional-backed storage and managed NFS, you pay only for the storage consumed, and many customers use it at scale.

We need to evaluate whether what they proposed above is something we are interested in exploring from a technical standpoint.

From my limited understanding, I looked at our billing report for October 2022 and found that their Enterprise tier (~$600 per month for 10 TiB) is on par with what we are spending for PD + snapshots (~$4,100 per month for 70 TiB). I am assuming I didn't miss anything in this calculation, but please correct me if my interpretation is wrong.
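
A quick back-of-the-envelope check of that comparison (a minimal sketch using only the approximate figures quoted above):

```python
# Rough per-TiB comparison using the approximate October 2022 figures above.
filestore_enterprise = 600 / 10    # ~$600/month for 10 TiB  -> $/TiB/month
pd_plus_snapshots = 4100 / 70      # ~$4,100/month for 70 TiB -> $/TiB/month

print(f"Filestore Enterprise: ~${filestore_enterprise:.0f}/TiB/month")  # ~$60
print(f"PD + snapshots:       ~${pd_plus_snapshots:.0f}/TiB/month")     # ~$59
```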


To Do

ryanlovett commented 1 year ago

Thanks for looking into this @balajialg !

Are there published reports about real-world use of Filestore and its reliability? Our nodes would still speak NFS to the Filestore, and there could still be buggy NFS client behavior. In such cases, there would be no way to debug from the Filestore side.

Can we monitor the Filestore with prometheus or is there some other method? (or is Filestore so reliable that we don't need to monitor it?)

Would everything be moved to Filestore, or would some nodepools move to Filestore while others would be kept on self-managed NFS?

Should performance be tested before and after a node is moved to Filestore?

@yuvipanda @felder Why did we migrate away from Filestore originally? Cost?

Recommended Linux client mount options

balajialg commented 1 year ago

This is all I could find through a Google search for case studies - https://cloud.google.com/filestore#section-5. Great questions - I will let the experts answer them and suggest the way forward. @ericvd-ucb Can we consolidate these questions and share the relevant ones with the Filestore folks?

@ryanlovett Do you think it is the best use of our DevOps time to set up a conversation with them, or should we do our preliminary investigation on our end before deciding whether we want to engage with them further?

ryanlovett commented 1 year ago

Do you think it is the best use of our DevOps time to set up a conversation with them? or just do our preliminary investigation at our end before deciding whether we want to engage with them further?

That's probably a question for @shaneknapp . :)

shaneknapp commented 1 year ago

yeah, let me look into this a bit over the next couple of days.

from a glance, it looks like it'll be a little (maybe?) more expensive but (hopefully?) more reliable.

i'm also curious how 2i2c does this.


shaneknapp commented 1 year ago

answered my own question re: 2i2c, NFS and filestore: https://infrastructure.2i2c.org/en/latest/howto/operate/manual-nfs-setup.html?highlight=filestore


yuvipanda commented 1 year ago

@shaneknapp https://github.com/2i2c-org/infrastructure/issues/764 has info on longer term fixes that are being investigated as well

shaneknapp commented 1 year ago

this is great, thanks yuvi!


felder commented 1 year ago

@ryanlovett yes I believe cost is the primary reason Google Filestore was not fully explored.

https://cloud.google.com/filestore/pricing vs https://cloud.google.com/compute/disks-image-pricing#disk

balajialg commented 1 year ago

Apparently, Google Filestore was used for the Data 8x hub, and the move to NFS happened due to pandemic-related cost cuts in March 2020. For more details, check out this issue - https://github.com/berkeley-dsep-infra/datahub/issues/1374

ryanlovett commented 1 year ago

@balajialg Looks like we moved a few hubs to Filestore in 2019.

Other commits: https://github.com/berkeley-dsep-infra/datahub/search?q=filestore&type=commits

balajialg commented 1 year ago

@ryanlovett awesome! It would be great if we had billing-related info from the period when those PRs were merged. I will work with @felder (if he has the time) to see if we can model costs for Filestore based on our current usage.

balajialg commented 1 year ago

@shaneknapp @felder Any suggestions on the way forward with the Filestore exploration? Is this something we want to a) pursue and b) if yes, prioritize for this semester? I was thinking we could get back to the Google Filestore PM about where we stand before the end of this week. If you all need more time, let me know.

shaneknapp commented 1 year ago

someone needs to investigate pricing... what we have vs same deployment on GFS.


balajialg commented 1 year ago

@shaneknapp I think we can work on this in parallel! Correct me if I am wrong - we want to evaluate whether the Filestore solution is a) desirable and b) feasible. I can probably work with @felder or anyone else who has the time to figure out the feasibility part from a cost perspective. The more important question is whether this solution is even desirable - whether we want to invest the effort to do a pilot. I think you and @felder are best positioned to guide us with this decision (with support from Ryan and Yuvi).

balajialg commented 1 year ago

Updating with the latest conversation with @felder about Google Filestore. We definitely want to do a pilot implementation of Filestore and, based on that experience, decide whether to transition all our hubs. @shaneknapp has also given a thumbs up on exploring Filestore. We need to figure out when to scope this work, which we can plan during the Sprint Planning Meeting in December.

https://docs.datahub.berkeley.edu/en/latest/admins/storage.html#nfs-client still states that we use Filestore for our Data 8x deployment. This needs to be corrected. I can spend some time updating this information ASAP.

balajialg commented 1 year ago

Yet another update here based on multiple discussions with the team. @felder and @shaneknapp will do a detailed analysis of Google Filestore and reconvene to discuss their learnings and the path forward. @ryanlovett is also doing his own research on Filestore, as he is considering moving at least the Stat 20 hub from the NFS server to Google Filestore to evaluate whether it resolves some of the NFS challenges. He also has a bunch of open questions that he wants the team to think about, which he will add to this GitHub thread. They are mostly around the information available in this doc - https://cloud.google.com/filestore/docs/creating-instances#instance_type

I have scoped the first half of our Sprint Planning Meeting on Dec 8th to discuss and decide the path forward regarding moving our hubs to Google Filestore.

ryanlovett commented 1 year ago

I'll discuss the Stat 20 aspect at the next meeting, but I definitely want to use Filestore for Spring '23. Some questions:

  1. Should there be per-hub instances, other aggregations like how hub disk/directories are configured now, or one big Filestore?
  2. What service level? Options are Basic HDD, Basic SSD, Enterprise, and High Scale. Basic HDD and Basic SSD are not as limited in terms of size, but Enterprise is 1-10TB and High Scale is 10-100TB. Our largest disk consumers are 7-9TB, so using Enterprise could be limiting if we preserve the current hub/volume mapping, and Enterprise is also 2x the cost of High Scale. However, High Scale only lets you resize in 2.5TB increments. That's about 10% of current utilization, so maybe those bumps aren't too painful in terms of headroom costs. Basic SSD seems very flexible, but is its performance sufficient?
  3. It isn't clear what the reliability differences are between service levels. I'm guessing they use the same NFS versions/implementations so there's probably nothing much to mention.
  4. We should extract and aggregate server R/W IOPS and R/W throughput from Prometheus. Currently we're seeing client figures. Then we can compare apples to apples against their service offerings (a client-side aggregation sketch follows after this list).
  5. Using Filestore is more expensive so overprovisioning is more painful. We should set reasonable defaults but we'll have to monitor and scale up as time goes on. How will this happen?
  6. Given the cost, we'll have to monitor usage for large consumers, and apply downward pressure. A disk usage policy is very important. Should scaling disk be automated?
  7. What do we do if our NFS clients emit high test_stateid ops even after switching? We would no longer be able to affect this on the server side. We could monitor clients, and choose a service level with a sufficiently high ops/s.
  8. Setting a timeline for switching is important. I'd want to have something in place by the first week of January. I'm fine with deploying this for just Stat 20 if that is too aggressive for the other hubs.
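
On the client-figures point in item 4, here is a minimal sketch of pulling aggregate client-side NFS throughput out of Prometheus so it can be lined up against Filestore's published per-tier limits. The endpoint URL is hypothetical, and the metric names assume node_exporter's mountstats collector is enabled; substitute whatever NFS client metrics our deployment actually exports.

```python
# Sketch: aggregate client-side NFS read/write throughput from Prometheus.
# Assumptions: PROM_URL is a placeholder, and the node_mountstats_nfs_*
# metrics exist because node_exporter's mountstats collector is enabled.
import requests

PROM_URL = "http://prometheus.example.edu:9090"  # hypothetical endpoint

QUERIES = {
    "read_MBps": "sum(rate(node_mountstats_nfs_read_bytes_total[5m])) / 1e6",
    "write_MBps": "sum(rate(node_mountstats_nfs_write_bytes_total[5m])) / 1e6",
}

for name, promql in QUERIES.items():
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else 0.0
    print(f"{name}: {value:.1f}")
```
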
ryanlovett commented 1 year ago

Regarding #3, @balajialg quoted the Google rep who said, "Filestore Enterprise [gives] you a HA solution by default." Per the service tiers doc, High Scale and Enterprise both support non-disruptive maintenance and remain available during maintenance events, while Basic HDD/SSD do not.

https://cloud.google.com/filestore/docs/service-tiers

Regarding #5, the Google rep said, "just pay for the storage consumed", so overprovisioning would not be painful.

felder commented 1 year ago

@ryanlovett @balajialg @shaneknapp

Looking at: https://cloud.google.com/filestore/docs/service-tiers https://cloud.google.com/filestore/pricing https://cloud.google.com/filestore/docs/backups https://cloud.google.com/filestore/docs/snapshots

  1. I'm leaning toward one volume per hub.
  2. High Scale has a 10TB minimum and no support for backups. Enterprise might be good, but it's also the most expensive tier at $0.60/GB/month. Enterprise is also limited to a max of 10TB per volume, which is going to be a problem if we're not more aggressive about limiting/managing storage. Another consideration is the maximum number of recommended clients; our busiest customers exceed the recommended limits for all of the tiers. Lastly, pay special attention to the data recovery options. "Snapshots" look quite undesirable for our use case, and "backups" are only available for the Basic tiers. The naming here is somewhat odd: "backups" seem to function more like persistent disk snapshots, while "snapshots" in this case do not. Of particular note, deleting a file captured in a "snapshot" does not free the space on the filesystem. I'd really like to talk to someone at Google about the ins and outs of these data recovery options.
  3. Enterprise has regional availability vs zonal for the other tiers, so I'd expect it to be more "available"
  4. True, but at the same time each tier has other features and costs which may be bigger factors in the decision. For example, having no data recovery options for the High Scale tier would make it a non-starter, IMO.
  5. Good question, I'm more concerned with monitoring and storage size limits than the cost of provisioning.
  6. Agreed, managing storage consumption due to server tier limits and cost is definitely going to be of increased importance. IMO this is already a major, but currently overlooked, consideration for the datahub service in general.
  7. Google support contract? I say that seriously because once we move to a managed service we are essentially handing over control of storage management (and debugging) to them.

@ryanlovett According to the pricing page, you pay for storage that is allocated (not just consumed). If the Google rep said otherwise, that appears to be in conflict with the docs.
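
For scale, a rough arithmetic sketch of one fully allocated Enterprise volume at the $0.60/GB/month rate quoted above (whether the pricing page means decimal GB or GiB is an assumption here):

```python
# Rough monthly cost of one fully allocated Enterprise volume (10 TB cap),
# assuming "GB" on the pricing page means decimal gigabytes.
enterprise_rate_per_gb = 0.60   # $/GB/month, Enterprise tier
full_volume_gb = 10_000         # 10 TB per-volume maximum

print(f"~${enterprise_rate_per_gb * full_volume_gb:,.0f}/month per full volume")  # ~$6,000

# For comparison, the PD + snapshot spend quoted earlier in this thread was
# roughly $4,100/month for ~70 TiB across all hubs.
```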

felder commented 1 year ago

There is also this service: https://cloud.google.com/filestore/docs/multishares

felder commented 1 year ago

Also came across this: https://cloud.google.com/community/tutorials/gke-filestore-dynamic-provisioning

Could this be used to provision per-student PVCs of a fixed size?
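
If that tutorial's approach pans out, a hedged sketch of what per-student provisioning could look like is below. The StorageClass name, namespace, and 10Gi quota are assumptions for illustration, not anything we've decided; the real class would come from however the Filestore CSI / NFS provisioner gets installed.

```python
# Hypothetical sketch: create a fixed-size, per-student PVC against a
# Filestore-backed StorageClass via dynamic provisioning. The StorageClass
# name ("filestore-multishare"), namespace, and quota are placeholders.
from kubernetes import client, config

def create_student_pvc(username: str, namespace: str = "datahub") -> None:
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    pvc = client.V1PersistentVolumeClaim(
        metadata=client.V1ObjectMeta(name=f"home-{username}"),
        spec=client.V1PersistentVolumeClaimSpec(
            access_modes=["ReadWriteMany"],
            storage_class_name="filestore-multishare",  # placeholder class
            resources=client.V1ResourceRequirements(
                requests={"storage": "10Gi"}  # fixed per-student quota (assumption)
            ),
        ),
    )
    client.CoreV1Api().create_namespaced_persistent_volume_claim(namespace, pvc)
```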

ericvd-ucb commented 1 year ago

We could potentially talk to GCP folks about some of Ryan's and Jon's questions above, if it's helpful - maybe on Friday?

balajialg commented 1 year ago

Next Steps from Sprint Planning Meeting:

Useful docs:

ryanlovett commented 1 year ago

More 2i2c links...

balajialg commented 1 year ago
shaneknapp commented 1 year ago

```
224M    a11y
821G    astro
4.4T    biology
32G     cee
868K    cs194
163G    data101
240G    data102
347G    dlab
1.4T    eecs
2.7G    highschool
12G     julia
30G     prob140
30M     shiny
1.4T    stat159
281G    stat20
440K    stat89a
17G     workshop
408K    xfsconfig
```

ryanlovett commented 1 year ago

@shaneknapp There's no stat89a deployment and it looks like it won't be taught next Spring either, so you can skip that one.

It might be possible to skip shiny as well if the R hub rebuild fixes shiny-related issues (potentially fixed by repo2docker update). It wasn't used much in Fall. (Cc @ericvd-ucb)

balajialg commented 1 year ago

Thanks, @shaneknapp for the detailed storage report. Super insightful.

@shaneknapp @felder @ryanlovett Have a few questions related to our strategy for filestore creation. The spirit of the below questions comes from how can we be a good steward of RTL's extra $5k per month grant for our cloud usage. None of the below points are relevant to our major hubs like Datahub, R hub, I School, Stat 20, Biology, EECS, Public Health, Data 8, and Data 100 hubs.

@shaneknapp @felder @ryanlovett I have a few questions related to our strategy for Filestore creation. The spirit of the questions below comes from how we can be good stewards of RTL's extra $5k per month grant for our cloud usage. None of the points below apply to our major hubs like Datahub, R hub, I School, Stat 20, Biology, EECS, Public Health, Data 8, and Data 100.

Mini Filestore: Does it make sense to have a shared filestore for all the small hubs (based on storage) like a11y, CEE, CS 194, High School, Julia, Prob 140, Stat 89a, Shiny, and Workshop?

No Filestore: Do we even need to create a shared filestore for hubs like a11y, Shiny, Julia, High School, and Workshop? They are not actively used and their usage is seasonal. Based on what I heard from CEE, D-Lab, and Econ 140 instructors, most users of these occasionally used hubs had a good experience with Datahub this semester.

Medium Filestore: I understand that we want to isolate all the major hubs from each other. How about a shared filestore for medium-storage hubs like Data 101, Data 102, and D-Lab, which each use fewer than 350 GB? What are the benefits and pitfalls of this approach? How risky would it be?

ryanlovett commented 1 year ago

I'll defer to @shaneknapp and @felder about conserving filestore spend.

IMO, hubs which have a lot of users and/or I/O activity should be on separate filestores regardless of how much space they're using. It is the I/O burden that we want to keep separate. I believe that on some storage tiers, larger filestores perform better; that would be one reason to commingle deployments on the same filestore. If we had more first-hand experience and data on performance and reliability, then I might change my mind.

shaneknapp commented 1 year ago

completed syncs: astro, biology, data100, eecs, stat159, stat20
currently syncing: data8, datahub, ischool
remaining to be synced: cee, data101, data102, dlab, prob140

shaneknapp commented 1 year ago

re conserving filestore spend. i'd much rather stick to the 1:1 ratio of course->hub->filestore. IO ops, compartmentalizing failures, etc.

shaneknapp commented 1 year ago

completed syncs: astro, biology, data100, eecs, stat159, stat20
currently syncing: data8, datahub, ischool
remaining to be synced: cee, data101, data102, dlab, prob140

this is done, and i scaled up a bunch of instances.

see also https://docs.google.com/spreadsheets/d/1rj-iCpcHBcA_lUT7NXrOJaTpT2Le4fcb5Tank8D0ICQ/edit?usp=sharing

balajialg commented 1 year ago

Adding information about the courses using the smaller hubs and the student enrollment count to help with decision-making related to filestore allocation.

Smaller hubs (No. of courses using it, Total count of students enrolled as part of these courses, periodicity of usage)

ryanlovett commented 1 year ago

shiny hub was necessary when there was an issue with RStudio/R/R Graphics API/Shiny on the R hub. Hopefully that will go away with various version toggling. Though it was created with a separate set of home directories, my opinion is that when it is backed by filestore it should use the same filestore and node pool as R hub. export/homedirs-other-2020-07-29/shiny need not be copied anywhere.

Also, shiny hub was not used very much. Only a few people logged into it.

balajialg commented 1 year ago

Agreed, @ryanlovett! Most of the above-mentioned hubs had very few people logging in during FA 22.

shaneknapp commented 1 year ago

https://github.com/berkeley-dsep-infra/datahub/pull/4072
https://github.com/berkeley-dsep-infra/datahub/pull/4073
https://github.com/berkeley-dsep-infra/datahub/pull/4075

all identified courses should be migrated to their own nodepool + GFS instance (except R hub, which shares infra w/datahub)

balajialg commented 1 year ago

@shaneknapp Closing this issue as you all have completed the pending tasks. Please feel free to reopen if there are any pending tasks to be tracked.