det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

Extend and renew XSEDE project #41

Closed zonca closed 3 years ago

zonca commented 4 years ago

@pibion the allocation ends on October 27th. I suggest you first ask for a 6-month extension, see https://portal.xsede.org/allocations/policies#356 (in the text, also mention that you'd like to extend ECSS). You can check how many hours are left on the allocation and decide how much of a supplement we might need, if any; we expect to keep using hours at the same rate as in the last 3-4 months.

Then, in December, you could apply for a renewal (which starts in April): https://portal.xsede.org/allocations/research#xracquarterly

pibion commented 3 years ago

Okay, I've requested an extension (for Jetstream and ECSS).

It looks like we will need to request a supplement: the allocation is down to 6%. I'm trying to see how I can get info on the consumption rate.

zonca commented 3 years ago

See resource usage here: https://www.github.com/det-lab/jupyterhub-deploy-kubernetes-jetstream/tree/master/DEPLOY.md

I think we are using ~20k SU/month now with 1 medium and 1 xlarge
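A rough cross-check of the ~20k SU/month figure, as a sketch only. It assumes Jetstream charges 1 SU per vCPU-hour and that the two instances are `m1.medium` (6 vCPUs) and `m1.xlarge` (24 vCPUs); neither assumption is stated in this thread, so verify both against the current Jetstream documentation.

```python
# Assumed vCPU counts per flavor (not taken from this thread).
VCPUS_PER_FLAVOR = {"m1.medium": 6, "m1.xlarge": 24}

# Assumed billing rule: 1 SU per vCPU-hour, instances running all month.
HOURS_PER_MONTH = 24 * 30


def monthly_su(flavors):
    """Return the SUs consumed in a 30-day month by always-on instances."""
    total_vcpus = sum(VCPUS_PER_FLAVOR[flavor] for flavor in flavors)
    return total_vcpus * HOURS_PER_MONTH


print(monthly_su(["m1.medium", "m1.xlarge"]))  # 21600
```

Under those assumptions the result lands close to the ~20k SU/month estimate.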

pibion commented 3 years ago

Okay, our extension request has been approved!

With ~20k SU/month, does a supplement of 120k SU (for the next six months) sound reasonable?

pibion commented 3 years ago

Ack, wait, nothing is showing up in my allocations now! Checking into it...

zonca commented 3 years ago

Yes right 120k would be enough
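The 120k figure is just the monthly burn rate times the length of the extension. A minimal sketch of that arithmetic follows; the function name and the optional `margin` parameter are hypothetical, added only for illustration.

```python
def supplement_needed(monthly_su, months, margin=0.0):
    """Estimate SUs to request: burn rate times duration, plus an optional safety margin."""
    return round(monthly_su * months * (1.0 + margin))


# ~20k SU/month over the six-month extension, no margin.
print(supplement_needed(20_000, 6))  # 120000
```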

zonca commented 3 years ago

@pibion I'll need to update the workplan for the extension. I will mostly focus on getting dask working and supporting your testing. Anything else we should mention?

pibion commented 3 years ago

@zonca I'm working on the supplement request, and there's a question about how many virtual machines and how many IPs we'll need. Does it make sense to put zero here, or would you recommend putting the number we're using now (4?) to continue those resources?

pibion commented 3 years ago

@zonca Dask and testing support sounds complete to me. In particular I'm expecting we'll be testing the shared data volume a bit in the upcoming months. We do sometimes see slow startup; I'm not sure if that's worth calling out explicitly. When it's really bad we make tickets, but I'm expecting to find out how much of an issue it is once more people are using the system.

A few students and I have been working on getting data into our catalog, which is what we need for people to be able to start doing routine analysis tasks at XSEDE. I don't know if that's useful for your report, but we are expecting a slow but steady increase in test users.

zonca commented 3 years ago

@pibion, most of the dask deployment is done, so we can focus more on usability and performance.

Here is what I wrote in the extended workplan:

Extension: We had already defined 2 stretch goals in the original workplan.

We already benchmarked the object store and realized that adapting all their software to rely on it would be too cumbersome, so we closed that line of work.

The dask deployment instead seems interesting. I had an initial test deployment that wasn't working, so I'll focus on fixing it and helping them make use of it.

With the dask deployment the system will be complete, so we can work on performance and usability. Usability is mostly small tasks, for example automating the maintenance of the available kernels, installing a new editor, or improving cluster monitoring. Performance instead is deeper and might involve redesigning some pieces of the architecture. For example, I am a bit worried that NFS could be a bottleneck, and we should try the CVMFS plugin for kubernetes, https://gitlab.cern.ch/cloud-infrastructure/cvmfs-csi, which I decided not to use due to lack of documentation, but I could give it a try now that I understand CVMFS better.

pibion commented 3 years ago

@zonca your extended workplan looks excellent to me.

The progress report I wrote for the extension is at https://docs.google.com/document/d/1AXvxwEAD2VbO4oCBuMJcGGzgqSdQPxBrbqywHCySMUw/edit?usp=sharing.

Should I include Extended Collaboration Support and storage with the supplement request?

zonca commented 3 years ago

Better to specify that we want to keep the same amount of storage we had before. ECSS is already active in the extension, so there's no need to add it to the supplement as well.

pibion commented 3 years ago

@zonca thanks! I've submitted the supplement request.

I'll keep a lookout for the opening of the science allocations in December.

pibion commented 3 years ago

Eeee I just got a notification we're out of allocation and that compute will be suspended.

I did get the supplement request in yesterday, so hopefully that will get approved before too long.

zonca commented 3 years ago

@pibion better if you also open a help ticket and explain the situation, just to make sure

pibion commented 3 years ago

@zonca the supplement has been approved.

I'm still getting a "service unavailable" when I try to access supercdms.jetstream-cloud.org. I emailed XSEDE support and they seemed to say that our resource wouldn't be de-allocated, just that no more computation would be allowed. So that seems consistent with not being able to spin up a resource.

My main concern is that there's a user (@mbaiocchi) who may have work that's only on XSEDE. I have several undergraduates who've also been working on XSEDE, but I work with them directly to version and push all their code.

I'm wondering if there's a way to access the existing storage volumes, and I'm also wondering if there's a way to set a permanent backup with e.g. the Open Storage Network (CDMS has an allocation there now). Adding @glass-ships and @thathayhaykid as they might be interested in thinking about this.

zonca commented 3 years ago

It might be unrelated: I can see all the volumes. It seems docker is stuck on the head node, so I'm rebooting the machine.

zonca commented 3 years ago

ok, it should be fixed now. moved issue about backup to #44

pibion commented 3 years ago

@zonca excellent, thanks! I know I said this before, but I'll plan on submitting a science allocation request for these resources in December per your suggestion. The review for the supplement was positive and recommended moving towards a science allocation.

zonca commented 3 years ago

@pibion submissions opened last week, deadline Jan 15th:

XSEDE Research Allocation Requests: Open Submission available until January 15, 2021 for awards starting April 1, 2021

XSEDE is now accepting Research Allocation Requests for the allocation period, April 1, 2021 to March 31, 2022. The submission period is from December 15, 2020 thru January 15, 2021. The XRAC panel will convene March 8, 2021 with notifications being sent by March 15, 2021. Please review the new XSEDE systems and important policy changes (see below) before you submit your allocation request through the XSEDE User Portal.

https://portal.xsede.org/allocations/research

pibion commented 3 years ago

@zonca excellent, I've got a shell at https://www.overleaf.com/3955184126cjbtcqmdjjzd (to see it you have to be logged into Overleaf).

I'm hoping to have a rough draft done some time next week. I think my dashboard will have the information I need to do the resource request justification.

One question I do have is whether it might make sense to put in another supplement request in the interim - only 15% of the allocation is left. I think this is because we have a few new users consistently working on the platform, which is great!

zonca commented 3 years ago

@pibion can you please check if usage is consistent with my estimate of 20K SU/month?

pibion commented 3 years ago

@zonca looking into it. The summary seems to aggregate the original allocation and the supplement, so it's not immediately clear how to get an estimate of use over e.g. the last two months.

They let you download CSV data (perfect!) but it's not totally clear to me how it should be interpreted. I've opened up a ticket, I suspect I'm missing some documentation somewhere.
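For reference, a sketch of how a per-month usage estimate could be pulled out of a CSV export like this. The column names `date` and `su_charged` and the sample rows are hypothetical; the real XSEDE export may use a different layout, which is exactly what the ticket is meant to clarify.

```python
import csv
import io
from collections import defaultdict


def su_per_month(csv_text, date_col="date", su_col="su_charged"):
    """Sum charged SUs per calendar month from CSV text with ISO dates (YYYY-MM-DD)."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        month = row[date_col][:7]  # keep only "YYYY-MM"
        totals[month] += float(row[su_col])
    return dict(totals)


# Made-up sample data, only to show the expected shape of the input.
sample = """date,su_charged
2020-11-03,700.0
2020-11-20,650.5
2020-12-01,720.0
"""

print(su_per_month(sample))  # {'2020-11': 1350.5, '2020-12': 720.0}
```

Averaging the last two monthly totals would give a burn rate to compare against the 20K SU/month estimate.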

pibion commented 3 years ago

Gah, well, a response from XSEDE support suggests that none of the numbers I'm seeing on the XSEDE user portal should be trusted (this is possibly restricted to Jetstream?).

His internal numbers show that this project is actually overdrawn on SUs. I've asked if there's any way to access these numbers myself. After a slightly-more-thorough search, I still can't find any XSEDE documentation about checking SU usage.

But I think that his numbers are likely the correct ones, because when I try to access the XSEDE jupyterhub I get an error like, "0/2 nodes available."

zonca commented 3 years ago

@pibion nodes are up now.

Yes, please also ask for a supplement to get us to April. As we don't have any better estimate, let's add a bit of margin, so I would ask:

so a supplement of 113K

zonca commented 3 years ago

@pibion Jeremy confirmed my calculations. For the supplement, I suggest asking for 113K.

For the renewal:

pibion commented 3 years ago

@zonca okay, I've submitted the supplement request. Will get going on the Science Allocation request tomorrow!

pibion commented 3 years ago

@zonca great news, the supplement has been speedily approved. Working on the Science Allocation request today.

pibion commented 3 years ago

@zonca I've finally gotten something useful down for the allocation request. If you have any time to glance at Section 3, I'd be happy for any comments.

zonca commented 3 years ago

sure, I activated change tracking

zonca commented 3 years ago

ok done @pibion, it looks good to me

pibion commented 3 years ago

@zonca okay awesome, thank you! Some of the resource justification stuff is repeated in a few places but I'd rather have things easy to find. I'm going to go through the evaluation checklist one more time, I'll let you know once I've submitted.

pibion commented 3 years ago

Submitted!

zonca commented 3 years ago

We should get the result of the proposal next week.

pibion commented 3 years ago

@zonca our proposal was approved!

zonca commented 3 years ago

excellent! the reviews were also good. Soon I'll write a final report for this year and we will start to think about a workplan for next year.

pibion commented 3 years ago

Sounds excellent to me :)

The email said that someone from ECSS would contact me to set up a time to meet for workplans, but would it be okay if I just reached out myself to set it up?

zonca commented 3 years ago

Someone from ECSS should be me ...