data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0
229 stars 82 forks source link

SageMaker Studio long start-up time #352

Closed atzev closed 1 year ago

atzev commented 1 year ago

Describe the bug

As a user, when I open SageMaker Studio from the data.all portal I have to wait several minutes before I can start using Studio.

How to Reproduce

*P.S. Please do not attach files as it's considered a security risk. Add code snippets directly in the message body as much as possible.*

New user:

  1. Log in to data.all with a user who has access to create SageMaker Studio profiles
  2. Go to ML Studio and create new ML Studio profile
  3. Click on launch JupyterLab
  4. Amazon SageMaker Studio creates the app
  5. Studio starts loading and displays "Loading... the loading screen is taking a long time. Would you like to clear the workspace or keep waiting?"
  6. The user has to click Keep Waiting or clear workspace for 2-3 times until Studio opens probably 10 or more minutes later

Returning user:

  1. Log in to data.all with a user who has access to SageMaker Studio profiles
  2. Go to ML Studio and click on launch JupyterLab
  3. Studio starts loading and displays "Loading... the loading screen is taking a long time. Would you like to clear the workspace or keep waiting?"
  4. The user has to click Keep Waiting or clear workspace for 2-3 times until Studio opens probably 10 or more minutes later

Expected behavior

For existing users, Studio load within seconds after the use clicks to launch Jupyter Lab icon. For new users, Studio should launch as soon as the app is created.

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.8

AWS data.all version

1.4.1

Additional context

No response

dlpzx commented 1 year ago

Hi @atzev, I can confirm this behavior. I will pass this information internally to see if we can pick up this task and improve the user experience when using ML Studio from data.all

tasneemfazl commented 1 year ago

Hello Team, do we have any fix or workaround for this issue?

dlpzx commented 1 year ago

A colleague has done some investigation and has a explanation for the issue. Basically in the CDK code, the SageMaker Domain is defined with “VPCOnly” together with the default VPC (which has only public subnets): https://github.com/awslabs/aws-dataall/blob/main/backend/dataall/cdkproxy/stacks/environment.py#L581 SageMaker Domain doesn’t support this (VPCOnly + public subnet): Doc link stating the requirements (no 1.) that when using VPCOnly mode the Subnets must be private All the subnets available in the chosen VPC are used to create the Domain (might include a mix of public, private and isolated - probably we just want private and isolated and fail the deployment otherwise if the list is empty) https://github.com/awslabs/aws-dataall/blob/main/backend/dataall/cdkproxy/stacks/environment.py#L555 In other words, it assumes that the environment account has a default VPC with Private Subnets. From my tests (I created 2 environment accounts from scratch in Isengard) the default VPC does not have private subnets by default, so you have to manually modify the VPC to create them.

A snippet of code that would work is to create the right VPC for SageMaker Studio with VPCOnly mode:

# LINE 553
            # try: # COMMENTING THIS PIECE WHERE WE DO NOT WANT TO USE A DEFAULT VPC
            #     default_vpc = ec2.Vpc.from_lookup(self, 'VPCStudio', is_default=True)
            #     vpc_id = default_vpc.vpc_id
            #     subnet_ids = [private_subnet.subnet_id for private_subnet in default_vpc.private_subnets]
            #     subnet_ids += [public_subnet.subnet_id for public_subnet in default_vpc.public_subnets]
            #     subnet_ids += [isolated_subnet.subnet_id for isolated_subnet in default_vpc.isolated_subnets]
            # except Exception as e:
            #     logger.error(f"Default VPC not found, Exception: {e}. If you don't own a default VPC, modify the networking configuration, or disable ML Studio upon environment creation.")

            # Create VPC with 3 Public Subnets and 3 Private subnets wit NAT Gateways
            log_group = logs.LogGroup(
                self,
                f'SageMakerStudio{self._environment.name}',
                log_group_name=f'/{self._environment.resourcePrefix}/{self._environment.name}/vpc/sagemakerstudio',
                retention=logs.RetentionDays.ONE_MONTH,
                removal_policy=RemovalPolicy.DESTROY,
            )
            vpc_flow_role = iam.Role(
                self, 'FlowLog',
                assumed_by=iam.ServicePrincipal('vpc-flow-logs.amazonaws.com')
            )
            vpc = ec2.Vpc(
                self,
                "VPC",
                max_azs=3,
                cidr="10.10.0.0/16",
                subnet_configuration=[
                    ec2.SubnetConfiguration(
                        subnet_type=ec2.SubnetType.PUBLIC,
                        name="Public",
                        cidr_mask=24
                    ),
                    ec2.SubnetConfiguration(
                        subnet_type=ec2.SubnetType.PRIVATE_WITH_NAT,
                        name="Private",
                        cidr_mask=24
                    ),
                ],
            )
            ec2.FlowLog(
                self, "StudioVPCFlowLog",
                resource_type=ec2.FlowLogResourceType.from_vpc(vpc),
                destination=ec2.FlowLogDestination.to_cloud_watch_logs(log_group, vpc_flow_role)
            )

            vpc_id = vpc.vpc_id
            subnet_ids = [private_subnet.subnet_id for private_subnet in vpc.private_subnets]
            sg = vpc.vpc_default_security_group
            sagemaker_domain = sagemaker.CfnDomain( # this is the same as before from here to the end
dlpzx commented 1 year ago

Fixed in #409