data-dot-all / dataall

A modern data marketplace that makes collaboration among diverse users (like business, analysts and engineers) easier, increasing efficiency and agility in data projects on AWS.
https://data-dot-all.github.io/dataall/
Apache License 2.0

Improve experience for potential users and contributors #1471

Open zsaltys opened 2 months ago

zsaltys commented 2 months ago

On August 7th I tried to set up data.all in a local docker environment from scratch. Mind you, I already have a lot of experience with data.all, so my setup went much more smoothly than it would have for someone who has never seen or worked with data.all. Even so, it took me more than half a day to get data.all running in docker with two environments linked and a share set up. Below I outline all of the issues I ran into that need to be resolved.

  1. The deploy local page https://data-dot-all.github.io/dataall/deploy-locally/ suggests: git clone https://github.com/data-dot-all/dataall.git --branch v2.6.0, but there is no such branch. The instructions must be switched to use tags.
  2. The local deploy page should state which versions of yarn and npm are required.
  3. The local deploy page should mention that cdk is required and which version.
  4. The local deploy page should provide OSX commands to install everything (python, npm, yarn, cdk, etc.) using brew. If you're a tech manager and you want to take a quick look at data.all, you should not need to spend time figuring out how to install yarn or cdk. Ideally we should provide a docker image with all the required dependencies so people can just try it out.
  5. The local deploy page suggests running export UID && docker-compose up first. This is wrong. data.all will not work properly if you haven't first set up SSM with all the parameters, configured AWS credentials, etc. The steps should first explain that you need AWS credentials and how to set up the SSM parameters.
  6. In the AWS credentials step we should explain which permissions are required. I think we should suggest administrative permissions because they will be needed to run CDK.
  7. The current version, 2.6.0, has a bug when checking the maintenance window status because a DB record is missing, which prevents graphql from starting. We need to make sure we test the docker deploy before every release.
  8. By default OSX runs the Airplay receiver on port 5000. My colleague spent an hour figuring this out. I had actually forgotten about this issue since I had disabled the Airplay receiver. We should either make this clear in the local deploy page or move graphql to a different port.
  9. docker-compose is configured to default to us-west-1 and, worst of all, it will override the region in your .aws/config! This is very bad and should never be done. The local deploy page needs to tell users to set their REGION properly, or at least let them know it will default to us-west-1.
  10. The deploy local page mentions something about linking environments, but it's very vague and not useful. There's a section on going further, but it talks about deploying to AWS rather than running in docker mode. We should expand the deploy local page with further guidance on how to create your organization and how to go all the way to setting up a share between a producer and a consumer for a LakeFormation table, so that users get the full data.all experience in 30 minutes.
  11. The link an environment page is very unfriendly in local docker mode. First of all, for some reason we default to manual pivot role creation mode. This is wrong and we should default to auto.
  12. The custom CDK policy download button does not work in local mode. This is unfriendly to new users, who don't know where to find this file. The guide should very clearly suggest using ADMIN mode.
  13. Actually linking in local mode will not work because the pivotRole has a constraint on "aws:PrincipalArn" in its trust policy. This needs to be relaxed in local dev mode.
  14. data.all is very slow in local docker mode and may leave a bad impression on interested new users. Loading the environments page takes 5+ seconds.
  15. The local deploy guide should suggest disabling some features during linking, such as SageMaker, in case they cannot work without additional configuration such as VPCs.
  16. data.all currently has a bug in the share processor in local mode, which @noah-paige has a fix for.
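
To make points 2, 3 and 8 concrete, here is a rough sketch of a preflight check the local deploy page could offer before docker-compose up. It is not part of data.all: the function names (missing_tools, port_in_use, preflight) are made up for illustration, the tool list is my guess at what the page assumes (python3 as the binary name is an assumption), and the port check relies on bash's /dev/tcp feature, which may not be available in every shell.

```shell
#!/usr/bin/env bash
# Hypothetical preflight sketch for a local data.all docker deployment.
# The tool names and port 5000 come from the list above; none of these
# functions exist in the data.all repository.

# Print each named tool that is not found on PATH, one per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Succeed if something accepts TCP connections on the given local port.
# Relies on bash's /dev/tcp; in shells without it this always fails,
# i.e. the port is reported as free.
port_in_use() {
  (exec 3<>"/dev/tcp/127.0.0.1/$1") 2>/dev/null
}

# Run both checks and return non-zero if anything looks wrong.
preflight() {
  local status=0 missing
  # Assumed tool list: adjust to whatever the deploy page ends up requiring.
  missing=$(missing_tools git docker-compose python3 npm yarn cdk)
  if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing tools:" $missing
    status=1
  fi
  if port_in_use 5000; then
    echo "Port 5000 is already in use (on OSX this is often the Airplay receiver)"
    status=1
  fi
  return "$status"
}
```

A user would source this and run preflight && docker-compose up, so a missing yarn or a busy port 5000 shows up in seconds rather than after an hour of debugging. It does not check tool versions, AWS credentials, or SSM parameters, which would still need to be documented separately.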

Beyond listing every single issue, I think the overall goal is this: if I asked an IT graduate who is seeing data.all for the first time to run it and try it out, I would want them to get to a fully working share in 30 minutes without having to google a single thing. They should not need to open any web page other than the local documentation we offer, plus their terminal window. We should also make sure they can set up a share using a single AWS account.

noah-paige commented 2 months ago

Note - Item 16 here relates to PR https://github.com/data-dot-all/dataall/pull/1470

TejasRGitHub commented 2 months ago

Adding another item to the list - https://github.com/data-dot-all/dataall/issues/1479