alan-turing-institute / bridge-data-platform

Repository that manages the Kubernetes JupyterHub deployment that hosts the 3D bridge data platform
MIT License

How do we get the Bridge data into the Hub? #12

Open sgibson91 opened 4 years ago

sgibson91 commented 4 years ago

Summary

We need a mechanism for pulling the Bridge data into the Hub (once #10 has been resolved). I've heard talk from Autodesk of webhooks, so one option is a cron job that triggers them at regular intervals.

I emailed Alex Tessier asking him to put me in touch with the relevant people in his team to help set this up. He responded saying that they were close to finalising the packaging of the Time Analysis pipeline and would report back to me by the end of the week.
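A rough sketch of what such a cron-driven pull could look like, assuming Autodesk expose an HTTP endpoint we can call (the endpoint, token and output path below are placeholders, not anything Autodesk have confirmed):

```python
"""Periodic Bridge data pull, intended to run as a Kubernetes CronJob.

Sketch only: the endpoint, token and output path are placeholders until
Autodesk confirm what they actually expose.
"""
import os
import pathlib

import requests

BRIDGE_DATA_URL = os.environ["BRIDGE_DATA_URL"]      # placeholder endpoint
BRIDGE_DATA_TOKEN = os.environ["BRIDGE_DATA_TOKEN"]  # injected as a k8s secret
OUTPUT_DIR = pathlib.Path(os.environ.get("OUTPUT_DIR", "/mnt/bridge-data"))


def pull_latest() -> pathlib.Path:
    """Fetch the latest data dump and write it to the shared volume."""
    response = requests.get(
        BRIDGE_DATA_URL,
        headers={"Authorization": f"Bearer {BRIDGE_DATA_TOKEN}"},
        timeout=300,
    )
    response.raise_for_status()

    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    target = OUTPUT_DIR / "latest.json"  # placeholder filename/format
    target.write_bytes(response.content)
    return target


if __name__ == "__main__":
    print(f"Wrote {pull_latest()}")
```

In the Kubernetes deployment this would run as a CronJob, with the endpoint and token mounted as secrets rather than hard-coded.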

sgibson91 commented 4 years ago

https://docs.microsoft.com/en-us/azure/aks/azure-nfs-volume

sgibson91 commented 4 years ago

https://docs.microsoft.com/en-us/azure/storage/files/storage-how-to-use-files-cli#use-your-azure-file-share

martintoreilly commented 4 years ago

Can you connect with @jemrobinson @thobson88 and @fedenanni about this? They are looking at how to securely access Azure storage from the Safe Haven. While they have additional requirements around restricting network security rules for connecting to storage accounts from safe haven secure research environments, the right answer for authentication and access control is likely to be similar.

We definitely shouldn't be using storage account keys directly for access. Kerberos authentication and Shared Access Signatures (SAS) both look like options that can be used for more fine-grained access control.
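For illustration, a hedged sketch of minting a short-lived, read-only SAS for an Azure Files share with the `azure-storage-file-share` SDK rather than handing out the account key itself (the account and share names are placeholders):

```python
"""Mint a short-lived, read-only SAS for an Azure Files share.

Sketch only: the account and share names are placeholders, and the signing
key should live in a secret store (e.g. Key Vault), never on user machines.
"""
from datetime import datetime, timedelta

from azure.storage.fileshare import ShareSasPermissions, generate_share_sas

sas_token = generate_share_sas(
    account_name="bridgedatastorage",   # placeholder
    share_name="bridge-data",           # placeholder
    account_key="<signing-key-from-key-vault>",
    permission=ShareSasPermissions(read=True, list=True),
    expiry=datetime.utcnow() + timedelta(hours=1),
)

# Clients then use the SAS instead of the account key, e.g.
# https://bridgedatastorage.file.core.windows.net/bridge-data?<sas_token>
print(sas_token)
```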

martintoreilly commented 4 years ago

You can use Azure Active Directory Domain Services (Azure AD DS) to access Azure Files via SMB using Kerberos authentication.

sgibson91 commented 4 years ago

Would that mean that all users would need an Azure account in the Turing tenancy or no?

martintoreilly commented 4 years ago

> Would that mean that all users would need an Azure account in the Turing tenancy or no?

Short answer: Yes, but this is also the requirement to access any Turing Azure resource now.

Long answer: I think all users will need an account in the same Active Directory, but not necessarily the Turing's AD. It would be possible to have a dedicated AD that all users have accounts on. In that case, I think users on other ADs, such as the Turing's, can be added to it as guests and use the credentials from their home AD.

sgibson91 commented 4 years ago

> Short answer: Yes, but this is also the requirement to access any Turing Azure resource now.

Can you elaborate on that? The JupyterHub is served on a public IP, access to which I was going to restrict with authentication and by controlling the source IP addresses the requests are made from (i.e. Turing IP only).

martintoreilly commented 4 years ago

I think the Kerberos SMB access is primarily suited to the scenario where users have login accounts on a VM that mounts the Azure Files storage.

For a data transfer job, I think we probably want to look at either (i) one of the Azure messaging services (Event Grid, Event Hubs or Service Bus) and/or (ii) Azure Function Apps.
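To illustrate the Function App route, a hedged sketch of a Python Azure Function that fires on an Event Grid "blob created" event from an ingress storage account and hands the new file off to a copy step (left as a stub, since we haven't decided where the data lands):

```python
# __init__.py of an Event Grid-triggered Azure Function (Python v1 model).
# Sketch only: the Event Grid subscription, storage account and copy step are
# all assumptions, not something we've set up. Deployment would also need a
# function.json with an "eventGridTrigger" binding named "event".
import logging

import azure.functions as func


def main(event: func.EventGridEvent) -> None:
    payload = event.get_json()
    # For Microsoft.Storage.BlobCreated events the payload includes the blob URL.
    blob_url = payload.get("url")
    logging.info("New data landed: %s (event %s)", blob_url, event.id)

    # TODO: copy blob_url into the Hub's storage, e.g. with
    # azure-storage-blob's BlobClient.start_copy_from_url or by queueing a job.
```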

sgibson91 commented 4 years ago

Will need to think about what putting the storage account inside the VNET means for data ingress. https://github.com/alan-turing-institute/bridge-data-platform/issues/10#issuecomment-597180117

sgibson91 commented 4 years ago

@jemrobinson can we have a meeting about data ingress at some point?

We'd need some kind of system where ingesting new data involves creating a new storage account, placing it inside the VNET, and minting a new SAS token. My understanding is that a similar process happens manually at the start of a DSG. Here, though, we'll be (close to) live-streaming the data at (maybe) a daily frequency, so the DSH model might break down at this point.
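Roughly what scripting that per-ingress flow could look like (a sketch over the az CLI; all resource names, the location and the expiry are placeholders, and the exact network-rule and SAS flags would need checking against our deployment):

```python
"""Sketch of the per-ingress flow above, driven via the az CLI.

All names and values are placeholders; the exact flags (especially around
network rules and SAS scoping) would need checking against our deployment.
"""
import subprocess


def az(*args: str) -> str:
    """Run an az CLI command and return its stdout."""
    result = subprocess.run(["az", *args], check=True, capture_output=True, text=True)
    return result.stdout.strip()


RG, LOCATION = "bridge-data-rg", "uksouth"       # placeholders
ACCOUNT = "bridgeingress20200401"                # placeholder: one account per ingress run
VNET, SUBNET = "bridge-vnet", "storage-subnet"   # placeholders

# 1. Create a fresh storage account for this ingress run.
az("storage", "account", "create", "--name", ACCOUNT,
   "--resource-group", RG, "--location", LOCATION, "--sku", "Standard_LRS")

# 2. Restrict it to the platform's VNET (the subnet needs the
#    Microsoft.Storage service endpoint enabled).
az("storage", "account", "network-rule", "add", "--resource-group", RG,
   "--account-name", ACCOUNT, "--vnet-name", VNET, "--subnet", SUBNET)
az("storage", "account", "update", "--name", ACCOUNT,
   "--resource-group", RG, "--default-action", "Deny")

# 3. Mint a short-lived, write-only SAS to hand to the transfer job.
sas = az("storage", "account", "generate-sas", "--account-name", ACCOUNT,
         "--services", "bf", "--resource-types", "co", "--permissions", "cw",
         "--expiry", "2020-04-02T00:00Z", "--output", "tsv")
print(sas)
```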

jemrobinson commented 4 years ago

Yes, but I think it would be helpful to have @warwick26 involved as well - he knows much more about the mechanics of what we currently do for ingress than I do.

sgibson91 commented 4 years ago

Meeting notes with @jemrobinson and @warwick26: https://hackmd.io/@sgibson91/bridge-data

sgibson91 commented 4 years ago

I've emailed Alex and the Autodesk team again hoping to get some info on what the process of pulling data across actually looks like

sgibson91 commented 4 years ago

> I've emailed Alex and the Autodesk team again hoping to get some info on what the process of pulling data across actually looks like

They have a Python API for this - I've asked to be pointed in the direction of it. I also mentioned caching options; they said it's not possible yet, but it's something they think would be useful and are hopefully working towards.
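Until caching exists on their side, a client-side cache on the Hub's shared volume might get most of the benefit. A sketch, where `fetch_dataset` is only a stand-in for whatever the Autodesk Python API actually exposes (we haven't seen its interface yet):

```python
"""Client-side cache for pulled datasets on the Hub's shared volume.

Sketch only: `fetch_dataset` is a stand-in for the (unknown) Autodesk Python
API; its name and signature are made up here.
"""
import hashlib
import pathlib

CACHE_DIR = pathlib.Path("/mnt/bridge-data/cache")  # placeholder mount point


def fetch_dataset(dataset_id: str) -> bytes:
    """Stand-in for the Autodesk API call that pulls a dataset."""
    raise NotImplementedError("replace with the real Autodesk API call")


def get_dataset(dataset_id: str) -> bytes:
    """Return a dataset, pulling from Autodesk only on a cache miss."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file = CACHE_DIR / hashlib.sha256(dataset_id.encode()).hexdigest()
    if cache_file.exists():
        return cache_file.read_bytes()
    data = fetch_dataset(dataset_id)
    cache_file.write_bytes(data)
    return data
```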

sgibson91 commented 4 years ago

Meeting notes from chat with Autodesk

sgibson91 commented 4 years ago

Potentially this is the way to go: https://jupyterhub-on-hadoop.readthedocs.io/en/latest/index.html