alan-turing-institute / data-safe-haven

https://data-safe-haven.readthedocs.io
BSD 3-Clause "New" or "Revised" License

'Normal' azure data science VM set up for Imperial challenge as well as the HD Insights cluster? #137

Closed cathiest closed 5 years ago

cathiest commented 5 years ago

@getcarter21 Is the additional open (i.e. non-DSG) Azure environment all set up with access to the LANL data as well, so that those who don't know how to use the HDInsight cluster can use a normal Azure data science VM?

Tagging @helen9344 as facilitator

(Extract of email chain below)

Hi Catherine

Fair enough - I hadn’t spotted that.

Your pragmatic suggestion of two environments (under the same resource group so they can both access the data in the storage) sounds sensible. I’m speaking with Ian C later to figure out the HDInsight side of things. I’ll also chat to him about the standard analysis cluster. Given the time that’s available, if the participants don’t know Big Data technologies, then it’s probably best that they focus on what they do know. This makes mine and Ian’s life considerably simpler.

Best wishes,

Mark


From: Catherine Lawrence clawrence@turing.ac.uk
Sent: Tuesday, December 4, 2018 10:59 am
To: Mark Briers; Marya Bazzi; xxxxx@imperial.ac.uk; mturcotte; xxxx@imperial.ac.uk; Mihai Cucuringu
Cc: Sebastian Vollmer; Jules Manser; Ian Carter; Hallgren, Karl Y; Hogan, Jack; Martin O'Reilly
Subject: RE: Turing DSG - Imperial/LANL

Hi Mark

Emails with the team from Imperial back on 1 Oct said “a lot of work on anomaly detection has already been done in this space, and so one of our objectives for this workshop is to broaden out the scope of data science work in cyber. The red team attack is there for those who would like a concrete classification challenge, but, if anything, we are more interested in seeing some creativity in the other two areas of visualisation and data fusion, and so we are happy to leave this open.”

So I don’t think the big focus is on finding the red team activity? With that in mind, is a subset of the data useful after all for the other elements of the challenge? There was previous mention of a 16GB subset from LANL.

Also, as big data wasn’t discussed as one of the skills until later, when we were discussing the two-pager, it wasn’t highlighted as a necessary skill during participant recruitment. We added “Prior experience working with PySpark or Hadoop would also be useful but is not necessary” to the two-pager. The DSG might well be a good opportunity for participants new to PySpark/Hadoop to learn this from other participants who already have these skills.

Would it make sense to have the usual Azure analysis environment set up as well as the HDInsight cluster, so participants who are not familiar with PySpark, Hadoop or HDInsight can use the more usual tools to look into the data?

Best wishes Catherine

getcarter21 commented 5 years ago

Looks like there is data in this blob storage



martintoreilly commented 5 years ago

The LANL data is in the "dsgimperiallanl" storage account, in a blob container named "lanl-data".
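For anyone pointing a standard data science VM at this container, a minimal sketch of how the account and container names map onto a blob URL follows. It assumes the account sits in the public Azure cloud, where blob endpoints take the form `https://<account>.blob.core.windows.net/<container>`; the helper name `container_url` is hypothetical, and credentials (for example a SAS token or account key) would still be needed and are not shown.

```python
from urllib.parse import quote

# Assumption: public Azure cloud, where blob endpoints follow
# https://<account>.blob.core.windows.net/<container>.
BLOB_ENDPOINT_SUFFIX = "blob.core.windows.net"


def container_url(account: str, container: str, blob_name: str = "") -> str:
    """Build the HTTPS URL for a container, or for a blob inside it.

    Hypothetical helper for illustration only; it does not authenticate.
    """
    url = f"https://{account}.{BLOB_ENDPOINT_SUFFIX}/{quote(container)}"
    if blob_name:
        # Keep "/" unescaped so virtual directory paths inside the container survive.
        url += "/" + quote(blob_name, safe="/")
    return url


if __name__ == "__main__":
    # Names taken from the comment above.
    print(container_url("dsgimperiallanl", "lanl-data"))
```

The same URL can be handed to whichever Azure storage client or tool is already installed on the VM, so both environments read from one copy of the data.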

cathiest commented 5 years ago

Is the additional standard Azure data science VM also set up in the same resource group, so that it can access the same data store?

martintoreilly commented 5 years ago

Closing, as this is specific to the December 2018 DSG.