alan-turing-institute / data-training-for-bioscience

Introduction to Data Science Project Management for Project Leaders.

Proposal of topics and masterclasses + presentation #16

Open LydiaFrance opened 2 years ago

LydiaFrance commented 2 years ago

Hi @fedenanni and @malvikasharan

This is the presentation I've put together based on our last meeting. It provides the motivation/overview of what I think the training should cover. I think it would be good to present this to the Crick soon.

https://docs.google.com/presentation/d/1vwRLcj01DiPJ_OWqQzMb8PQxbQgv9dNwyj8jku02UUM/edit?usp=sharing

fedenanni commented 2 years ago

Hey @LydiaFrance, for me it looks great!! If @malvikasharan is happy, I think you can reach out to the Crick and set up a meeting soon to go through this with them! Great work!

(I’m back on Monday but I saw the notification and was pretty curious :D )

malvikasharan commented 2 years ago

This looks great to me Lydia. Please do send it to the full list of recipients. For the specific ask, please tag James B., James F. and Rebecca in your email. I have to add another person to the thread which I will do now, and you can take things from there. Pinging @KirstieJane if she would like to add something.

malvikasharan commented 2 years ago

Comment by James Briscoe:

I would aim for fewer than 5-6 distinct classes. I doubt most GLs would commit to that many. One option would be to aim for ~2 core sessions and then additional sessions focused on specific issues. The two core sessions might be along the lines “Best practices in managing computational biology projects” and “An overview of AI/ML and deep learning in biomedical research for group leaders”.

It will be important that course material is as practical as possible and tailored to biologists, using examples typical of what we do. This is an issue for all computational training (and interdisciplinary training more generally) – the content of many computational courses, beyond introductory level, tends to come from disciplines that have different working practices and use examples that are not directly relevant to what most of us do. (As an example: while there are undoubtedly things we can learn from agile working, biomed research is never going to be the same as software development.)

It would be useful to involve some STP leads from the core facilities that are data heavy (I’m thinking BABS and Imaging) so that the course material is directly aligned with what happens at the Crick. Along the same lines, James Turner is leading on Open Science and reproducibility in the Crick and it will probably be helpful to get his input.

malvikasharan commented 2 years ago

Comment by Lydia:

With a new PhD student, which of these training courses are compulsory? I’m trying to get an idea about the role of a group leader in directing their students to training, whether the student has to volunteer, or whether they all go through a basic level of training.
As for the in-house training, I can’t access the Crick intranet, so I can’t see what is taught. It would be particularly helpful to see what is in the “Data Science Specialisation with R”, “Programming courses on Tutorialspoint” and “Crick data challenge”.

The training I’m writing is from a top-down perspective and designed for the Group Leaders, but part of that is giving the leaders a view of what their lab members should know. This should therefore feed into the Crick training, so I’ll need to flag anything that might be missing. For example, I can’t see anything about version control/Git in the training titles (the exception is the R Advanced Courses from Jumping Rivers). This will be an important part of the training course I’m building, and if there are no obvious resources for lab members to learn it then that’s a problem.

It would also be helpful to know about the data management expectations within the Crick, and if there is a standardised data management process.

Response from James Fleming:

None are compulsory, and there is no particular guidance on progression at the moment either – it’s very much a conversation between student and GL. It’s the first focus area we’ve taken away, to try and group the courses by level and then by ‘flow’, i.e. which prerequisites you should take first. The Crick data challenge is an event rather than a training course – it’s designed to pair experts from scientific computing, bioinformatics and computational labs with more ‘wet lab’ scientists, with the aim of solving novel problems. Many of these then spawn ongoing projects over time.

Tutorialspoint is available here: https://www.tutorialspoint.com/computer_programming/index.htm

The Johns Hopkins Data Science specialisation is here: https://www.coursera.org/specializations/jhu-data-science

Agree around the gaps – again, something we are looking at ourselves. Not crowdsourced views yet, but at a glance, the gaps for me are at least the following:

- Software engineering practice – version control, backlog management, unit testing, quality assurance
- Software architecture – principles of design, services, APIs, code documentation
- Software engineering management for teams – agile methodology, source management, sprint management, CI/CD frameworks, deployment management
- Infrastructure – HPC, VMs, Cloud, Containers
- Building and managing databases – choosing the right technology, relational, non-relational, graph, schemas, efficient query design
- Tensorflow/Nextflow – designing and optimising effective pipelines
- Many areas of AI & ML – architecture of networks, designing effective approaches, understanding data suitability, safety & ethics, discoverability/transparency of outcomes
- Data visualisation – Shiny, PowerBI, Tableau etc.
- FAIR Data and effective data management

I’m sure there are many more!

On your question around data management, our policies currently stop at the basics: where to store data, retention policies, expectations around publication, etc. There is a lot to do around embedding consistent data approaches in experimentation, metadata management and so on. Looping in Karen Ambrose, who leads the Research Data Services team and is leading the programme to address these areas.