aicoe-aiops / data-science-workflows


Add Blog for "R vs Python for Data Science" #109

Closed aakankshaduggal closed 2 years ago

aakankshaduggal commented 2 years ago

Add a blog post based on Sanjay's explanation of "R or Python to get familiar with data science" to docs/develop_collaborate. This document would serve as a good starting point for collaborators who are getting started with data science.

schwesig commented 2 years ago

Just to capture the explanation; this is not the final destination nor the final layout. Posted by Sanjay Arora in https://chat.google.com/room/AAAARndRdLM/Dln4LLbZZwg (chat in AICoE - Artificial Intelligence CoE Open Room), Thu 2022-02-03 11:00am.

I started writing some thoughts but it got much longer than I wanted it to be. A lot of this might not apply to you but I'll post it for others who are interested in DS:

Data science is a very broad term and there are multiple options before you.

  1. In terms of basic languages, python is used heavily in industry and in computer science departments (where a lot of ML research happens). R is used heavily by groups doing "classical" statistics, e.g. statistics, medicine/clinical groups, psychology, etc. R has a very rich set of libraries in this arena while python is still lacking (although there are packages like statsmodels; see the short sketch after this list). When it comes to ML, especially deep learning, reinforcement learning, etc., python dominates and I don't know of anyone who uses R for those. Here python is almost always used as a prototyping language, with all the core functionality implemented in a lower-level language (mainly C++). So, a practitioner might write a neural network using PyTorch or numpy which, in turn, have parallelized SIMD or GPU-based operations that are pre-compiled. An interesting alternative language is Julia, but if you are starting out, I would stick to python and/or R.

  2. The day-to-day of data scientists can be vastly different. The situation is analogous to being a "software engineer". Person A might be a low-level kernel hacker while Person B might be writing front-end code in React. Same job title, very different skill sets even though both can write code.
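To make the "classical statistics" point in item 1 concrete, here is a minimal sketch of an ordinary least squares fit with statsmodels. The data are synthetic and only meant to illustrate the style of the API; R users will recognize the summary table.

```python
# Minimal OLS example with statsmodels (synthetic data, for illustration only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

X = sm.add_constant(x)          # add an intercept column, like R's implicit intercept
model = sm.OLS(y, X).fit()      # fit y ~ x by ordinary least squares
print(model.summary())          # R-style summary: coefficients, std errors, R^2, etc.
```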

There are a few things you can do to get a flavor of data science:

ML: The classic course used to be Andrew Ng's Coursera course: https://www.coursera.org/learn/machine-learning. I would still recommend it. In addition, I would recommend understanding the mathematics carefully/slowly and implementing not just the assignments but each algorithm from scratch in python. This will help you get a flavor for how one approaches models in ML. You might never touch the mathematics again, but it'll give you an appreciation for the design and analysis of ML algorithms.
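As an example of what "from scratch" can look like, here is a small sketch of linear regression trained by batch gradient descent in numpy; the dataset is synthetic and the hyperparameters are arbitrary.

```python
# Linear regression by batch gradient descent, implemented from scratch with numpy.
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                       # 500 samples, 3 features
true_w, true_b = np.array([1.5, -2.0, 0.7]), 0.3
y = X @ true_w + true_b + rng.normal(scale=0.1, size=500)

w, b = np.zeros(3), 0.0
lr = 0.1
for step in range(500):
    pred = X @ w + b
    err = pred - y
    grad_w = X.T @ err / len(y)                     # gradient of mean squared error w.r.t. w
    grad_b = err.mean()                             # gradient of mean squared error w.r.t. b
    w -= lr * grad_w
    b -= lr * grad_b

print("learned w:", w, "learned b:", b)             # should be close to true_w, true_b
```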

Data Manipulation: A huge part of data science jobs is getting the data into the right structure and format, as well as exploring and checking ideas in the dataset. R has amazing libraries for this, and in the python world, pandas essentially replicated these functionalities. Similarly, getting comfortable with any plotting library will be very useful. Numpy is another essential python package, and a general principle is to replace as many loops in your code with highly-optimized numpy functions as possible. The best way to do this is to pick a dataset from work or from kaggle and start using pandas and matplotlib to explore patterns/hypotheses.
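For instance, a loop-heavy computation can often be replaced with a single vectorized pandas/numpy expression; the column names below are made up purely for illustration.

```python
# Replacing an explicit Python loop with a vectorized pandas/numpy operation.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, 12.5, 9.9, 14.2],
    "quantity": [3, 1, 7, 2],
})

# Loop version (slow on large frames):
totals = []
for _, row in df.iterrows():
    totals.append(row["price"] * row["quantity"])

# Vectorized version (one optimized numpy operation under the hood):
df["total"] = df["price"] * df["quantity"]

print(df)
print(np.allclose(totals, df["total"]))   # same result, much faster at scale
```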

Kaggle: A good way to apply all this is to pick a kaggle problem. Do pick one that has a tabular dataset and not images, text, etc. Building models for a specific task that can be scored is a great way to learn new modeling techniques. A downside of kaggle is that most real-world problems are not that well-defined and don't require getting an extra 0.001% accuracy. Models on kaggle tend to be too complicated (an ensemble of 50 models, for example), but even with these caveats, kaggle is a great way to learn practical modelling.
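A typical first pass at a tabular kaggle problem might look like the sketch below. The file name and column names are placeholders, and the model is just one common baseline, not a recommendation from the original post.

```python
# Rough baseline workflow for a tabular dataset (file and column names are placeholders).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")                     # hypothetical Kaggle training file
y = df["target"]                                  # hypothetical label column
X = df.drop(columns=["target"]).select_dtypes("number").fillna(0)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_val, model.predict(X_val)))
```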

Data Pipelines: This is arguably what a data engineer does, but more and more data scientists now spend their time with some pipelining infrastructure. At Red Hat, you can look up operate-first (https://www.operate-first.cloud/), which has Kubeflow Pipelines installed. One can specify these pipelines in python (https://www.kubeflow.org/docs/components/pipelines/sdk/sdk-overview/) and execute large jobs in parallel. This is a crucial part of a data scientist's job, since they generally come up with data transformations, joins, etc., and having the ability to execute these workflows on large datasets is critical.
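To give a flavor of what such a pipeline definition can look like, here is a minimal sketch in the style of the Kubeflow Pipelines (kfp) SDK. The component and pipeline names are invented, and the exact decorators and compile call depend on the kfp SDK version in use.

```python
# Rough sketch of a Kubeflow Pipelines definition (kfp v2-style SDK; details vary by version).
from kfp import compiler, dsl

@dsl.component
def clean_data(rows: int) -> int:
    # Placeholder transformation step; a real component would read and write actual data.
    return rows * 2

@dsl.component
def train_model(rows: int) -> str:
    return f"trained on {rows} rows"

@dsl.pipeline(name="example-data-science-pipeline")
def example_pipeline(rows: int = 1000):
    cleaned = clean_data(rows=rows)
    train_model(rows=cleaned.output)

# Compile to a spec that can be submitted to a Kubeflow Pipelines installation.
compiler.Compiler().compile(example_pipeline, package_path="example_pipeline.yaml")
```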

  3. The other crucial part that many people discount is their own domain expertise. Many data scientists get stuck at step 2. They know enough techniques and programming but have a hard time going deep into a problem. All problems then seem to have the same flavor: get a CSV file, make some plots, train a simple model and show it to someone who says "eh? that's not very useful", and nothing happens. Rinse and repeat.

A big reason for this is the lack of domain knowledge/expertise. In almost every scientific field, the data scientist is actually a physicist, chemist, psychologist, mathematician (numerical experiments), etc. who has a deep understanding of their field and picks up the necessary techniques to analyze their data. They have a set of questions they want to ask, and they have the knowledge to interpret the results of their models and experiments. A problem with being a data scientist without that expertise is that all the focus falls on the techniques and not the actual questions. It strips away the ability to generate new experiments. The only solutions are to either work with an expert or, even better, to start learning the field one is interested in. This does take a long time but pays rich dividends.

So, going at this the other way, i.e. where one knows a field and learns skills from point 2., can be very powerful.

  4. In many cases, there's also the option of going deep into the techniques. Here are some examples. A big caveat is that some of these are very specialized and generally need a lot of dedicated time. Also, most data scientists will never touch most of what's below.

Deep Learning: learning not just the basics of neural networks and their architectures but learning to design new ones. Understanding the tradeoffs in their design. Getting comfortable with the tools, e.g. PyTorch, GPU kernels, possibly some C or Julia code, etc., that let one carry out diverse experiments and scale them. And reading a lot of papers. https://paperswithcode.com/ is a great resource. Note that there are specialized sub-fields like computer vision which do a lot more than throw a convolutional neural network at an image.
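As a taste of the PyTorch side of this, here is a minimal sketch of defining a small feed-forward network and running one training step on random data; the architecture and data are arbitrary and only illustrate the mechanics.

```python
# Minimal PyTorch sketch: a small feed-forward network and one gradient step on random data.
import torch
from torch import nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(16, 32),
            nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, x):
        return self.layers(x)

model = SmallNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(64, 16)                 # a random batch of 64 examples
y = torch.randint(0, 2, (64,))          # random class labels

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()                         # autograd computes gradients
optimizer.step()                        # update the parameters
print("loss:", loss.item())
```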

Reinforcement Learning: this is even more specialized, but it's a fast-growing, intellectually rich field. Again, this involves essentially reading and understanding (and implementing) lots of papers, identifying sub-threads that one finds interesting, and applying or extending them. It's generally more mathematical than deep learning. Here are some books/resources (a small worked sketch follows the list):

The standard textbook (Sutton & Barto): http://www.incompleteideas.net/book/the-book-2nd.html

A great online course: https://rail.eecs.berkeley.edu/deeprlcourse/

A collection of papers: https://sites.google.com/view/berkeley-cs294-190-fa21/week-1-reward-free-pre-training-and-exploration
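To ground the RL pointers above, here is a sketch of tabular Q-learning, the basic algorithm covered early in the Sutton & Barto book, on a toy chain environment invented purely for illustration.

```python
# Tabular Q-learning on a toy 5-state chain: move right from state 0 to reach a reward at state 4.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.95, 0.1
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

for episode in range(500):
    s = 0
    while s != n_states - 1:
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q)   # the learned values should favor action 1 (right) in every state
```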

Hierarchical Models: These are extremely powerful tools and a great (R-based) book is: https://xcelab.net/rm/statistical-rethinking/
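The book works in R/Stan, but to give a rough flavor of what a hierarchical (partially pooled) model looks like in the python ecosystem, here is a sketch using PyMC on synthetic grouped data; nothing in this sketch comes from the book itself.

```python
# Rough sketch of a hierarchical (partially pooled) model in PyMC on synthetic grouped data.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_groups, n_per_group = 8, 30
true_group_means = rng.normal(loc=0.0, scale=1.0, size=n_groups)
group_idx = np.repeat(np.arange(n_groups), n_per_group)
y = rng.normal(loc=true_group_means[group_idx], scale=0.5)

with pm.Model() as model:
    # Population-level ("hyper") parameters shared across groups.
    mu = pm.Normal("mu", mu=0.0, sigma=1.0)
    tau = pm.HalfNormal("tau", sigma=1.0)
    # Group-level means, partially pooled toward the population mean.
    group_mu = pm.Normal("group_mu", mu=mu, sigma=tau, shape=n_groups)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("obs", mu=group_mu[group_idx], sigma=sigma, observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)
```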

To give a sense of how specialized things can get, there's the sub-field of optimal statistical decision making. E.g. see: https://www.amazon.com/Optimal-Statistical-Decisions-Morris-DeGroot/dp/047168029X or https://tor-lattimore.com/downloads/book/book.pdf (related to reinforcement learning).

  5. Lastly, a philosophical point: there are two extreme viewpoints. One is to know which tool to use, pick up a pre-implemented tool online, and apply it to one's problem. This is a very reasonable approach for most practical problems. The other is to deeply understand how and why something works. This approach takes much more time but gives one the advantage of modifying/extending the tool to make it more powerful.

The problem with the first approach is that when something doesn't work, it's easy to give up since one doesn't understand the internals. The problem with the second approach is that it generally must be accompanied by application to problems (practical or not), otherwise it's easy to sit in a theoretical void.

My very opinionated advice is to do both. Always apply the techniques to problems. The problems can be artificial, where you generate the dataset, or they can be real problems. See where they fail and where they succeed. But also don't ignore the math. The goal is also to understand and not just use, and the understanding almost always has some mathematical element. Initially it might seem like a foreign language, but eventually it lets one generate new ideas and see connections that are just hard to see otherwise. Sometimes the mathematics can seem gratuitous, but even then it is a post-hoc justification that suggests new predictions and new experiments to do.