eugeneyan / eugeneyan-comments


https://eugeneyan.com/writing/end-to-end-data-science/ #4

Open eugeneyan opened 3 years ago

eugeneyan commented 3 years ago

Migrated from json into utteranc.es

eugeneyan commented 4 years ago

Great article! I strongly echo all the points mentioned. IMO, people should care more about the big picture and purpose, not about a title or job description.

Everybody is their own entrepreneur, and the world won't care if your work doesn't deliver value in the end. This is true for all professions, not just data scientists.

I used to feel I didn't like writing tests and doing devops work, and had the illusion that I could become a pure "Kaggle-style" data scientist. No such thing exists in the real world. When I tried to build a full-stack ML-powered app by myself, I had to face every part of the work, and what's interesting is that I discovered I really like software engineering; I even find more of a sense of accomplishment iterating on the app itself than on the model. The mindset matters most: never limit yourself with the self-created myth of "I only work in this area."

Comment by Logan Yang on 2020-08-10T15:53:35Z

eugeneyan commented 4 years ago

> Great article! I strongly echo all the points mentioned. IMO, people should care more about the big picture and purpose, not about a title or job description.

Thanks for the feedback, Logan! Yes, fully agree that delivering value is the key outcome we should strive towards. Oddly enough, I find that being end-to-end, while it may seem like more work, actually makes it easier to deliver value.

Comment by Eugene Yan on 2020-08-10T23:09:05Z

eugeneyan commented 4 years ago

I completely agree with StitchFix wanting data scientists to be end-to-end. I run into your personal anecdotes about encountering a problem and solving it yourself all the time. Yes, it is hard to learn new things constantly, but overcoming these kinds of roadblocks takes energy and creativity, and isn't that one of the things people want in data scientists? The energy spent on defining what a Data Analyst is vs. a Data Scientist vs. a Data Engineer vs. an Analytics Engineer is meaningless. I don't even know what to call myself anymore; I work with data.

Comment by Anonymous on 2020-08-18T20:12:58Z

eugeneyan commented 4 years ago

Completely agree! I have been developing training programs and courses in "data engineering science" and need to go one step further to include the MLE role as well; then it would be what one could call an "end-to-end data engineering scientist."

Comment by Raazesh Sainudiin on 2020-08-20T06:36:22Z

eugeneyan commented 4 years ago

Great article! Working in an engineering team that sometimes deploys ML solutions and runs regular A/B tests, this really resonated with me.

Comment by Anonymous on 2020-08-22T17:13:57Z

eugeneyan commented 4 years ago

Not sure if having only developers test their own code is such a great idea.

Comment by Anonymous on 2020-08-25T22:25:11Z

eugeneyan commented 3 years ago

In software development, the movement to DevOps was entirely about developers deploying and maintaining their own code. It was a reaction against the frequent failures of "throwing releases over the fence" to release managers or sysadmins, with deployment issues and developers shrugging and saying "it works on my machine." Now DevOps has become a role in and of itself, and really it's just a modern sysadmin in the cloud who uses CI/CD tools and was probably once a developer. TDD is also something that was supposed to make developers more end-to-end, but it's rarely done well, and it seems more and more popular to release early and fix things in the wild: early-access releases and beta testing with customers. It's good that all these things are coming up in the data science world, but know that it has all happened before.

Comment by Anonymous on 2020-08-30T14:31:07Z

rvallejov commented 3 years ago

This is really great! I've experienced this first-hand: initially as the first data science hire and later at a bigger tech firm. It is clear to me that being end-to-end is more motivating and more effective at delivering value.

Any tools do you recommend to become more end-to-end? I know Metaflow tries to solve many of these problems. What is your recommendation on this?

eugeneyan commented 3 years ago

Thanks for the kind words Raul!

Thanks for sharing about Metaflow; I wasn't aware of it before. With regard to tooling, I tend to use tools that enable fast prototyping and can scale for small-to-medium deployments. These include numpy (and pandas), mlflow, fastapi, etc. Most of my go-to models are also production-ready (e.g., xgboost, lightgbm, pytorch).
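For a rough sense of how these pieces can fit together, here is a minimal sketch (not a definitive setup): train a small lightgbm model, log it with mlflow, and serve it with fastapi. The data, feature count, and endpoint name are made up for illustration.

```python
# Minimal end-to-end sketch: train -> log -> serve.
# The toy data, feature count, and /predict endpoint are hypothetical.
from typing import List

import lightgbm as lgb
import mlflow
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

# Train a small model on synthetic data (stand-in for real features).
X = np.random.rand(500, 4)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
model = lgb.LGBMClassifier(n_estimators=50)
model.fit(X, y)

# Track the run and model with MLflow.
with mlflow.start_run():
    mlflow.log_param("n_estimators", 50)
    mlflow.sklearn.log_model(model, "model")

# Serve predictions with FastAPI.
app = FastAPI()

class Features(BaseModel):
    values: List[float]  # expects 4 features, matching the training data above

@app.post("/predict")
def predict(features: Features):
    proba = model.predict_proba([features.values])[0, 1]
    return {"probability": float(proba)}

# Run with: uvicorn this_module:app --reload  (module name is hypothetical)
```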

sonnuk commented 3 years ago

Great work! Thank you for sharing this useful information about data science; I really enjoyed reading the blog.

YikSanChan commented 3 years ago

On a side note, I find the old saying interesting:

I hear and I forget. I see and I remember. I do and I understand. – Confucius

Then I was curious how to say that in Chinese, and I found a thread that says it is actually from Xunzi. The original is quoted below:

不闻不若闻之，闻之不若见之，见之不若知之，知之不若行之；学至于行之而止矣。

(Roughly: not hearing something is not as good as hearing it; hearing it is not as good as seeing it; seeing it is not as good as knowing it; knowing it is not as good as putting it into practice. Learning reaches its end when it is put into practice.)

But nvm, Xunzi himself is a big Confucian haha

eugeneyan commented 3 years ago

Wow, this is a great find. I like Xunzi's original quote better as it's more meaningful and has the hierarchy of hearing, seeing, knowing, and doing.

AmirAktify commented 3 years ago

I'm going to respectfully disagree with this post, but mostly because I think there's a hidden false assumption - namely, that if a data scientist doesn't own multiple parts of the production cycle end to end, then you have a whole chain of people waiting on each other and potential blockers to production which would impede delivery. This can be true in many cases, but it doesn't have to be.

Suppose instead that you have the right infrastructure built out rather than having data scientists be generalist experts in everything: a very scalable, easy-to-use data lake from which data scientists can get data, and a templatized pipeline into which models can be dropped (Airflow jobs, MLflow upload).

Then the ML and data engineering orgs can do one-off efforts to set up infrastructure, and data scientists can get data and drop in models without dependency overheads.
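To make the "drop-in" idea concrete, here is a rough sketch, assuming an Airflow + MLflow setup: the platform team owns the DAG template, and a data scientist only supplies the training function. The DAG id, schedule, and training stub are made up for illustration.

```python
# Sketch of a templated pipeline: platform team owns the DAG,
# data scientist drops in train_model(). Names are hypothetical.
from datetime import datetime

import mlflow
from airflow import DAG
from airflow.operators.python import PythonOperator


def train_model():
    """Data scientist's drop-in: pull features from the data lake,
    fit a model, and log it to MLflow. Stubbed out here."""
    with mlflow.start_run():
        mlflow.log_metric("auc", 0.0)  # placeholder; real code would fit and evaluate a model


with DAG(
    dag_id="templated_model_pipeline",
    start_date=datetime(2020, 8, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="train_and_log", python_callable=train_model)
```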

And everyone can focus on their best skills and specialization, without having to re-sync, align, and manage multiple hand-offs between data engineers, ML engineers, and data scientists on each new project. (And here I do agree: doing this repeatedly and from scratch would slow down delivery and success heavily; you want to set up processes so that the infrastructure and templated pipelines do most of the heavy lifting with minimal friction.)

The main reason I advocate for this approach is that a lot of data science, and even data engineering and ML engineering, problems are very deep if you start going for hyper-optimized models. And the "person wearing many hats" will not necessarily converge on the strongest solutions if there is frequent context switching and generalization overhead involved.

xuanswe commented 1 year ago

Hi,

A nice article, thanks for sharing!

First of all, I would like to share what I believe about the full-stack software engineer role. In the past, I thought that becoming a full-stack software engineer was a good idea, but it turns out I was wrong. I should spend 80% of my time on my favorite area as a backend engineer and at most 20% on everything else, prioritizing whatever is most helpful to my specialized area. For example, frontend development, devops, and monitoring skills are more helpful to me than UX/UI design, etc.

Now I am thinking about becoming an ML engineer, which is how I found this article. From my experience as a software engineer, I think I am going to apply the same rule to my future ML engineer job. I would definitely create personal ML project(s) to be an end-to-end person there. I would love to "temporarily" wear many hats in an ML team when necessary and try to keep myself up to date on all hats regularly. Basically, I would split my effort with 80% for my specialized area and 20% for everything else in the end-to-end product delivery workflow. The 20% should be at the level of "I do and I understand."

What do you think about my approach?