carbon-language / carbon-lang

Carbon Language's main repository: documents, design, implementation, and related tools. (NOTE: Carbon Language is experimental; see README)
http://docs.carbon-lang.dev/

Supporting Data Science in Carbon #2401

Closed vgvassilev closed 1 year ago

vgvassilev commented 1 year ago

Summary of issue:

Where does Carbon stand with respect to the domain of data science and scientific computing?

Details:

Motivation

C++ has an important role in scientific computing and industry because its design principles promote efficiency, reliability, and backward compatibility -- a vital tripod for any long-lived codebase. Other ecosystems such as Python have prioritized usability and safety while trading off some efficiency and backward compatibility. Better usability usually means shorter time-to-insight, which is highly valued in data-science-driven use cases.

As of today, to my knowledge, there is no widely adopted language which works well with large C/C++ codebases while offering a rapid application development model fit for Data Science. Maybe the closest in this space is Julia.

Relevance

The domain of Data Science has grown significantly over the last decade. The domain has seen 650% job growth since 2012 (as per LinkedIn) and the U.S. Bureau of Labor Statistics estimates 11.5 million new jobs by 2026.

While Data Science has no widely accepted definition, several observations are relevant from a language-design perspective:

Opportunities

I decided to reach out as I believe there might be a good opportunity to design a robust, ground-up solution to these non-orthogonal problems at relatively low design cost.

Questions:

I understand that we are at an early development stage of the language and I tried to keep my questions high-level.

I am happy to elaborate more if necessary.

Any other information that you want to share?

No response

cc: @zygoloid

chandlerc commented 1 year ago

FWIW, I hope that Carbon could be a compelling alternative to C++'s usage in data sciences. I think there are use cases in this domain that C++ isn't a good fit for and that Carbon wouldn't be either -- cases where IMO higher level languages like Python and especially Julia or Matlab thrive. I wouldn't want to try to compete with these higher level languages that more directly allow working with high level mathematical constructs. I'm a bit biased, but I find Julia especially compelling in that area. =]

But outside of that, there are definitely cases (even in my limited exposure to the domain) where C++ is an excellent and widely used tool, and where I think we can and should make Carbon an even more compelling option.

I understand that we are at an early development stage of the language and I tried to keep my questions high-level.

  • Where do Carbon’s design principles stand in the tradeoffs between compatibility with C++, performance, reliability and ease of use?

We've tried to spell this out here: https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#language-goals-and-priorities

Is there something that isn't clear from this?

  • Is Carbon interested in supporting Data Science-driven use cases?

I may have touched on this above, but I would say for some use cases, this makes a lot of sense. For others, it may not.

Mostly, I think that much of what you describe really benefits from having more than one language, so that different languages can specialize a bit.

I think languages like Julia (and Python and others) do a great job of providing a very high-level, mathematical working environment for data science. They provide amazing usability, especially for researchers. But the performance is still excellent, and mathematical modeling and optimization are powerfully supported.

But there are a lot of places where to implement core pieces of this, a lower-level language that can operate much more closely to the machine is the right call to extract every last inch of performance and capability that the machine has with minimal overhead. I think this is where C++ thrives today in data-sciences, and I think this is an area that Carbon should absolutely make sense for and be a really compelling option. It's also something I think we should invest in on the Carbon side to make sure we have a compelling story.

But I'm hesitant to try to merge these use cases. In both of them, I think the ability of the languages to specialize towards the specific needs of that side of the problem domain is huge, and likely outweighs the costs of having two languages. Instead, I would be more interested in seeing excellent interop between Carbon and these higher-level languages, to make it easy to use them together, each where it is the most appropriate tool for the job. I'm not sure this is something we can prioritize in the short term, but only because we need to get over the C++ interop phase of the project first. Beyond that, I think this would be a really exciting area if anyone wanted to invest and contribute to making things like Julia+Carbon data science workflows really excellent and powerful.

All of this is of course just my two cents. Is this helpful? Are there things I could clarify more here?

vgvassilev commented 1 year ago

Thanks for the quick response.

FWIW, I hope that Carbon could be a compelling alternative to C++'s usage in data sciences. I think there are use cases in this domain that C++ isn't a good fit for and that Carbon wouldn't be either -- cases where IMO higher level languages like Python and especially Julia or Matlab thrive. I wouldn't want to try to compete with these higher level languages that more directly allow working with high level mathematical constructs. I'm a bit biased, but I find Julia especially compelling in that area. =]

But outside of that, there are definitely cases (even in my limited exposure to the domain) where C++ is an excellent and widely used tool, and where I think we can and should make Carbon an even more compelling option.

I understand that we are at an early development stage of the language and I tried to keep my questions high-level.

  • Where do Carbon’s design principles stand in the tradeoffs between compatibility with C++, performance, reliability and ease of use?

We've tried to spell this out here: https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#language-goals-and-priorities

Is there something that isn't clear from this?

In my experience, HPC-based data science use cases (and scientific computing in general) put "code that is easy to read, understand, and write" possibly in first place, with performance generally a second consideration. However, in probably 99% of cases one does not contradict the other.

It seems to me that the design priorities, as written, fit the major requirements for supporting data science/scientific computing very well. Hence my question here. I assume that small tweaks to the language early on could help an entire data science ecosystem flourish. I am not asking the language development team to build such an ecosystem, but to put the right hooks in early so a community can do so.

  • Is Carbon interested in supporting Data Science-driven use cases?

I may have touched on this above, but I would say for some use cases, this makes a lot of sense. For others, it may not.

Mostly, I think that much of what you describe really benefits from having more than one language, so that different languages can specialize a bit.

I think languages like Julia (and Python and others) do a great job of providing a very high-level, mathematical working environment for data science. They provide amazing usability, especially for researchers. But the performance is still excellent, and mathematical modeling and optimization are powerfully supported.

Their success comes partly from good package-management integration and from reducing time-to-insight by hiding the edit-compile-run cycle. I believe I read some plans about package management, and there seems to be an interpreter being built. Do you see anything else that's missing in comparison to these high-level languages?

But there are a lot of places where to implement core pieces of this, a lower-level language that can operate much more closely to the machine is the right call to extract every last inch of performance and capability that the machine has with minimal overhead. I think this is where C++ thrives today in data-sciences, and I think this is an area that Carbon should absolutely make sense for and be a really compelling option. It's also something I think we should invest in on the Carbon side to make sure we have a compelling story.

IMO, C++ has missed out by not supporting data science well, probably due to the lack of quick language evolution, as you point out in Carbon's rationale.

But I'm hesitant to try to merge these use cases. In both of them, I think the ability of the languages to specialize towards the specific needs of that side of the problem domain is huge, and likely outweighs the costs of having two languages. Instead, I would be more interested in seeing excellent interop between Carbon and these higher-level languages, to make it easy to use them together, each where it is the most appropriate tool for the job. I'm not sure this is something we can prioritize in the short term, but only because we need to get over the C++ interop phase of the project first. Beyond that, I think this would be a really exciting area if anyone wanted to invest and contribute to making things like Julia+Carbon data science workflows really excellent and powerful.

How do you see the interop between a compiled language (which I believe you aim Carbon to be) and a dynamic language such as Julia/Python?

All of this is of course just my two cents. Is this helpful? Are there things I could clarify more here?

Thanks for the clarification. This is really helpful. Do you have any vision for the concurrency language model and concurrent program failover?

chriselrod commented 1 year ago

Hi, I'm a developer who has used Julia extensively over the past few years. I work for both JuliaHub (formerly known as JuliaComputing) and PumasAI, and I've written and maintain a lot of open source Julia packages. My opinions presented here are purely my own and don't reflect that of anyone else.

A big part of Julia's pitch is that it tries to solve the "two language problem", which is that languages like Python, R, and MATLAB are great for prototyping, but you'll need to rewrite an algorithm in C++ if you want performance.

JIT latency issues aside, I do think Julia already delivers here if you dig deep enough. E.g., SimpleChains.jl was around 9x faster than PyTorch (leveraging MKL) at training MNIST on the CPU of an AVX512 system, Octavian.jl is competitive with OpenBLAS's assembly kernels, and RecursiveFactorization.jl delivers better single-threaded LU-factorization performance than MKL or OpenBLAS below sizes of around 500x500, and is thus a major contributor to the exceptional performance of Julia's stiff ODE solvers (at least for small to medium-sized problems). These libraries are all "pure"-Julia re-implementations of code normally written with architecture-specific assembly kernels.

However, I think there is another aspect to the two language problem: ergonomics. Dynamic typing is great for prototyping and quick throw away scripts. A non-pedantic language gets out of your way, and lets you quickly write something to get the answer you're looking for and move on with your life.

But, getting out of your way isn't far from "sink or swim". It won't hold your hand, it won't push you towards good practices. If you want to write good Julia, you need extreme discipline and to be highly motivated to do so for its own sake. You are on your own. Languages like C++ or Rust make it harder to knock out quick scripts, but make it easier to develop large, robust systems.

Maybe it's a case of "the grass is always greener", but I have spent months of my time tracking down regressions or performance problems in the ecosystem that simply wouldn't even have been possible in a language like C++ or Rust. As someone maintaining a large ecosystem and developing (proprietary) packages on top of it, I believe the dynamism and laissez faire approach of the Julia compiler have made me less productive, not more.

There are more dimensions to the utility of a tool than merely the runtime performance of the product you get. I think there is a lot that more static languages can offer us.

One thing important to Julia, and I believe to data science in general, is good support for automatic differentiation. Data science, machine learning, and statistics all rely heavily on it: for optimization, for estimating standard errors, and for integration via sampling (e.g. Hamiltonian Monte Carlo) or approximations such as Laplace's method. We make heavy use of all of these.

Perhaps this is a problem EnzymeAD can eventually solve. It's an automatic differentiation library that works on LLVM IR. Ideally, I could write code in Carbon (or C++, etc.), and a Julia library could still differentiate it with minimal boilerplate. That would of course require specifying in advance which derivatives we need to support (which IMO is not so bad), and having access to the Carbon compiler with JIT support (to JIT the needed code), or at least to the generated LLVM IR (so Julia's JIT can handle it).
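[Editorial note: for readers less familiar with the technique, here is a minimal forward-mode automatic-differentiation sketch using dual numbers. This is unrelated to Enzyme's LLVM-IR-level approach and the `Dual` type is purely illustrative; it only shows why AD is neither symbolic nor numerical differentiation, but exact derivative propagation through ordinary code.]

```cpp
#include <cassert>
#include <cmath>

// A dual number carries a value and its derivative; arithmetic on duals
// applies the chain rule automatically as the computation runs.
struct Dual {
    double val;  // f(x)
    double der;  // f'(x)
};

Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.der + b.der}; }

Dual operator*(Dual a, Dual b) {
    // product rule: (fg)' = f'g + fg'
    return {a.val * b.val, a.der * b.val + a.val * b.der};
}

Dual sin(Dual a) {
    // chain rule: (sin f)' = cos(f) * f'
    return {std::sin(a.val), std::cos(a.val) * a.der};
}

// Any function written against Dual is differentiated for free.
// Here f(x) = x*x + sin(x), so f'(x) = 2x + cos(x).
Dual f(Dual x) { return x * x + sin(x); }
```

Seeding the input with `der = 1` (i.e. `f(Dual{2.0, 1.0})`) yields both `f(2)` and `f'(2)` in a single pass, with no finite-difference error.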

chandlerc commented 1 year ago

Catching up on some of the unanswered questions below, but I also want to suggest that we should not spend too much time debating between dynamic vs. static languages. Given the goals and direction of Carbon (specifically w.r.t. C++ successor), it seems clear where it is aiming.

Beyond answering some high-level questions below, it's not clear that there is a specific decision being requested of the leads here? If not, I might move this to the discussion forum instead of an issue.

Do you see anything else that's missing in comparison to these high level languages?

Nothing jumps to mind, but to an extent I think this will need to be revisited when we're further along and have more concrete things to look at in this space.

But I'm hesitant to try to merge these use cases. In both of them, I think the ability of the languages to specialize towards the specific needs of that side of the problem domain are huge, and likely outweigh the costs of having two languages. Instead, I would be more interested in seeing excellent interop between Carbon and these higher level languages to make using them together, each when it is the most appropriate tool for the job. I'm not sure this is something we can prioritize in the short term, but only because we need to get over the C++ interop phase of the project first. Beyond that, I think this would be a really exciting area if anyone wanted to invest and contribute to making things like Julia+Carbon data sciencies workflows really excellent and powerful.

How do you see the interop between compiled language (which I believe you aim Carbon to be) and a dynamic language such as Julia/Python?

Nothing concrete this early on, but I think it's definitely an area to invest time on when things are more concrete and there are users with specific needs and use cases in this space.

Do you have any vision for the concurrency language model and concurrent program failover?

Not yet, beyond needing to stay somewhat close to C++ for the sake of interop. For example, I think we'll need to share the core memory model. But that leaves a lot of room. =D

Beyond that, all I can do is guess (and only guess!) at scheduling and sequencing here: I think we'll likely look at pinning down concurrency as part of the very next iteration beyond the most basic MVP we have at 0.1. So I would guess that's when we'll be able to start getting much more concrete here.

wwlittewayne commented 1 year ago

IMHO, it would be of great help if Carbon officially supported a generic n-dimensional tensor data type, since almost all scientific problems are addressed with math on high-dimensional data structures.

Take myself as an example. I am not an IT engineer, and my colleagues and I are not good at programming, but we have to use C/C++ to write Monte Carlo simulations to evaluate the performance of the algorithms we implement to solve scientific problems. Unfortunately, C++ does not support tensor data at the moment, which means that if we want to add two 3-dimensional arrays we need to write three nested for loops: for (int i = 0; i < L; i++) for (int j = 0; j < M; j++) for (int k = 0; k < N; k++) c[i][j][k] = a[i][j][k] + b[i][j][k];

We do have another choice: define our own tensor data type and overload all the operators and functions in a header file, and there are also third-party libraries of this kind. However, we are not professional programmers; if there are bugs in those libraries, we are not able to fix them, so we keep writing nested for loops for correctness. So we really hope a tensor data type will be supported by the language officially. High-level scientific functions are less necessary; maybe those can be implemented in domain-specific packages in the future.
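[Editorial note: a minimal sketch of the "define your own type and overload the operators" approach the comment describes. The `Tensor3` name and layout are hypothetical, not from any real library; a flat row-major buffer replaces the nested loops at every call site.]

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 3-D tensor owning a flat row-major buffer.
struct Tensor3 {
    int L, M, N;
    std::vector<double> data;

    Tensor3(int l, int m, int n) : L(l), M(m), N(n), data(static_cast<size_t>(l) * m * n, 0.0) {}

    double& operator()(int i, int j, int k) {
        return data[(static_cast<size_t>(i) * M + j) * N + k];  // row-major indexing
    }
    double operator()(int i, int j, int k) const {
        return data[(static_cast<size_t>(i) * M + j) * N + k];
    }
};

// Element-wise addition: c = a + b written once, instead of
// three nested loops repeated at every use site.
Tensor3 operator+(const Tensor3& a, const Tensor3& b) {
    assert(a.L == b.L && a.M == b.M && a.N == b.N);
    Tensor3 c(a.L, a.M, a.N);
    for (size_t idx = 0; idx < c.data.size(); ++idx)
        c.data[idx] = a.data[idx] + b.data[idx];
    return c;
}
```

With this in a header, the simulation code reduces to `Tensor3 c = a + b;`, which is exactly the ergonomic gap the comment is pointing at.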

geoffromer commented 1 year ago

One thing to keep in mind is that in Carbon, our intent is that all APIs are library APIs -- even integers will be defined as library types rather than being built directly into the language, so the same will presumably apply to tensors. That being the case, the question of whether tensors are "officially supported" seems like it's more about resource allocation and support commitments than about language or library design. If so, it's probably too early to usefully discuss that -- even Carbon itself is years away from being "officially supported".

chriselrod commented 1 year ago

Note that C++23 adds mdspan as non-owning multidimensional arrays. Of course, there are also plenty of libraries available.

dutkalex commented 1 year ago

IMHO, multidimensional arrays are a key selling point for HPC folks. C++ does not support them (mdspan does not count), and this is a major adoption barrier in the HPC community. Of course this is not a priority for Carbon at this point, which completely makes sense, but I do believe that planning to support them at some point is mandatory if the scientific community is not to be left behind. I am not yet familiar with all the implications of the "all APIs are library APIs" principle, but off the top of my head, the slicing operator array[i, j:j+4, k, :] is something that Python, Julia, Matlab, and Fortran all support, and I think that leaving the door open for such constructs in the future is a design decision that must be made quite early.

jonmeow commented 1 year ago

Regarding multidimensional arrays as well as slicing, there's an intent to support (and work to keep the door open in syntax). It's important and there have been a couple attempts but it's an area that needs significant design time, and it's not something that's immediately on the roadmap for 0.1, which is where we're focused right now.

chandlerc commented 1 year ago

Talked with other leads, and we're going to close this for now -- basically matching the comments here and here.

We're very much interested in supporting these use cases, and the specific areas that have been surfaced (concurrency, parallelism, slicing, and multi-dimensional data) are on our radar and areas we'd like to see investment in. But we want to wrap up our roadmap for 0.1 first, so these probably won't see focused effort in the short term.