Initial design meeting - Githubissues

codedthinking / Kezdi.jl

An umbrella of Julia packages for data analysis, in loving memory of Gábor Kézdi

MIT License


Initial design meeting #16

Closed korenmiklos closed 1 month ago

korenmiklos commented 2 months ago

With @gergelyattilakiss we are building a Julia package to help typical applied economics workflows for data cleaning, as well as exploratory and regression analysis. The syntax follows that of Stata (R), a statistical tool widely used by economists. Our work is inspired by https://github.com/TidierOrg/Tidier.jl and https://github.com/jmboehm/Douglass.jl.

The motivation is to help more economists adopt Julia, giving them a performant scientific computing language that they can use not only for macroeconomics simulations, but also applied microeconomics data work.

We have some ideas for the design of the tool (see code example below). We are also analyzing Stata code produced by economists, as submitted to journals, to study the patterns of coding and the most frequent commands used.

Please join us for a design meeting to discuss what such a tool should and should not do. If you have strong views about Julia, Stata, or applied economics, please come and share them.

If you can join for a 2-hour Zoom meeting at the end of April, please let us know in a comment below.

@chain df begin
    @keep country_code year gdp population
    @generate log_gdp = log(gdp)
    @generate log_population = log(population)
    @egen min_log_gdp = min(log_gdp)
    @replace log_gdp = min_log_gdp @if missing(log_gdp)
    @collapse log_gdp log_population, by(country_code)
    @regress log_gdp log_population, robust
end

korenmiklos commented 2 months ago

Can you join @kdpsingh, @jmboehm, @cpfiffer, @tpapp? It would be great to hear your thoughts on this!

tpapp commented 2 months ago

@korenmiklos, have you looked at Query.jl, DataFramesMeta.jl, and SplitApplyCombine.jl? What are you missing from these (and Douglass.jl) that requires a new package?

floswald commented 2 months ago

I feel a bit like the uninvited guest who comes to spoil the party - sorry! Anyway, my aim is not to spoil the party. I've talked many times with @jmboehm about this and I think @tpapp's question is spot on. What exactly is missing, and what's peculiar about applied economists' workflows here? Here is a plain vanilla DataFrames.jl pipeline. @pdeffebach could probably chip in some useful bits from DataFramesMeta.jl.


using CSV
using DataFrames
using GLM
using Chain
using Statistics

gapminder() = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/gapminder/gapminder.csv"), 
                    DataFrame)

function pipeline(d::DataFrame)
    @chain d begin
        select(:country, :year, :gdpPercap, :pop)
        transform([:pop, :gdpPercap] .=> (x -> log.(x)) .=> [:logpop, :loggdpPercap])
        transform(:loggdpPercap => (x -> replace(x, missing => minimum(skipmissing(x)))) => :loggdpPercap)
        groupby(:country)
        combine([:logpop,:loggdpPercap] .=> mean .=> [:logpop,:loggdpPercap])
        lm(@formula(loggdpPercap ~ logpop), _)
    end
end

function run()
    d = gapminder()
    pipeline(d)
end

run()

StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

loggdpPercap ~ 1 + logpop

Coefficients:
───────────────────────────────────────────────────────────────────────────
                   Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────────
(Intercept)   8.21888      1.00019     8.22    <1e-12   6.24144   10.1963
logpop       -0.00381135   0.0631324  -0.06    0.9519  -0.128627   0.121005
───────────────────────────────────────────────────────────────────────────

maiaguell commented 2 months ago

yes, please!! I would love to join and thanks for the initiative!

kdpsingh commented 2 months ago

@korenmiklos, thanks for including me. I don't have the bandwidth this month to join a 2-hour Zoom mainly because I have a lot of travel coming up this month and next month.

I agree with the sentiment that it's worth figuring out the value proposition, but I think that's generally true of all new packages and shouldn't stop you from experimenting.

A great example of this is the finalfit package in R. Even though regression is baked into R, finalfit provides a unifying interface to fixed effects and mixed effects models as well as publication-ready tables. Similarly, I would think through where you have friction in your workflow and would prioritize those things for Kezdi.

pdeffebach commented 2 months ago

Glad you are trying to expand applied micro-economics uses in Julia!

I agree with the above commenters that the current data cleaning ecosystem has very good "bones". I don't think we necessarily need a new data cleaning package. Here is the pipeline you have written, expressed with DataFramesMeta.jl:

julia> df = CSV.read(download("https://vincentarelbundock.github.io/Rdatasets/csv/gapminder/gapminder.csv"), DataFrame);

julia> m = @chain df begin
           @select :country :year :gdpPercap :pop
           @rtransform begin
               :logpop = log(:pop)
               :loggdpPercap = log(:gdpPercap)
           end
           @by :country begin
               :logpop = mean(:logpop)
               :loggdpPercap = mean(:loggdpPercap)
           end
           lm(@formula(loggdpPercap ~ logpop), _)
       end;

A few differences from above

We explicitly write mean(:loggdpPercap) instead of having @collapse use the mean automatically. I view explicitly saying mean as a good thing. To avoid writing mean twice we could also do [:logpop, :loggdpPercap] .=> mean, which seems okay.
We use :x instead of x to refer to variable names. This is intentional! A major hassle in dplyr is that what x means depends on whether a column exists in a data frame or not. Being able to visually distinguish local variables and columns is a major plus. Stata doesn't have this problem, of course. But Julia is a "real" programming language, unlike Stata. Greater flexibility means more syntax is needed to alleviate confusion. Note that you can also use variable names programmatically in DataFramesMeta.jl via $.
We have two versions of macros, @rtransform and @transform. The first is for row-wise operations, and the second for column-wise. This is somewhat akin to egen vs gen in Stata.
Side note: You may be interested in the recent @label and @note macros introduced in DataFramesMeta.jl, to emulate Stata's metadata features.
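
The row-wise versus column-wise distinction can be sketched without any packages. This is only a rough illustration on a made-up vector, not DataFramesMeta.jl code: the first transformation works element by element (roughly like gen, or @rtransform), while the second needs the whole column (roughly like egen, or @transform).

```julia
using Statistics  # standard library, for mean

gdp = [100.0, 200.0, 400.0]   # made-up column

# Row-wise: each output element depends only on the same row.
log_gdp = log.(gdp)

# Column-wise: the output depends on the whole column (here, its mean).
demeaned = gdp .- mean(gdp)
```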

This is to say: Maybe the syntax is occasionally more complicated in DataFramesMeta.jl. But the differences are the result of real tradeoffs. I don't think re-writing a new data-cleaning library is worth it at the moment.

What should be done instead

However, I think new developers devoted to micro-economics are a great idea! The number one issue I would like to see addressed is the state of our statistics and regression packages. Currently, we have three main regression packages.

FixedEffectModels.jl, maintained by @matthieugomez . Matthieu is a busy AP at Columbia and I don't think he has time to maintain this complicated package in a way to make it as robust as, say, fixest in R.
Econometrics.jl, maintained by @Nosferican, who is working at the Fed and is likewise very busy
GLM.jl, which aims to be a pretty minimal package, emulating Base R's glm and lm without the features economists need.

Additionally, the HypothesisTests.jl and StatsBase.jl packages don't really have maintainers anymore. @andreasnoack occasionally reviews PRs but there are still bugs to be fixed. Additionally, StatsBase.jl could use much more support accommodating missing values and PRs would (probably) be welcomed. (Missing values are contentious for a variety of reasons, and probably not a good place for a newcomer to start working on PRs).

As for other estimation tools

There is nascent work on GMM with GMMTools.jl, again run by an AP at an R1 university. I think Gabriel would appreciate a summer intern who can contribute to this estimation.
We don't have a good DiD package similar to did in R.
@nilshg has a SynthControl.jl package for Synthetic controls. Maybe someone can help with the development of that package.
We have no package for marginal effects, certainly not as good as the excellent marginaleffects package in R.

So my take is that if a micro-economist wants to use Julia, data cleaning is not the issue. There are some data cleaning things that would make life easier. Two things I think about are (1) Better missing values support (which is kind of in my court, I have some PRs that can improve things) and (2) Better data viewing. Stata's viewer is excellent, but it shouldn't be impossible to write a QT-based data viewer.

These are marginal, however, compared to the minuscule size of the Julia statistics and econometrics ecosystem compared to R and Stata. That's where the energy needs to be.

korenmiklos commented 2 months ago

Thanks all!

My purpose with this package is as follows. Stata is doing something well. A large chunk of applied micro work is in Stata (we will have precise numbers by end of April). If we want applied microeconomists to use Julia, we need to offer something that is as easy to use as Stata and easy to switch to. Julia has great existing tools, but they don't fill this gap.

@pdeffebach: I will look more into these packages. I agree there is value to be added there. Let's get more users, bigger communities, maybe it will help build the more innovative packages, too.

korenmiklos commented 2 months ago

From @gbekes

Oh you can guess what I'll say.

Why not base it on #fixest in R. It's comprehensive. It's the basis of PyFixest? lrberge.github.io/fixest/
In terms of use cases, what could be better than github.com/gabors-data-an…
Also check out LOST lost-stats.github.io/Model_Estimati…

gbekes commented 2 months ago

Indeed. And let me tag Laurent @lrberge for R fixest and Alex @s3alfisc for his Python version PyFixest. Lemme also tag @vincentarelbundock for modelsummary

My strong view is that the fixest way -- a regression wrapper with a clear stance + flexibility is the way ahead, and the same syntax should work in all languages.

matthieugomez commented 2 months ago

I think FixedEffectModels is pretty good and flexible (should be a drop-in replacement for reghdfe). Please file issues if you encounter problems or think of important missing functionalities.


jmboehm commented 2 months ago

Thanks Miklos for tagging me.

As discussed with some of you, I feel that Julia would be able to bridge the two-language problem of economics, which is that (1) data cleaning and reduced-form work is much simpler in Stata/R; (2) structural work is much easier/faster in Python/C/Matlab/Julia. Karandeep has done terrific work on bringing R's tidyverse to Julia, but I'm not an R user so I still do my reduced-form work mostly in Stata.

My goal for starting Douglass.jl was to have a package that can take Stata code almost one-for-one and generate Julia code (DataFrames.jl/DataFramesMeta.jl). My difficulty when using Julia's tabular data packages is exactly that I have to think very carefully about how missing values propagate, and I'd prefer to have exactly the behavior of Stata (that's somehow hardwired into my brain). I haven't had as much time for free software development as I would have liked, so I haven't taken this all the way to the end and developed a polished package. Perhaps, if I get a large grant one day, I'll get a student to spend a summer expanding and polishing the package. But as always the cost is in the maintenance, not in the initial development. For that you need to get a critical mass of users and developers.

Tbh I don't see a point in developing a package that departs significantly from existing syntax or behavior: the lower the switching costs, the better. I share @pdeffebach's view that missing values are a problem (well, at least for me), and that data viewing is not as smooth as in e.g. Stata.

I'm travelling a lot for seminars at the end of the month, so not sure I can join, but keep me in the loop about the date/time.

grantmcdermott commented 2 months ago

Someone just forwarded me this link, so please excuse another uninvited house guest.

@korenmiklos it’s your package and your prerogative, but I’d just like to plus 1 all the comments about caution against reinventing the wheel here and introducing yet another syntax to Julia. DataFrames.jl and derivatives are all very well developed now. If you do want another frontend, then it makes much, much more sense to mimic the tidyverse API than Stata’s, as @kdpsingh has already done. I mean this both in terms of numbers of absolute users, econ or not, and also the clear influence that the tidyverse has had on other languages and packages from Ibis to Polars to Tidier.jl etc.

What is clearly missing from Julia’s applied micro toolset IMO has already been mentioned above: marginaleffects, table-writing (at least that is as good as modelsummary & co.), inconsistent missings treatment, hypothesis testing and vcov adjustments, etc. Moreover, quite a lot of the Julia universe still suffers from poor documentation. I wrote something to this effect on the Julia Forums back in 2021 and I still feel like this is where the biggest bang for your buck is going to be. At the same time, you have to make a value proposition for why an applied economist should learn Julia instead of say R (or even Python). And I think the sales pitch here is much harder, since you are ignoring some of Julia’s obvious advantages (which are really felt in structural and macro).

droodman commented 2 months ago

I'm curious to join. I'm in DC, but will be in California April 29-30.

I posted a working paper yesterday about using Julia as a back end for Stata and other environments, the motivating example being reghdfejl for Stata, which wraps FixedEffectModels.jl. I think the underlying julia package for Stata is getting pretty solid. I wouldn't be surprised if, say, Stata version 20 officially supports Julia.

I agree with the comments above that while confusing syntax is sometimes a barrier (contrasts=Dict(:v1=>DummyCoding()) instead of i.v1 for non-CategoricalVectors), and I appreciate the efforts to overcome that, bigger issues are poor documentation and lack of basic features like a data viewer and common regression and inference methods. It sounds like there are issues with missing. And I have seen no language for expressing hypotheses. (Stata has a standard for expressing hypotheses and constraints, especially linear ones.)

I wonder about institutional vs technical solutions. E.g., can Julia establish a documentation standard with incentives for compliance, whereby it would become normal for package developers to document all the functions and all their options in one place?! Another example of an institutional solution: Stata created the Stata Journal to reward academics for contributing software, in a currency academics appreciate, publication. It also has a mechanism for generating top-10 lists of most-downloaded packages, from SSC anyway.

I wonder if it matters more what would make developers switch to Julia than what would make regular users switch. If I'm a young econometrician with a clever new method, I'm probably going to implement in R first and maybe Stata and/or Matlab too, for obvious reasons. My new paper makes the case for Julia as a universal back end development environment. Maybe back ends are the back door to getting Julia used more as a front end too, in the long run? If the cores of useful, well-maintained packages are already in Julia, polishing the user experience becomes easier.

There's also the issue of money. IIRC NumPy and the like were solidified with public or private grant funds, on the idea that those packages are global public goods. Chan-Zuckerberg has given grants for Julia work. To get $ for stats work in Julia, one would probably need to make the case for its distinctive potential to serve users of many software platforms.

korenmiklos commented 2 months ago

@andrasvereckei

korenmiklos commented 2 months ago

I sent out invites for the meeting next Thursday, 2-4pm Budapest time. If you want to join but did not get an invite, please reply to this post.

All inputs are super useful and will be taken into account. I will also share the statistical analysis we are currently doing and the outcome of the design discussion here.

droodman commented 2 months ago

I would be interested. Thank you.


maiaguell commented 2 months ago

Hi Miklos, I did not get the invite!

Maia


pdeffebach commented 2 months ago

I did not get the invite, either, but I can try and hop on the call

pdeffebach commented 1 month ago

I apologize for the abrupt and rude entry to the meeting.

I wanted to clarify the parsing rules situation. You are correct that xxx @when yyy parses as a tuple. I apologize for being dismissive. I was wrong.

The issue I was recalling arises when expressions are in a begin block. I think Julia wants either a new line or an end, but not multiple expressions on the same line.

julia> macro rsubset(args...)
           for arg in args
               dump(arg)
           end
       end;

julia> macro when(args...)
           nothing
       end;

julia> @rsubset begin
           :y = 1 + 2 @when :z == 1
       end
ERROR: ParseError:
# Error @ REPL[23]:2:16
@rsubset begin
    :y = 1 + 2 @when :z == 1
#              └───────────┘ ── Expected `end`
Stacktrace:
 [1] top-level scope
   @ none:1

Without the begin block, things work fine:

julia> @rsubset :y = 1 + 2 @when :z == 1

DataFramesMeta.jl uses begin ... end for many transformations that get passed to the same transform call. Putting each transformation on its own line with @gen might solve this problem, but might also cause issues inside a @chain block. Having multiple transformations within a begin ... end block also helps leverage DataFrames.jl's performance features for many transformations at once (which get multithreaded). In contrast to something like Tidier.jl, DataFramesMeta.jl tries to adhere closely to the DataFrames.jl API.

I don't want to discourage you from a new data manipulation library. I think it's good for the ecosystem to have many iterations on the same overall design.

A Wald Test package or Marginal Effects package, that has a "Julian" API (interacts nicely with the StatsModels API, for example) would be of broad interest to the community. I think a focus on a Stata-like API for running regressions runs two risks.

It does not have enough features: maybe it's easy to run OLS but hard to run high-dimensional fixed effects, a Wald test, marginal effects, etc.
Features are reliant on macros or idiosyncratic API specifications, making it hard for other packages to leverage any innovations and hard to make things work programmatically and "at scale".

Again, I apologize for my rudeness this morning.

Peter

droodman commented 1 month ago

I'll just add that WildBootTests.jl can do non-bootstrapped Wald and score tests after OLS, IV, and even ML estimation. However the interface is low-level. You express a linear hypothesis, or set of hypotheses, Rb=r by passing R and r. And you tell it the model, and in the case of ML the estimation result, in a similarly low-level way rather than passing a fitted reg() result. That's OK when using it as a back end in Stata or R. I intend to make it accept fitted Julia regression results. But I can't make it accept hypotheses in a nicer way until a formula-like language for expressing them is developed.
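
To make the Rb = r notation concrete, here is a minimal sketch of a Wald statistic computed by hand with only the standard library. It does not use the WildBootTests.jl API; the coefficient vector b, its covariance matrix V, and the hypothesis itself are made-up numbers chosen for illustration.

```julia
using LinearAlgebra  # standard library, for dot

# Hypothetical estimates: coefficients b and their covariance matrix V.
b = [1.0, 2.0, 3.0]
V = [0.04 0.0  0.0;
     0.0  0.09 0.0;
     0.0  0.0  0.16]

# Hypothesis b[2] = b[3], written as R*b = r with a single row in R.
R = [0.0 1.0 -1.0]
r = [0.0]

# Wald statistic: (Rb - r)' * inv(R V R') * (Rb - r).
# Under the null it is asymptotically chi-squared with size(R, 1) degrees of freedom.
d = R * b - r
W = dot(d, (R * V * R') \ d)
```

A formula-like front end of the kind discussed here would only need to translate an expression such as the hypothesis above into R and r; the computation itself stays the same.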

korenmiklos commented 1 month ago

No worries, @pdeffebach, thanks for joining and sharing your thoughts.

The command ... if condition syntax is so ubiquitous in Stata, but also so helpful, I want to allow for this. Every Stata command begins with a reserved word, which we can use to parse @when (or @where) statements in the rest of the expression.

I will look into the other stats packages mentioned.

Tomorrow I will create+share a summary of this meeting and close this issue.

korenmiklos commented 1 month ago

Thanks @droodman, @maiaguell, @gergelyattilakiss, @floswald, @jmboehm, @pdeffebach, @andrasvereckei for the productive meeting. My summary notes, without implying that you agree with all this.

I am closing this, but we can continue the discussion under individual issues related to design.

State of the art

The Stata universe

Most applied microeconomists use Stata (70% of REStud packages, followed by Matlab 50%, and R 16%)
Often combined with another language (e.g. Python for cleaning, Matlab for simulation)
Vast majority (88%) of Stata scripts are devoted to data cleaning

The Julia universe

DataFrames.jl de facto standard for tabular data
Many grammars for data cleaning: Query, DataFramesMeta, TidierData

Broad goal:

Port Stata syntax and tools to Julia, as Tidier.jl did for the tidyverse.

Key tradeoff:

Users like convenience and sensible default choices. But explicit, verbose software is less bug prone.

Be mindful of trade-off throughout the project. Maybe the user can calibrate their level of risk tolerance.

Missing pieces in the Julia data universe

Missing values
1. Stata has common sense defaults
  1. also some quirky behavior, like . > anything
2. Risky choices, make them explicit
3. Input/output (how to read and write missing values) vs algebra (what is 4 + missing?)
4. Type conversion is a pain, cannot put a missing into a vector of Floats
Better documentation for existing packages
Maintainers, curation for existing regression packages
Wald test
1. formula language for linear constraints
2. test gender == schooling + 5
ML estimation package
1. standard errors, clustering
2. regtables
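
The "algebra" and "type conversion" points under missing values can be illustrated in plain Julia. This is a small sketch on made-up values, using only Base; it shows current Julia behavior and is not a proposal for how Kezdi should handle missings.

```julia
# Algebra: missing propagates through arithmetic and comparisons.
four_plus = 4 + missing      # missing
compared  = missing > 5      # missing, whereas Stata treats . as larger than any number

# Reductions propagate too, unless missings are skipped explicitly.
x = [1.0, missing, 3.0]
total = sum(skipmissing(x))  # 4.0

# Type conversion: a Vector{Float64} cannot hold a missing...
v = [1.0, 2.0, 3.0]
failed = try
    v[2] = missing
    false
catch err
    err isa MethodError
end

# ...so the element type has to be widened first.
w = Vector{Union{Missing, Float64}}(v)
w[2] = missing
```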

Best of Stata

replace y = 0 if y < 0
regress y x if x > 0

contrasted with much harder syntax in Pandas, R, Julia.

if can be used with almost all commands. Convenient and verbose, no trade-off here. This feature should be implemented if at all possible.
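
For comparison, here is roughly what replace y = 0 if y < 0 involves in plain Julia on a bare vector. It is a sketch on made-up data, using only Base, and the choice to treat a missing comparison as false is an assumption made for this example rather than a settled design decision.

```julia
# Made-up data with a missing value.
y = [-1.0, 2.0, missing, -3.0]

# Stata: replace y = 0 if y < 0
# The comparison yields missing for the missing row, so it is mapped to false
# explicitly before being used as a mask.
mask = coalesce.(y .< 0, false)
y[mask] .= 0.0
```

The explicit coalesce step is the kind of decision that a Stata-like @if could make by default on the user's behalf.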

Sensible default choices for missing values.

By default, operations are on variables (columns, vectors).

Opinion: variable scoping is interesting.

scalar n_users = 5
generate y = n_users + 1
replace y = . if y < n_users

BUT can lead to dangerous bugs:

scalar y = 5
generate y = y + 1

Contrasted with some existing grammars

explicitly refer to a df column, df.x, df[!, :x]
refer to symbol, like :x, :y or strings, "x" "y"
TidierData does it well, i.e., most like Stata

Explicit merge m:1 vs merge 1:1

by x: egen z = sum(1)
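
The by-group egen above attaches the group size to every row. A minimal sketch of that semantics in plain Julia, on a made-up grouping vector and without any data frame package:

```julia
# Made-up grouping variable.
x = ["a", "b", "a", "c", "a"]

# Count rows per group...
counts = Dict{String, Int}()
for key in x
    counts[key] = get(counts, key, 0) + 1
end

# ...and attach the group total to every row, as egen does.
z = [counts[key] for key in x]
```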

Value labels are different for categorical vectors.

BUT: no strings as factors
in Stata, variables don't have coding, i.gender and c.gender can be in the same regression
i. notation, changing the base, subset of categories

Not so good in Stata

quirky syntax, like egen vs collapse
no proper function returns

Code examples

using TidierData
@chain data begin
    @select command canonical_form
    @filter canonical_form == "generate"
    @group_by command
    @summarize n = n()
    @ungroup
    @arrange desc(n)
end

using Kezdi
@chain data begin
    @keep command canonical_form
    @keep @if canonical_form == "generate"
    @egen n = count(), by(canonical_form)
    @sort -n
end

replace y = . if y < 5

const n_users = 5
model_object = @chain data begin
    @replace y = 0 @if y < 0
    @regress y x @if x > n_users, vce(cluster country)
end

const n_users = 5
@chain data begin
    @replace y = 0 @if y < 0
    @aside model_object = @regress y x @if x > n_users, vce(cluster country)
    @keep @if x < 0 
end
