JuliaHEP / JuliaHEP-2023

Materials for the JuliaHEP 2023 Workshop
https://juliahep.github.io/JuliaHEP-2023/
Creative Commons Attribution 4.0 International

Add DataFrames tutorial #19

Closed: graeme-a-stewart closed this 10 months ago

graeme-a-stewart commented 10 months ago

First version of a data frames tutorial, covering:
- Loading, selecting, modifying and deriving data
- How to handle "missing" values
- Some pretty plotting to finish off, to show visualisation

Use a reduced Higgs dataset with 50k events (as the full dataset is very large and not needed)
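
For readers skimming this thread, the kind of operations listed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the tutorial's actual code: the table is built inline rather than loaded from the reduced Higgs file, and the column names (`event_id`, `pt`, `eta`, `high_pt`) are hypothetical.

```julia
using DataFrames

# Hypothetical toy table; the column names are illustrative and are NOT
# taken from the Higgs dataset used in the tutorial.
df = DataFrame(event_id = 1:5,
               pt  = [45.2, missing, 61.0, 38.7, 52.3],
               eta = [0.1, -1.2, 2.0, missing, 0.4])

# Selecting: keep only some of the columns
kin = select(df, :pt, :eta)

# Handling "missing" values, option 1: drop rows where pt is missing
clean = dropmissing(df, :pt)

# Deriving: add a new column computed row by row from an existing one
transform!(clean, :pt => ByRow(x -> x > 50) => :high_pt)

# Handling "missing" values, option 2: substitute a default value instead
filled = coalesce.(df.eta, 0.0)
```

Dropping rows and substituting defaults are two different choices; which one is appropriate depends on what a missing entry means in the dataset at hand.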

github-actions[bot] commented 10 months ago

PR Preview Action v1.4.4
:rocket: Deployed preview to https://JuliaHEP.github.io/JuliaHEP-2023/pr-preview/pr-19/ on branch gh-pages at 2023-11-03 07:52 UTC

Moelf commented 10 months ago

at which tutorial will this part actually happen? Monday?

graeme-a-stewart commented 10 months ago

> at which tutorial will this part actually happen? Monday?

Yep. @aoanla and I decided that we will cover the Julia basics in the first half, then cover a few other things in the second part - at least Plots and DataFrames.

Moelf commented 10 months ago

you might want to introduce FHist and more advanced histogramming: https://moelf.github.io/FHist.jl/dev/notebooks/makie_plotting/

aoanla commented 10 months ago

> you might want to introduce FHist and more advanced histogramming: https://moelf.github.io/FHist.jl/dev/notebooks/makie_plotting/

We decided to introduce Plots.jl, not Makie (because @graeme-a-stewart believes that Plots is easier to get set up with and produces "better plots by default") - although I do mention Makie at the end of the Plots overview. Do I gather that this means you're a vote for Makie instead?

Moelf commented 10 months ago

Plots.jl is unlikely to produce publication-ready plots when you have realistic HEP histograms, see issues in Plots.jl: 2007, 2445, 4709, 4206.

Especially when logscale and histograms are involved. Besides, you can't get stacked histograms or hatched lines in Plots.jl.

I would say it's fine to demonstrate Plots.histogram(), but in a realistic HEP application you're more likely to need an FHist.jl histogram object first, instead of plotting directly from a data vector.
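
To make the "histogram object first" idea concrete, here is a minimal sketch. Note that it swaps in StatsBase.jl's `Histogram` as a stand-in rather than FHist.jl (whose API is not shown in this thread), and the bin edges and data values are made up for illustration. The point is only that the binned object is filled, possibly chunk by chunk, before any plotting happens.

```julia
using StatsBase

# Assumed, illustrative binning: 8 bins of width 25 from 0 to 200
edges = 0.0:25.0:200.0

# Two made-up chunks of data, standing in for data read in batches
chunk1 = [12.0, 30.5, 47.1, 110.0]
chunk2 = [26.0, 90.2, 130.7]

# Fill one histogram object per chunk, using the same bin edges
h1 = fit(Histogram, chunk1, edges)
h2 = fit(Histogram, chunk2, edges)

# Combine them; the result holds only bin counts, not the raw data
h = merge(h1, h2)

# The counts and edges are then available for plotting or further analysis
counts = h.weights
binedges = h.edges[1]
```

Because only the bin contents are kept, the memory needed does not grow with the number of events, which is the usual motivation for this pattern.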

aoanla commented 10 months ago

So, there's definitely not enough time to cover both Plots.jl's feature set and Makie.jl, so we need to pick one. I'm perfectly happy to write an intro to either [and mention the other at the end]. I also mildly prefer Makie for the extra statistical plots and features [and I like the Observables system for live plot updates], but Plots.jl + StatsPlots is what Graeme's using here because we already discussed this once, over in issue #16. Since the plotting stuff leads into @graeme-a-stewart's DataFrames stuff here at the end, we really do all need to agree on this before we make people change things.

graeme-a-stewart commented 10 months ago

Yeah, Makie.jl is nice, but this is a short introduction, so I would settle for using Plots.jl (even directly on the vectors) and giving people pointers to more advanced techniques.

graeme-a-stewart commented 10 months ago

On reflection, I realise that my experience with Makie is a bit limited and I may have been put off by a bad first encounter with it (I'm even hazy on what was not good). If you both feel that it's the better option, then I don't mind trying to use it.

Does it have as good interfacing to DataFrames though? i.e., the @df macro.

Moelf commented 10 months ago

the @df macro is really limited to in-memory data sizes; I don't think it can be used in any serious capacity

aoanla commented 10 months ago

The usual Makie way to deal with DataFrames is either via a recipe (a bit advanced) or via AlgebraOfGraphics.jl (which is a bit better, but a whole tutorial in itself).

On Thu, 2 Nov 2023, 18:21 Jerry Ling, @.***> wrote:

the df macro is really hacky I don't think it can be used in any serious capacity

graeme-a-stewart commented 10 months ago

OK, this starts to look too complicated to fit into the short time that we have, so I won't change what's there now.

@aoanla, as my plotting part is so self-contained, if you wanted to orient your material towards Makie, I think it would be ok. I assume we have a fairly capable audience ;-)

@Moelf, in-memory size is rather a problem for all DataFrames implementations that I know!

aoanla commented 10 months ago

Ok, I will make something work.

On Fri, 3 Nov 2023, 07:27 Graeme A Stewart, @.***> wrote:

OK, this starts to look too complicated to fit into the short time that we have, so I won't change what's there now.

@aoanla https://github.com/aoanla, as my plotting part is so self-contained, if you wanted to orient your material towards Makie, I think it would be ok. I assume we have a fairly capable audience ;-)

@Moelf https://github.com/Moelf, in-memory size is rather a problem for all DataFrames implementations that I know!

Moelf commented 10 months ago

https://indico.cern.ch/event/1292759/contributions/5618594/

I just realized there's a long talk on this haha, so yeah this is gonna be useful

aoanla commented 10 months ago

Are we okay to merge this? (I am happy with what we have here.)