LOST-STATS / lost-stats.github.io

Source code for the Library of Statistical Techniques
https://lost-stats.github.io/
GNU General Public License v2.0

Update collapse_a_data_set.md #119

Closed grantmcdermott closed 3 years ago

grantmcdermott commented 3 years ago

Simplifies the intro explanation (to avoid redundancy in each implementation).

Adds a Julia implementation.

Adds data.table and collapse to the R implementation. (Also updates the dplyr example.)

Updates the Stata examples to use the same dataset. (@clibassi and @NickCH-K please check this for me. I'm pretty sure I'm using the correct syntax from memory, but my Stata installation is giving me issues so I can't check.)

NickCH-K commented 3 years ago

Works for me!

clibassi commented 3 years ago

Yep - looks good to me too. Sorry for being slow on this - just had a busy afternoon yesterday.

@grantmcdermott - I was going to add some Julia code to other pages. Do you think we should specify the versions of the packages used, since syntax seems to change somewhat quickly in the Julia ecosystem at this point?

khwilson commented 3 years ago

As a side note, I never merged it, but there's a commit that will let you test all code samples. I think it currently skips Stata b/c it requires having Stata installed, but if you'd like, I'm happy to try to get this merged. :-)

NickCH-K commented 3 years ago

@khwilson that sounds pretty cool to me! I assume there's a way to set certain chunks not to be tested?

khwilson commented 3 years ago

Yep! You can label them skip=True :)

NickCH-K commented 3 years ago

Then yeah let's do it.

I imagine some code will not run. Will the site still build? And how will we know which chunks aren't running so they can be fixed?

khwilson commented 3 years ago

Yeah, so my proposal would be to have, say, a monthly job that runs everything but the skip=Trues and then produces a report. You could also run it by hand from time to time. (Soon it might be time to talk about hackathons for LOST Stats a la Sage Days https://wiki.sagemath.org/days112. They even get funding!)

Note that there's some danger here because you're letting people run arbitrary code blocks, but since they've all been vetted by someone with merge privileges (and I generally trust all of us not to approve a virus :-) ), this should be fine.
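
For readers who want to picture the proposal, here is a minimal Python sketch of a chunk tester. It is not the commit mentioned above, whose code is not shown in this thread. It assumes the `skip=True` label sits on the opening fence line, and the names `extract_chunks` and `run_chunks` are hypothetical. Only Python chunks are executed; anything else is reported as unsupported.

```python
import re

# Build the fence marker programmatically so this sample can itself sit inside a fence.
TICKS = "`" * 3

# Assumption for this sketch: a chunk opens with e.g. "<ticks>python skip=True".
FENCE = re.compile(
    r"^" + TICKS + r"(\w+)([^\n]*)\n(.*?)^" + TICKS + r"\s*$",
    re.DOTALL | re.MULTILINE,
)


def extract_chunks(markdown):
    """Return (language, skip, code) for every fenced chunk in the text."""
    chunks = []
    for match in FENCE.finditer(markdown):
        language, info, code = match.groups()
        chunks.append((language, "skip=True" in info, code))
    return chunks


def run_chunks(markdown, language="python"):
    """Run each non-skipped chunk of one language and collect a report."""
    report = []
    for index, (lang, skip, code) in enumerate(extract_chunks(markdown)):
        if lang != language:
            status = "unsupported"
        elif skip:
            status = "skipped"
        else:
            try:
                # Executes arbitrary code: only use on reviewed, merged pages.
                exec(code, {})
                status = "ok"
            except Exception as error:
                status = f"failed: {type(error).__name__}"
        report.append((index, lang, status))
    return report
```

Because a failing chunk is recorded in the report rather than raised, a scheduled job built along these lines would not block the site build; the report is what tells you which chunks need fixing.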

NickCH-K commented 3 years ago

A LOST hackathon is a great idea!

And yeah, I'm not worried about virus code (unless one of the approved contributors has been getting sneaky!). I do expect some chunks to fail, if only for package-install reasons or because they aren't meant to be run.

clibassi commented 3 years ago

+1 to a LOST hackathon!

grantmcdermott commented 3 years ago

  1. @khwilson I reckon go for it!

  2. @clibassi Hmmm, good question RE versions. Our general approach has been "no" unless it requires a dev install/version. With specific regard to DataFrames.jl, I feel pretty comfortable that we can leave version info out following the release of v1.0 and the stable code base. Also, I'm sure you've seen, but @bkamins and co. have put together a great comparison doc across languages here. We wouldn't need to go as in-depth here at LOST; I think our focus is generally on simple applications (potentially using real-life datasets) that get people up and running as soon as possible.

  3. Another +1 to a LOST hackathon (although only once all my deadlines are met...)

bkamins commented 3 years ago

Soon we will also have a comparison to data.table (it is just being finished) - if this is something that would be useful for you.

khwilson commented 3 years ago

@grantmcdermott and @clibassi the reason I paused on the tester in the first place was the version control issues. FWIW, I've been maintaining the list of packages that LOST uses here, though it's very out of date.

While the policy of LOST has been "no specific versions," the list of "we tested against" could be there.