2024 Data Update - GitHub issues

cdalzell commented 4 months ago

This year's Lahman data update has been released which means it's time to update this package accordingly.

After an initial look at the CSV data files it looks like we may have some import changes or cleanup to do:

People & Parks Data Encoding: It looks like a couple dozen player names and a couple of park names have UTF-8 "replacement characters" in them (EF BF BD).
Column Name Changes & Additions. Here's a quick list of some of the things I've noticed:
- Some CSV files now have an additional ID field
- Some columns have case mismatches (yearid vs yearID)
- Some columns have formatting changes (parkkey vs park.key)
- Some columns are appearing in places they weren't before (G_batting is now in Batting and Appearances instead of just Appearances)
- Some columns are kind of a mystery to me (Not quite sure what G_old in Batting is)
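
To size up the encoding problem before fixing anything by hand, the affected cells can be listed automatically. The following is a minimal sketch in Python, used here for illustration only and not part of this package's import code. The function name `find_replacement_chars` and the sample rows are made up; the only fact taken from the thread is that the damaged values contain U+FFFD, whose UTF-8 encoding is EF BF BD.

```python
import csv
import io

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER; UTF-8 bytes EF BF BD


def find_replacement_chars(csv_text):
    """Return (row_number, column_name, value) for each cell containing U+FFFD.

    Hypothetical helper for illustration; row numbers count the header as row 1.
    """
    hits = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            # value is None when a row has fewer fields than the header
            if isinstance(value, str) and REPLACEMENT in value:
                hits.append((row_number, column, value))
    return hits


# Made-up sample data, not taken from the real People file.
sample = "playerID,nameLast\nabcde01,Pe\ufffda\nfghij01,Smith\n"
hits = find_replacement_chars(sample)
for row_number, column, value in hits:
    print(row_number, column, repr(value))
```

For a real file, the text could be read with `open(path, encoding="utf-8", newline="")` and passed in the same way.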

I could probably fix the data encoding by hand, given that there appear to be only a couple dozen affected values. The bigger question is what to do about the column names, since keeping them as-is would likely be a breaking change for anything using an older version of this package.
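
One way to get a complete picture of the column drift is to compare the header of each new CSV against the column names from the previous release. The sketch below is illustrative Python, not this package's code. `normalize` and `diff_columns` are hypothetical names, the column lists are made up, and which spelling is the old one is an assumption for the sake of the example.

```python
import re


def normalize(name):
    """Lowercase and drop non-alphanumerics, so yearID/yearid and
    park.key/parkkey compare as equal."""
    return re.sub(r"[^0-9a-z]", "", name.lower())


def diff_columns(old_columns, new_columns):
    """Classify new columns as respelled or added, and list removed old ones.

    Returns (renamed, added, removed), where renamed maps new spelling to
    the old spelling it corresponds to.
    """
    old_by_key = {normalize(c): c for c in old_columns}
    renamed, added, seen = {}, [], set()
    for col in new_columns:
        key = normalize(col)
        if key in old_by_key:
            seen.add(key)
            if old_by_key[key] != col:
                renamed[col] = old_by_key[key]
        else:
            added.append(col)
    removed = [c for k, c in old_by_key.items() if k not in seen]
    return renamed, added, removed


# Illustrative headers only; assumes "yearID" is the older spelling.
old = ["playerID", "yearID", "G"]
new = ["ID", "playerID", "yearid", "G", "G_old"]
renamed, added, removed = diff_columns(old, new)
print(renamed, added, removed)
```

The `renamed` result separates spelling-only differences from genuinely new columns, which are the two cases that need different decisions.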

I'm also wondering if maybe these are export artifacts, so my next step is to load the SQL version (looks like it's a SQL Server backup file based on the name) and see what there is to see. This might be something that can be solved with a couple of export tweaks.

skamanrev commented 4 months ago

For what it's worth, I found a reference to G_old at the link below. It says it's a deprecated version of G in the Batting table.

This may be a red herring, but I thought I'd let you know.

Cheers

MarkE

https://lahman.r-forge.r-project.org/doc/Batting.html


cdalzell commented 4 months ago

Looking at the SQL Server version:

People & Parks Data Encoding: Looks like they're correct in the DB
Column Name Changes & Additions: Looks like these match what's in the CSV files

Other notes:

The ID columns are only in some tables and don't look to be listed in the readme file
G_batting in Batting only has values for 2023 data. G_old appears to be entirely null, and neither is in the readme file
The name changes are also in the DB
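
Observations like "only has values for 2023" and "entirely null" can be confirmed by counting non-empty cells per column. This is a rough Python sketch rather than anything from the package. `column_fill_counts` is a hypothetical helper, the rows are made up, and treating "NULL" and "NA" as empty is an assumption about how the export writes missing values.

```python
import csv
import io

# Assumption: these spellings all mean "missing" in the exported CSV.
EMPTY_VALUES = {None, "", "NULL", "NA"}


def column_fill_counts(csv_text):
    """Return {column_name: number_of_non_empty_cells} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            if row.get(name) not in EMPTY_VALUES:
                counts[name] += 1
    return counts


# Made-up rows shaped like the situation described above.
sample = (
    "playerID,yearID,G,G_batting,G_old\n"
    "abcde01,2022,10,,\n"
    "abcde01,2023,12,12,\n"
)
counts = column_fill_counts(sample)
empty_columns = [name for name, n in counts.items() if n == 0]
print(counts, empty_columns)
```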

It might be OK not to import the ID, G_batting, etc. columns. I'm not sure what to do about the name changes, though. I'm inclined to map them back to what they were in previous versions, but that might not be the right course of action here.
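
If mapping back turns out to be the right call, the import step could apply an explicit rename map and a drop list, which keeps the decisions visible in one place. The sketch below is illustrative Python under stated assumptions, not the approach this package actually takes. `restore_legacy_columns` is a hypothetical name, the row is made up, and the map assumes "yearid" is the new spelling of "yearID".

```python
def restore_legacy_columns(rows, rename_map, drop):
    """Rename keys per rename_map and omit any key listed in drop.

    rows is a list of dicts (for example from csv.DictReader); a new list
    is returned and the input rows are left unmodified.
    """
    drop = set(drop)
    return [
        {rename_map.get(key, key): value
         for key, value in row.items() if key not in drop}
        for row in rows
    ]


# Made-up row; the rename direction is an assumption.
rows = [{"ID": "1", "playerID": "abcde01", "yearid": "2023",
         "G": "12", "G_batting": "12", "G_old": ""}]
cleaned = restore_legacy_columns(
    rows,
    rename_map={"yearid": "yearID"},
    drop=["ID", "G_batting", "G_old"],
)
print(cleaned)
```

Keeping the map explicit, rather than normalizing names automatically, means a column that changes again in a later release shows up as an unmapped name instead of being silently accepted.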

cdalzell commented 4 months ago

@skamanrev Interesting! Thanks for finding that, looks like it's probably OK to remove that one then.

skamanrev commented 4 weeks ago

Hi, just curious: will the package be updated to include the 2023 data? Is any help required (testing, perhaps)?

Cheers Mark E

cdalzell commented 4 weeks ago

Yes, there will be an update with the 2023 data and as luck would have it I was planning on finally doing so this weekend. Possibly even before if a couple of breaks go my way.

Sorry for the delay. There are a few reasons this happened, but it's primarily due to me having an unusually busy spring & summer.

cdalzell commented 3 weeks ago

At long last, the encoding and schema drift issues should be fixed.

Two NOTES from winbuilder:

  URL: https://rdatasciencecases.org/ (moved to https://facts.net/science/technology/15-facts-about-data-science/)
    From: inst/doc/payroll.html
    Status: 301
    Message: Moved Permanently

Package has 'vignettes' subdirectory but apparently no vignettes.
Perhaps the 'VignetteBuilder' information is missing from the
DESCRIPTION file?

Locally I'm still getting the usual file size NOTE, but that didn't seem to be an issue last year so hopefully that's still the case.

cdalzell / Lahman

2024 Data Update #70