Open cdalzell opened 4 months ago
For what it’s worth, I found a reference to G_old at the following link. It says it’s a deprecated version of G in the Batting table.
This may be a red herring, but I thought I'd let you know.
Cheers
MarkE
https://lahman.r-forge.r-project.org/doc/Batting.html
On Mon, May 6, 2024 at 17:20 Chris Dalzell @.***> wrote:
This year's Lahman data update has been released which means it's time to update this package accordingly.
After an initial look at the CSV data files it looks like we may have some import changes or cleanup to do:
- People & Parks Data Encoding: It looks like a couple dozen player names and a couple of park names have UTF-8 "replacement characters" (EF BF BD) in them.
- Column Name Changes & Additions. Here's a quick list of some of the things I've noticed:
  - Some CSV files now have an additional ID field
  - Some columns have case mismatches (yearid vs yearID)
  - Some columns have formatting changes (parkkey vs park.key)
  - Some columns are appearing in places they weren't before (G_batting is now in Batting and Appearances instead of just Appearances)
  - Some columns are kind of a mystery to me (not quite sure what G_old in Batting is)
I could probably fix the data encoding by hand, since there appear to be only a couple dozen affected names. The bigger question is what to do about the column names: keeping them as is would likely be a breaking change for anything that uses an older version of this package.
I'm also wondering if maybe these are export artifacts, so my next step is to load the SQL version (looks like it's a SQL Server backup file based on the name) and see what there is to see. This might be something that can be solved with a couple of export tweaks.
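For anyone who wants to locate the affected names before fixing them by hand, here is a minimal sketch in Python (not the package's own import code). It assumes the CSV has already been decoded as UTF-8, so each damaged character shows up as U+FFFD, which encodes to the bytes EF BF BD. The function name and the sample rows are hypothetical.

```python
import csv
import io

# U+FFFD is the Unicode replacement character; in UTF-8 it is the bytes EF BF BD.
REPLACEMENT_CHAR = "\ufffd"


def find_replacement_chars(csv_text):
    """Return (line_number, column_name, value) for each cell containing U+FFFD.

    Line numbers assume one record per line, with the header on line 1.
    """
    hits = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_number, row in enumerate(reader, start=2):
        for column, value in row.items():
            if value and REPLACEMENT_CHAR in value:
                hits.append((line_number, column, value))
    return hits


# Hypothetical sample standing in for a few rows of People.csv.
sample = "playerID,nameLast\nexample01,Mart\ufffdnez\nexample02,Smith\n"

for hit in find_replacement_chars(sample):
    print(hit)
```

Running this over each CSV would give a short list of rows to compare against the SQL version.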
Looking at the SQL Server version:
- People & Parks Data Encoding: Looks like they're correct in the DB
Other notes:
- ID columns are only in some tables and don't look to be listed in the readme file
- G_batting in Batting only has values for 2023 data. G_old appears to be entirely null and neither are in the readme file
Might be OK to not import those ID, G_batting, etc columns. Not sure what to do about the name changes though, I'm inclined to map them back to what they were in previous versions but that might not be the right course of action here.
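A rough sketch of the "map them back" option, again in Python rather than the package's import code. It assumes the new files use yearid and parkkey and that earlier versions used yearID and park.key; the thread doesn't spell out which spelling is the new one, so flip the mapping if that's wrong. The helper names and the sample rows are hypothetical, and the drop list is passed per table because G_batting still belongs in Appearances.

```python
# Assumed direction: new CSV spelling -> spelling used by earlier versions.
LEGACY_NAMES = {"yearid": "yearID", "parkkey": "park.key"}

# Columns proposed for removal from Batting only.
BATTING_DROP = {"ID", "G_batting", "G_old"}


def column_is_empty(rows, column):
    """True if every data row has an empty value in the named column."""
    header, *data = rows
    index = header.index(column)
    return all(row[index] in ("", None) for row in data)


def normalize_rows(rows, rename, drop):
    """Drop unwanted columns and rename the rest to their legacy names."""
    header, *data = rows
    kept = [(i, rename.get(name, name))
            for i, name in enumerate(header) if name not in drop]
    result = [[name for _, name in kept]]
    for row in data:
        result.append([row[i] for i, _ in kept])
    return result


# Hypothetical rows standing in for the new Batting.csv.
batting = [
    ["ID", "playerID", "yearid", "G", "G_batting", "G_old"],
    ["1", "example01", "2023", "10", "10", ""],
]

print(column_is_empty(batting, "G_old"))
print(normalize_rows(batting, LEGACY_NAMES, BATTING_DROP))
```

Checking that a column is empty before dropping it keeps the cleanup from silently discarding data if a later release starts populating it.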
@skamanrev Interesting! Thanks for finding that, looks like it's probably OK to remove that one then.
Hi just curious. Will the package be updated to include the 2023 data? Is any help required (testing perhaps)?
Cheers Mark E
Yes, there will be an update with the 2023 data and as luck would have it I was planning on finally doing so this weekend. Possibly even before if a couple of breaks go my way.
Sorry for the delay. There are a few reasons this happened, but it's primarily due to me having an unusually busy spring & summer.
At long last, the encoding and schema drift issues should be fixed.
Two NOTES from winbuilder:
URL: https://rdatasciencecases.org/ (moved to https://facts.net/science/technology/15-facts-about-data-science/)
From: inst/doc/payroll.html
Status: 301
Message: Moved Permanently
Package has 'vignettes' subdirectory but apparently no vignettes.
Perhaps the 'VignetteBuilder' information is missing from the
DESCRIPTION file?
Locally I'm still getting the usual file size NOTE, but that didn't seem to be an issue last year so hopefully that's still the case.