dcomtois / summarytools

R Package to Quickly and Neatly Summarize Data

Suggestion: identify when an ID column contains a checksum #45

Closed paulfeitsma closed 5 years ago

paulfeitsma commented 5 years ago

When a data set contains an ID column that has a checksum, this is very useful to know. For example, when bar codes are used (EAN, https://en.wikipedia.org/wiki/International_Article_Number), flagging them helps, especially when column names are not obvious.

dcomtois commented 5 years ago

Potentially feasible, but not sure where that information could be displayed... Also some implementation concerns -- false positives & slowing down the function in particular.

dcomtois commented 5 years ago

After reading about the different kinds of codes, I think it would be a good thing to add to dfSummary(). To avoid slowing down the function, we could check only a few dozen values. There is some regex and math involved; nothing really difficult, only tedious.

I put together useful links that describe validation rules. I think that the EAN and UPC are pretty straightforward. But there are others that could also be useful, although to a lesser extent. If anyone wants to check them out and come up with identification methods, please do so. I should be able to work out the EAN and UPC fairly quickly.

Wikipedia
EAN: https://en.wikipedia.org/wiki/International_Article_Number
UPC: https://en.wikipedia.org/wiki/Universal_Product_Code

Links with computational examples
UPC/EAN: https://stackoverflow.com/questions/10143547/how-do-i-validate-a-upc-or-ean-code
UPC/EAN/ISBN/ISSN/JAN: http://www.azalea.com/upctools/
UPC/EAN/ISBN/ASIN: https://github.com/ThomasPe/ProductCodeValidator/tree/master/ProductCodeValidator

And this real pearl https://www.gs1us.org/DesktopModules/Bring2mind/DMX/Download.aspx?Command=Core_Download&EntryId=729&language=en-US&PortalId=0&TabId=134
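For anyone who wants to experiment, here is a minimal sketch of the check-digit rule shared by UPC-A (12 digits) and EAN-13 (13 digits), plus a detector that looks at only the first few dozen values of a column. Both function names and the n_max argument are hypothetical and made up for illustration; this is not the code used in summarytools, and it assumes the codes are stored as character strings.

```r
# Hypothetical helper (not part of summarytools): TRUE for strings made of
# 12 digits (UPC-A) or 13 digits (EAN-13) whose last digit is a valid
# GS1 check digit, FALSE for everything else (including NA).
has_valid_check_digit <- function(x) {
  x <- as.character(x)
  ok <- !is.na(x) & grepl("^[0-9]{12,13}$", x)
  vapply(seq_along(x), function(i) {
    if (!ok[i]) return(FALSE)
    d <- as.integer(strsplit(x[i], "", fixed = TRUE)[[1]])
    n <- length(d)
    body <- rev(d[-n])                       # digits before the check digit, right to left
    w <- rep(c(3L, 1L), length.out = n - 1)  # weights 3, 1, 3, 1, ... from the right
    (10L - sum(body * w) %% 10L) %% 10L == d[n]
  }, logical(1))
}

# Hypothetical detector: test at most n_max non-missing values so that
# large columns stay cheap to scan. It takes the first values rather than
# a random sample to keep the result deterministic.
looks_like_ean_upc <- function(x, n_max = 50) {
  x <- as.character(x[!is.na(x)])
  if (length(x) == 0) return(FALSE)
  all(has_valid_check_digit(head(x, n_max)))
}

# Example codes taken from the Wikipedia articles on EAN and UPC
has_valid_check_digit(c("4006381333931", "036000291452", "4006381333932"))
looks_like_ean_upc(c("4006381333931", NA, "036000291452"))
```

On the false-positive concern: a random digit string of the right length passes this check about one time in ten, so requiring a few dozen values to all pass makes an accidental match very unlikely.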

dcomtois commented 5 years ago

I was able to include the functionality for EAN and UPC codes. Feedback is welcome if anyone has time to try it out. I had a hard time finding example codes so I haven't done extensive testing.

It's only in the development version, so you'll need to:

devtools::install_github('dcomtois/summarytools', ref='dev-current')

paulfeitsma commented 5 years ago

you made a typo:

devtools::install_github('dcomtois/summarytools', ref='dev-current')

paulfeitsma commented 5 years ago

I tested it using the file https://www.stedin.net/zakelijk/~/media/files/stedin/open-data/stedin_kleinverbruiksgegevens_01012018.zip in which the first column contains an EAN13. I used the following code to read the dataset and make a data frame summary.

library(data.table)
library(summarytools)
# Read columns detected as 64-bit integers (long numeric IDs) as character
options(datatable.integer64="character")
dt_verbruik <- fread("20180129_OpenData_KV_Verbruiksdata_2018.csv")
view(dfSummary(dt_verbruik))

This gives the following error: "Error in paste0(counts_props, collapse = "\\n") : object 'counts_props' not found".

dcomtois commented 5 years ago

It's fixed. Works fine now. Thanks again!

paulfeitsma commented 5 years ago

I tested this feature by making a data frame summary of the file mentioned above with both the dev-current version (December 17th) and the CRAN version. You can see the results below.

[Image: summarytools_checksum]

The CRAN version is at the bottom and the dev-current version is on top. The good news is that the EAN-13 network operator code was successfully detected. Unfortunately, the stats/value column displays the min/max/mode summary, and the Freqs column shows a long list of frequencies, which is not very readable imho.

Therefore my suggestion would be to leave the stats/value and freqs columns as they are in the current CRAN version and to mention in the variable column that an EAN-13 was detected.

dcomtois commented 5 years ago

Makes sense!

raheems commented 5 years ago

Is the issue with showing histograms in rmarkdown resolved yet? I've just tried with the dev-current version and I do not notice any change. Thanks for the great work.

dcomtois commented 5 years ago

Hi Raheems. I'm working on something. It's a hard one to solve, partly because of the way rmarkdown treats leading white space, so I'm not sure it will be part of the 0.9.0 release. But I've made some progress. I'll soon create a branch dedicated to that problem. I'll keep you posted!

dcomtois commented 5 years ago

@raheems In the meantime, have you tried the settings described in the vignette "Recommendations for Using summarytools With Rmarkdown"?

raheems commented 5 years ago

> @raheems In the meantime, have you tried the settings described in the vignette Recommendations for Using summarytools With Rmarkdown?

Yes, thanks.

dcomtois commented 5 years ago

@paulfeitsma I made the necessary changes. I haven't tested them extensively, but so far it looks better. @raheems I've added the "col.widths" parameter to dfSummary(). See issue #57. I'm not sure that is sufficient, but I think it's a step in the right direction.

Thx