Closed paulfeitsma closed 5 years ago
Potentially feasible, but not sure where that information could be displayed... Also some implementation concerns -- false positives & slowing down the function in particular.
After reading about the different kinds of codes, I think it would be a good thing to add to dfSummary(). To avoid slowing down the function, we could check only a few dozen values. There is some regex and math involved; nothing really difficult, only tedious.
I put together useful links that describe validation rules. I think that the EAN and UPC are pretty straightforward. But there are others that could also be useful, although to a lesser extent. If anyone wants to check them out and come up with identification methods, please do so. I should be able to work out the EAN and UPC fairly quickly.
Wikipedia:
EAN: https://en.wikipedia.org/wiki/International_Article_Number
UPC: https://en.wikipedia.org/wiki/Universal_Product_Code
Links with computational examples:
UPC/EAN: https://stackoverflow.com/questions/10143547/how-do-i-validate-a-upc-or-ean-code
UPC/EAN/ISBN/ISSN/JAN: http://www.azalea.com/upctools/
UPC/EAN/ISBN/ASIN: https://github.com/ThomasPe/ProductCodeValidator/tree/master/ProductCodeValidator
And this real pearl https://www.gs1us.org/DesktopModules/Bring2mind/DMX/Download.aspx?Command=Core_Download&EntryId=729&language=en-US&PortalId=0&TabId=134
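For anyone who wants to experiment, the check-digit rule described in the links above can be sketched in a few lines of R. This is only a rough illustration, not the code used in summarytools: the function names is_valid_gtin() and looks_like_gtin() and the parameters n_check and min_share are made up for this example. It covers 12-digit UPC-A and 13-digit EAN-13 codes only, and it assumes the codes are stored as character strings so that leading zeros are preserved.

```r
# Hypothetical helper (not part of summarytools): TRUE where a value is a
# 12-digit (UPC-A) or 13-digit (EAN-13) string with a valid check digit.
is_valid_gtin <- function(x) {
  x <- as.character(x)
  # Regex pre-filter: digits only, length 12 or 13
  ok <- !is.na(x) & grepl("^[0-9]{12,13}$", x)
  ok[ok] <- vapply(strsplit(x[ok], ""), function(ch) {
    d <- rev(as.integer(ch))                     # check digit comes first
    w <- rep(c(1L, 3L), length.out = length(d))  # weights 1, 3, 1, 3, ... from the right
    sum(d * w) %% 10L == 0L                      # weighted sum must be a multiple of 10
  }, logical(1))
  ok
}

# Hypothetical heuristic: validate only a small random sample of the column
# so that large data frames are not slowed down.
looks_like_gtin <- function(x, n_check = 30, min_share = 1) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(FALSE)
  s <- if (length(x) > n_check) sample(x, n_check) else x
  mean(is_valid_gtin(s)) >= min_share
}

codes <- c("4006381333931", "036000291452", "4006381333932")
is_valid_gtin(codes)  # TRUE TRUE FALSE
```

A random digit string of the right length passes the check about one time in ten, so requiring every value in a sample of a few dozen to pass should keep false positives rare while touching only a small part of the column.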
I was able to include the functionality for EAN and UPC codes. Feedback is welcome if anyone has time to try it out. I had a hard time finding example codes so I haven't done extensive testing.
It's only in the development version, so you'll need to:
devtools::install_github('dcomtois/summarytools', ref='dev-current')
you made a typo:
devtools::install_github('dcomtois/summarytools', ref='dev-current')
I tested it using the file https://www.stedin.net/zakelijk/~/media/files/stedin/open-data/stedin_kleinverbruiksgegevens_01012018.zip in which the first column contains an EAN13. I used the following code to read the dataset and make a data frame summary.
library(data.table)
library(summarytools)
# Read 64-bit integers (such as the EAN-13 codes) as character
options(datatable.integer64 = "character")
dt_verbruik <- fread("20180129_OpenData_KV_Verbruiksdata_2018.csv")
view(dfSummary(dt_verbruik))
This gives the following error: "Error in paste0(counts_props, collapse = "\\n") : object 'counts_props' not found".
It's fixed. Works fine now. Thanks again!
I tested this feature by making a data frame summary of the file mentioned above with both the dev-current version (December 17th) and the CRAN version. You can see the results below.
The dev-current version is on top and the CRAN version is below it. The good news is that the EAN-13 network operator code was successfully detected. Unfortunately, the stats/value column displays the min/max/mode summary, and the Freqs column shows a long list of frequencies, which is not very readable imho.
Therefore my suggestion would be to leave the stats/value and Freqs columns as they are in the current CRAN version and to mention in the variable column that an EAN-13 was detected.
Makes sense!
Is the issue with showing histograms in rmarkdown resolved yet? I've just tried with the dev-current version and I do not notice any change. Thanks for the great work.
Hi Raheems. I'm working on something. It's a hard one to solve, so I'm not sure it will be part of the 0.9.0 release, partly because of the way rmarkdown treats leading white space. But I've made some progress. I'll soon create a branch dedicated to that problem. I'll keep you posted!
@raheems In the meantime, have you tried the settings described in the vignette Recommendations for Using summarytools With Rmarkdown?
Yes, thanks.
@paulfeitsma I made the necessary changes. I haven't tested them extensively, but it looks better so far. @raheems I've added the "col.widths" parameter to dfSummary(). See issue #57. Not sure if that is sufficient, but I think it's a step in the right direction.
Thx
When a data set contains an ID that has a checksum, this is very useful to know, especially when column names are not obvious. Bar codes (EAN, https://en.wikipedia.org/wiki/International_Article_Number) are one example.