gabors-data-analysis / da_case_studies

Codes for case studies for the Bekes-Kezdi Data Analysis textbook
MIT License

Data exercise missing data - billion prices project pg 166 #108

Open rdisalv2 opened 1 year ago

rdisalv2 commented 1 year ago

The data exercise on page 166 using the Billion Prices Project data (question 3) asks to restrict the data to prices that were assessed on the same day. But the dataset used in the case study doesn't seem to have a variable for that, or any variable from which one could be constructed:

Contains data from online_offline_ALL_clean.dta
  obs:        45,253                          
 vars:            21                          26 Aug 2016 17:24
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
COUNTRY         str12   %12s                  
retailer        float   %9.0g                 group(country retailer)
retailer_s      str14   %14s                  
date            float   %td                   
day             byte    %9.0g                 
month           byte    %9.0g                 
year            int     %9.0g                 
id              str61   %61s                  BARCODE
price           double  %10.0g                PRICE
price_online    double  %10.0g                
imputed         byte    %9.0g                 
DEVICEID        str16   %16s                  DEVICE ID
TIME            str5    %9s                   TIME
ZIPCODE         str21   %21s                  ZIP CODE
PHOTO           str19   %19s                  PHOTO
OTHERSKUITEM    str38   %38s                  OTHER SKU/ITEM#
COMMENTS        str168  %168s                 COMMENTS
PRICETYPE       str21   %21s                  PRICE TYPE
CODE            str6    %9s                   CODE
sale_online     byte    %12.0g                
country_s       str12   %12s                  

It's a good question otherwise, and I'd love to use it.

gbekes commented 1 year ago

Hi, thanks. Well, it's been a while, but I guess the idea is to find products where the date (day, month and year) is the same and look at those instances only.

rdisalv2 commented 1 year ago

Thanks. Oh I see, date and (day, month, year) are two different dates. I thought they were the same date. (They differ from each other in 1.3% of cases.) The codebook from the dataverse materials for the paper says that they're the same, though:

date            float   %td                   Date for offline data collection, in stata format
day             byte    %9.0g                 Day for offline data collection
month           byte    %9.0g                 Month for offline data collection
year            int     %9.0g                 Year for offline data collection
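
A minimal Stata sketch of the comparison described above, in case it helps others reproduce it. It assumes the .dta file sits in the working directory; `date_from_parts` is a hypothetical helper name, not a variable in the dataset.

```stata
* Sketch only: assumes online_offline_ALL_clean.dta is in the working directory
use "online_offline_ALL_clean.dta", clear

* Rebuild a Stata date from the day, month and year columns
* (date_from_parts is a made-up name for illustration)
generate date_from_parts = mdy(month, day, year)
format date_from_parts %td

* Count the rows where the stored date disagrees with the rebuilt one
count if date != date_from_parts

* Optionally keep only the rows where the two dates agree
keep if date == date_from_parts
```

Note that Stata treats two missing values as equal, so rows where both dates are missing would survive the final `keep`.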

BUT the codebook from the dataverse also has this:

imputed         byte    %9.0g                 =0 if the online price was collected on the exact same day (otherwise it was collected within 7 days)

which seems promising, but its tab is weird:

. tab imputed, m

    imputed |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |     22,414       49.53       49.53
          . |     22,839       50.47      100.00
------------+-----------------------------------
      Total |     45,253      100.00

rdisalv2 commented 1 year ago

I just checked the xlsx file from the dataverse replication. The . has to be 0, because that column is just a 1 or a blank in the xlsx. So I think keep if missing(imputed) would be the way to keep only same-day observations.
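
For anyone following along, here is a hedged sketch of that filter. It relies on the assumption made above (a missing `imputed` stands for the codebook's 0, i.e. same-day collection), which comes from inspecting the xlsx rather than from the codebook itself, and it assumes the .dta file is in the working directory.

```stata
* Sketch only: assumes online_offline_ALL_clean.dta is in the working directory
use "online_offline_ALL_clean.dta", clear

* Assumption: missing imputed means the online price was collected on the same day,
* so recode missing to 0 to match the codebook definition
replace imputed = 0 if missing(imputed)

* Check the recoded distribution (should now show only 0 and 1)
tabulate imputed, missing

* Keep only the same-day observations
keep if imputed == 0
```

Recoding first and then filtering on `imputed == 0` gives the same rows as `keep if missing(imputed)` on the raw data, but leaves a variable that matches the codebook definition.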