UBC-MDS / DSCI_522_Group-308_Used-Cars

This project attempts to build a regression model to predict price of used cars based on numerous features of the car
MIT License

Data download script reproducibility #15

Closed ksedivyhaley closed 4 years ago

ksedivyhaley commented 4 years ago

Unable to run download script on my machine: I get your error “Cached data file hash is invalid.” (Good use of progress/error messages, though!)

Note the instruction in Milestone 1: “Also, to make things simple, I would avoid using data from sites where you have to authenticate to obtain the data (e.g., Kaggle). If that cannot be avoided, discuss with the lecture and lab instructor how you can do this reproducibly.”

Possibly related: I'm not seeing the data file in your data folder in the repo. Too big for GitHub?

pokrovskyy commented 4 years ago

Hello Kate,

Thanks for your feedback. In the version you tried, the script only succeeded when the default arguments were passed explicitly. We have since improved it to run without parameters, in which case it falls back on the defaults.
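For illustration, falling back on defaults when no arguments are given could look like the sketch below. It uses Python's standard `argparse`; the flag names and the `parse_args` helper are hypothetical and not necessarily what the project's script uses. The default values are the constants quoted in this thread.

```python
import argparse

# Default values as quoted in this thread.
DATA_FILE_PATH = '../data/vehicles.csv'
DATA_FILE_HASH = '06e7bd341eebef8e77b088d2d3c54585'
DATA_FILE_URL = 'http://mds.dev.synnergia.com/uploads/vehicles.csv'


def parse_args(argv=None):
    """Parse command-line options; every option is optional and has a default.

    The flag names below are hypothetical, chosen only for this sketch.
    """
    parser = argparse.ArgumentParser(
        description="Download the used-cars data file."
    )
    parser.add_argument("--data-file-path", default=DATA_FILE_PATH,
                        help="where to save the downloaded file")
    parser.add_argument("--data-file-url", default=DATA_FILE_URL,
                        help="URL to download the data file from")
    parser.add_argument("--data-file-hash", default=DATA_FILE_HASH,
                        help="expected MD5 hash of the data file")
    return parser.parse_args(argv)
```

Calling `parse_args([])` returns the defaults, while `parse_args(["--data-file-url", some_url])` overrides only the URL and keeps the other two values.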

The data file is huge (around 1.5 GB) and is downloaded from the Internet. Because of that, you may have failed to download it. Otherwise it should have appeared in the ../data folder (unless you provided custom parameters / paths)

Lastly, the script validates the downloaded file against an expected MD5 checksum to ensure the right file was downloaded. Thus, if you did not specify the real data file URL, the downloaded file would not have matched the expected hash, and the script would have failed (leaving your data folder empty).
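A minimal sketch of this kind of check, assuming the file is already on disk, is shown below. The helper names `md5_of_file` and `is_valid_download` are hypothetical and this is not the project's actual code; it only illustrates comparing a file's MD5 digest with an expected value using Python's standard `hashlib`.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks
    so that a very large file never has to fit in memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_valid_download(path, expected_hash):
    """Return True if the file's MD5 digest matches the expected hash."""
    return md5_of_file(path) == expected_hash.lower()
```

A download script could call `is_valid_download(path, expected_hash)` after fetching the file and report an error such as “Cached data file hash is invalid.” when it returns `False`.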

What we did to improve usability:

Thanks!

ksedivyhaley commented 4 years ago

Hi Serg,

In the Milestone 1 submission the defaults weren't clearly specified in the documentation. Looking again, I found them in a comment here:

    # Define constants with key values
    DATA_FILE_PATH = '../data/vehicles.csv'
    DATA_FILE_HASH = '06e7bd341eebef8e77b088d2d3c54585'
    DATA_FILE_URL = 'http://mds.dev.synnergia.com/uploads/vehicles.csv'

(I didn't write down the URL I used, but I assume it was something from Kaggle that looked like the right URL but wasn't, hence the failure you described.)

I note that your current version has improved the documentation to make the intended URL obvious, so that fixes the initial issue - script ran as intended while I wrote up this issue! Usability improvements also look good.