UBC-MDS / DSCI_522_Group-308_Used-Cars

This project attempts to build a regression model to predict price of used cars based on numerous features of the car
MIT License
2 stars 6 forks source link

Usage info in readme file is wrong #32

Closed firasm closed 4 years ago

firasm commented 4 years ago

python scripts/download.py --DATA_FILE_PATH=../data/vehicles.csv --DATA_FILE_URL=http://mds.dev.synnergia.com/uploads/vehicles.csv --DATA_FILE_HASH=06e7bd341eebef8e77b088d2d3c54585

does not run after I clone the repo.

Error is:

Checking cached data file... NONE
Traceback (most recent call last):
  File "scripts/download.py", line 76, in <module>
    main(opt["--DATA_FILE_PATH"], opt["--DATA_FILE_URL"], opt["--DATA_FILE_HASH"])
  File "scripts/download.py", line 28, in main
    download_data(file_path, file_URL, file_hash)
  File "scripts/download.py", line 52, in download_data
    lambda block, block_size, total_size: sys.stdout.write(
  File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py", line 257, in urlretrieve
    tfp = open(filename, 'wb')
FileNotFoundError: [Errno 2] No such file or directory: '../data/vehicles.csv'

Normally (in a real project) I'd tell you the fix, but in this case I want you to find it and fix it :-)

pokrovskyy commented 4 years ago

Thanks for the comment. As far as I can see, you ran the code from the root repo folder by calling python scripts/download.py, however supplied a path relative to the local repository root folder. So the script attempted to download the data file into folder that apparently did not exist on your PC. In this case you should have specified --DATA_FILE_PATH=data/vehicles.csv or run the script from scripts folder. One thing I could do though is handle this case by providing more meaningful error message, like "Invalid download location specified" etc. Is that what's on your mind? I don't think the script should recursively create any additional folders for the sake of security

pokrovskyy commented 4 years ago

Alternatively, you could have run the script without parameters (from the scripts folder) Eventually, this will be handled in the single batch.

Also, could you please comment on why do we even have to request paths through command line arguments? My original understanding was that this should be restricted to the internal repository folder structure (preset in the script) Instead, you could provide some additional parameters to other scripts, like train / test split % etc. But not the paths

firasm commented 4 years ago

I just followed the instructions you had in your README under Usage:

To replicate the analysis, clone this GitHub repository, install the dependencies listed below, and run the following commands at the command line/terminal from the root directory of this project:

# download data
python scripts/download.py --DATA_FILE_PATH=../data/vehicles.csv --DATA_FILE_URL=http://mds.dev.synnergia.com/uploads/vehicles.csv --DATA_FILE_HASH=06e7bd341eebef8e77b088d2d3c54585

# data wrangling
Rscript scripts/wrangling.R --DATA_FILE_PATH=../data/vehicles.csv --TRAIN_FILE_PATH=../data/vehicles_train.csv --TEST_FILE_PATH=../data/vehicles_test.csv --TARGET=price --REMOVE_OUTLIERS=YES --TRAIN_SIZE= 0.9

Tiffany mentioned in class that you should run your scripts from the root directory so you are always doing stuff in subdirectories from the current location, rather than worrying about going up directories using ..

pokrovskyy commented 4 years ago

Thanks Firas! Doing the group meeting now, just realized it's a bug in the Readme, will update accordingly. As I said, eventually this will be all encapsulated in the Makefile. Also, we will remove default options from our Readme. It's confusing to provide all of this when it's supposed to be run without parameters. We'll make it super-clean and intuitive, with retained flexibility for those requiring advanced usage. Corresponding tasks assigned!

pokrovskyy commented 4 years ago

Closed in favour of #33