Milestone 1 feedback - Githubissues

2. Project set-up: Mechanics
Comments: Who's on your team?

Don't need both MD and RMD read me, only one. Usually just keep the MD.

3. Project proposal: reasoning
Comments: "Sub Exploratory Questions" As a suggestion, maybe put objectives or aims and rephrase them into statements? For example, it's confusing what sub-questions are.

What are the proposed methods?

How will you clean the data if it has missing values?

What if there are class imbalances? Will you create synthetic data?

What about the other measurements, is there a reason you only picked one?

For your hypothesis test, are you only utilizing values from 2013 and 2017? If so, then the visualization of the rest of the years in between is sort of useless. Would it be more interesting if you did it year by year and a range to show progress?

It looks like you do do this ("March 1 2013 to Feb 28 2015, and March 1 2015 to Feb 28 2017"), but you don't say so in the project description.

Lastly, for this hypothesis test, is there no control? What about comparing it with other countries? You won't know what the global rate of increase is, and without knowing that, how would you know if the whole world was increasing and not just Beijing?

4. A script that downloads the data: Accuracy
Comments: I need to grab the CSV file myself.

4. A script that downloads the data: Quality
Comments: We are missing the CSV with everything joined in the table.

5. Exploratory data analysis in a literate code document: VIZ
Comments: Need the correct labels, I don't know what Time A and Time B are.

5. Exploratory data analysis in a literate code document: REASONING
Comments: Could have utilized more plots.

Hi Andy. Thank you for your feedback. We have discussed your comments within our group, and we have several queries.

As we understand it, we are working on the whole project, not just the EDA. We therefore assumed that the data would be downloaded with the script, and we put different details in the ReadMe and the EDA. We lost marks in:

2. Project set-up: Mechanics
Who's on your team?
Don't need both MD and RMD read me, only one. Usually just keep the MD.

5. Exploratory data analysis in a literate code document: VIZ
Need the correct labels, I don't know what Time A and Time B are.

There is no clear instruction, or industry convention, on where to put our names. Our names are in the ReadMe, and Time A and Time B are explained in both the ReadMe and the EDA. We used the breast cancer project as a reference, and it has the same structure (names in the ReadMe but not in the Rmd, and both MD and RMD files in the repo): https://github.com/ttimbers/breast_cancer_predictor/blob/v2.0/src/breast_cancer_eda.md

A script that downloads the data: Accuracy
I need to grab the CSV file myself.

A script that downloads the data: Quality
We are missing the csv with everything joined in the table.

We have a Python script that downloads and unzips the files and puts them in the dedicated folder. The instructions are in the ReadMe, and we tested the script on our computers, where it works. On top of that, we also have the CSV in our repo. But it looks like the TA only ran the Rmd script?

3. Project proposal: reasoning
What are the proposed methods?

How will you clean the data if it has missing values?

What if there are class imbalances? Will you create synthetic data?

What about the other measurements, is there a reason you only picked one? 

For your hypothesis test, are you only utilizing values from 2013 and 2017? If so, then the visualization of the rest of the years in between is sort of useless. Would it be more interesting if you did it year by year and a range to show progress?

It looks like you do do this ("**March 1 2013 to Feb 28 2015**, and **March 1 2015 to Feb 28 2017**"), but you don't say so in the project description.

Lastly, for this hypothesis test, is there no control? What about comparing it with other countries? You won't know what the global rate of increase is, and without knowing that, how would you know if the whole world was increasing and not just Beijing?

5. Exploratory data analysis in a literate code document: REASONING
Could have utilized more plots

5. Exploratory data analysis in a literate code document: ACCURACY
(No comment?)

We have a detailed explanation of the methods in the ReadMe, so the EDA contains only a brief explanation and focuses on the data. We also followed the instructions on, e.g., visualisation: https://pages.github.ubc.ca/mds-2021-22/DSCI_522_dsci-workflows_students/materials/assignments/milestone1.html#project-proposal

@flor14 @andytai7

Hello! I had a look at your project. Congratulations on the work so far. I noticed that you got 84.55% for this milestone, which means that you have done quite good work on it.

Nice scripts! Unfortunately, the first thing I noticed is that there is a typo in the documentation, so I cannot reproduce your analysis (it should say Beijing_air_quality_EDA.Rmd, not Beijing_air_quality.Rmd). Also, I got this error when trying to download the data:
```
python src/download_data.py --url=https://archive.ics.uci.edu/ml/machine-learning-databases/00501/PRSA2017_Data_20130301-20170228.zip --out_folder=data/raw
Checking URL connection...
Unzipping file...
Failed unzip file.
[Errno 2] No such file or directory: '/tmp/tempfile.zip'
```
Considering that the course is about reproducibility, it seems to me that deducting some points is appropriate in the sections where @andytai7 did so.
```
Who's on your team?
Don't need both MD and RMD read me, only one. Usually just keep the MD.
```
These are suggestions; you can take them or not. You can read more about TA feedback here: https://github.ubc.ca/mds-2021-22/DSCI_522_dsci-workflows_students/issues/35
I agree that the data visualizations could be improved. I will talk about this today in the lab; reach out to me if you want to discuss it. I prefer to work with your current version. In brief, a plot should be understandable on its own (figure caption + legend + labels), so Andy's comment is pertinent.

Reproducibility is hard: something that runs on one computer might not run on another. I think that @andytai7's grades are appropriate for this stage of the project. My recommendation is to use the feedback to improve your work as much as you can.
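As a minimal sketch of one way to make a download step less machine-dependent: the error above mentions a fixed path, `/tmp/tempfile.zip`, and a fixed temporary location is one plausible (unconfirmed) reason a script can work on one computer and fail on another. The `download_and_unzip` helper below is hypothetical and is not the project's `src/download_data.py`; it uses only the Python standard library and lets `tempfile` choose a temporary directory that exists on the current system.

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_and_unzip(url, out_folder):
    """Download a zip archive from `url` and extract it into `out_folder`.

    Hypothetical helper for illustration only; it is not the project's
    download_data.py. Returns the list of file names in the archive.
    """
    out_path = Path(out_folder)
    # Create the output folder (and any parents) if it does not exist yet.
    out_path.mkdir(parents=True, exist_ok=True)

    # TemporaryDirectory picks a location that exists on the current OS,
    # instead of assuming that a fixed path such as /tmp is available.
    with tempfile.TemporaryDirectory() as tmp_dir:
        zip_path = Path(tmp_dir) / "download.zip"

        # Stream the response to disk rather than holding it all in memory.
        with urllib.request.urlopen(url) as response, open(zip_path, "wb") as fh:
            shutil.copyfileobj(response, fh)

        # Extract everything, then report what the archive contained.
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(out_path)
            return archive.namelist()
```

Under these assumptions it could be called with the same URL and output folder as the command above, for example `download_and_unzip(url, "data/raw")`. The temporary directory and the downloaded archive are removed automatically when the `with` block ends.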

@flor14 Hello Florencia, I actually tested our data download script with the first release files, and it ran perfectly fine: the script created the raw data folder successfully. I have recorded my screen during the whole process, from downloading the first release files from github.com to opening the script and running the download command (given in the Usage section). @andytai7 I could forward the screen recording to both of you on Slack if you would like to take a look at it. Thank you.

Thank you Andy for your feedback:

Regarding your feedback item 3, we addressed the issue regarding missing data with necessary adjustments via the following commit: [29797052ae4fb7b5be949270e6198f5a6daadb0a]
Regarding your feedback item 5, we have added Yearly and Monthly summary plots via the following commit: [70c5e89a0b3a491d569d38cd30ca7b68ff3dce04]
Regarding the design of our hypothesis test, we chose to use two multi-year intervals rather than specific years as part of our data split, so as to ensure that we make the best possible use of the data available to us.
Finally, as for class imbalance, we deliberately split the data in such a way as to avoid class imbalance. Hence, we did not discuss this concern further.
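A small sketch of how the two multi-year intervals could be turned into self-explanatory labels for plots and tables, in line with the comment that a plot should be understandable on its own. The date ranges are the ones quoted in the feedback above; the `INTERVALS` mapping and the `interval_label` function are hypothetical names, and whether the first range is what the EDA calls Time A is an assumption.

```python
from datetime import date

# Interval boundaries as quoted in the feedback (both ends inclusive).
# Assumption: these correspond to what the EDA calls Time A and Time B.
INTERVALS = {
    "Mar 2013 to Feb 2015": (date(2013, 3, 1), date(2015, 2, 28)),
    "Mar 2015 to Feb 2017": (date(2015, 3, 1), date(2017, 2, 28)),
}


def interval_label(day):
    """Return the descriptive label of the interval containing `day`.

    Returns None when `day` falls outside both intervals.
    """
    for label, (start, end) in INTERVALS.items():
        if start <= day <= end:
            return label
    return None
```

Using the date range itself as the group label means a reader does not need to look up what an abstract name refers to.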

UBC-MDS / beijing_air_quality_analysis

Milestone 1 feedback #77