Sicheng2000 / lab-10

Lab 10: Statistical inference
Creative Commons Attribution 4.0 International

Lab 10 feedback #1

Open Sicheng2000 opened 4 months ago

Sicheng2000 commented 4 months ago

@francojc

  1. a. Inferential data analysis tests the data only once because it is driven by research questions, so most of the time is spent formulating assumptions or research hypotheses. b. IDA involves four steps: identify the response and explanatory variables, inspect the variables, interrogate the data with a statistical procedure, and interpret the results. c. The null hypothesis claims that nothing is happening (no effect or relationship). d. The p-value is a metric for determining whether the results are statistically significant or not. e. A confidence interval is a range within which the true value probably falls. f. A bootstrap distribution is built by resampling the data with replacement.
  2. I think the hardest part is deciding on the model. Even though in class we talked about choosing a model based on variable types and the number of variables, I am still not sure which type of model I should try first.
  3. During the data transformation process, I found that the translation document in ENNTT is too big, even after I added cache: true. In addition, the first time around I forgot to add data/original/* to .gitignore, which made it hard to push to GitHub (https://stackoverflow.com/questions/51764180/file-too-large-added-to-git-ignore-but-still-trying-to-add), even though I tried to install Git LFS (https://stackoverflow.com/questions/48734119/git-lfs-is-not-a-git-command-unclear). For the model choice, I based it on the same model used in the textbook (https://qtalr.github.io/qtalrkit/articles/recipe-10.html). Because the translation file is too big for read_html, I could not visualize the data. Although there is a function we could use, it still does not work because it takes too long to run, so my RStudio session just terminates automatically.
  4. I am curious about the ENNTT because we face imbalanced data. Lab 09 suggests step_downsample(). However, step_downsample() removes rows of data, which may affect accuracy. So I searched for guidance on when to use step_downsample() (https://medium.com/@rithpansanga/choosing-the-right-size-a-look-at-the-differences-between-upsampling-and-downsampling-methods-daae83915c19). It seems that for large datasets, downsampling is preferred. The link discusses the pros and cons of downsampling and upsampling and when to use each.
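To make the bootstrap idea in 1(f) and the confidence interval in 1(e) concrete, here is a minimal base-R sketch. The `scores` vector is made-up toy data (not the ENNTT), and the mean is just one example of a statistic you could bootstrap:

```r
# Toy data standing in for some measured variable.
set.seed(42)
scores <- rnorm(200, mean = 50, sd = 10)

# Bootstrap: resample the data with replacement many times and
# recompute the statistic (here, the mean) on each resample.
boot_means <- replicate(2000, mean(sample(scores, replace = TRUE)))

# A 95% percentile confidence interval: the range the true mean
# probably falls into, estimated from the bootstrap distribution.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

The same resample-and-recompute pattern works for medians, correlations, or differences between groups.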
francojc commented 4 months ago

Very detailed response. Thank you!

On #2: Consider the tables in Chapter 10 to help you decide on which statistic to calculate for the statistical design of your hypothesis. Once you know that, the rest of the workflow is very similar for (almost) all the other tests.

On #3: Yes, the ENNTT is very large. It will take a lot of computer processing resources. And yes, if you accidentally (try to) push the data to GitHub you will get an error, as GitHub is not made to store large files. There is a way around it, but in this case it is just best to add the data/datasets to the .gitignore file.
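For anyone hitting the same push error, the .gitignore fix can be sketched in shell. This demo builds a throwaway repo so it is runnable as-is; in the real lab repo you would only need the `echo`, `git rm --cached`, and commit steps. The `data/original/` path follows the layout mentioned earlier in the thread:

```shell
set -e
# Throwaway repo for demonstration only.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.email "demo@example.com" && git config user.name "demo"

mkdir -p data/original
head -c 1024 /dev/zero > data/original/big.dat   # stand-in for a large file
git add -A && git commit -qm "accidentally track large data"

# The fix: ignore the directory AND untrack the already-added files.
# .gitignore alone does not help once the files are tracked.
echo "data/original/*" >> .gitignore
git rm -r -q --cached data/original/
git add .gitignore && git commit -qm "stop tracking large data"

git ls-files | grep -q "^data/original" && echo "still tracked" || echo "untracked"
```

The key point is `git rm --cached`: it removes the files from the index (so they stop being pushed) without deleting them from disk.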

On #4: Downsampling is not without its potential issues, but it reduces the dataset size while attempting to maintain a variation pattern similar to that of the full dataset.
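As a sketch of what `step_downsample()` does inside a recipe, the following uses a toy imbalanced data frame (not the ENNTT) and assumes the recipes and themis packages are installed:

```r
library(recipes)
library(themis)

# Toy imbalanced data: 900 "native" vs 100 "translated" rows.
set.seed(1)
toy <- data.frame(
  class = factor(rep(c("native", "translated"), times = c(900, 100))),
  len   = rnorm(1000)
)

# step_downsample() removes rows from the majority class so that,
# by default (under_ratio = 1), both classes end up the same size.
rec <- recipe(class ~ len, data = toy) |>
  step_downsample(class)

baked <- bake(prep(rec), new_data = NULL)
table(baked$class)  # both classes now have 100 rows
```

Because `step_downsample()` defaults to `skip = TRUE`, the row removal applies only to the training data; new data passed through `bake()` is left intact, which is usually what you want for evaluation.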