fredhutchio / tfcb_2019

class materials for MCB517A through UW/Fred Hutch

Questions for homework 7: python machine learning and remote computing #39

Open k8hertweck opened 4 years ago

k8hertweck commented 4 years ago

Please ask any questions about homework 7 here.

zyaffe commented 4 years ago

It looks like some files were mixed up in the GitHub Classroom version of the assignment: the README and Python notebook are from homework 5. The correct versions of those files are included on the TFCB Homework 7 GitHub page. Should we download those files and delete the incorrect ones from our repository?

k8hertweck commented 4 years ago

@zyaffe Thanks for letting me know I massively screwed up the files in the GitHub Classroom repo! You are correct, the appropriate files can be found in this repository. More specifically, you should replace the README and ipython notebook files.

stephen-rettie commented 4 years ago

Problem 1 of the unix section asks for the "100th sequence in the file." Should we provide general code that works on any of the files, or is there one file in particular?

k8hertweck commented 4 years ago

@stephen-rettie Thanks for asking for clarification! Any of the files in the dataset downloaded in Question 0 is acceptable (though I believe I'd intended for it to be the first file listed with ls).
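
For anyone who wants to sanity-check a unix answer, here is a minimal Python sketch of the same idea. It assumes the files are FASTA-formatted, which this thread does not confirm, and `fasta_records` and `nth_record` are hypothetical helper names rather than anything from the assignment.

```python
from itertools import islice


def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Assumption: records start with '>' and sequences may span several lines.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def nth_record(lines, n):
    """Return the n-th record (1-based), or None if there are fewer than n."""
    return next(islice(fasta_records(lines), n - 1, n), None)


# Made-up data: 150 records, each with its sequence split over two lines.
demo_lines = []
for i in range(1, 151):
    demo_lines += [f">seq{i}", "ACGTACGT", "TT"]

header, sequence = nth_record(demo_lines, 100)
print(header, sequence)
```

With a real file, an open file handle can be passed in place of `demo_lines`, since the helpers only need an iterable of lines.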

stephen-rettie commented 4 years ago

I ran into an issue on question 4.

Using the non-scaled dataframe from question 3, the answer in question 4 looks fine: the same plot as question 3, but colored by cluster. However, if I use the scaled dataframe from question 3, the points in question 4 change positions, leading to a different plot entirely.

It appears this is due to the part of question 4 where we add the cluster column to the dataframe before doing principal component analysis. Even keeping pca = PCA(n_components=10), it looks like it uses the clusters column and changes the plot. I think this is because n_components picks the 10 best dimensions, and in the scaled dataframe the clusters column contains numbers greater than 1, so it gets chosen, whereas in the unscaled dataframe I think it doesn't because the other values are much larger.

Not adding the clusters column to the scaled dataframe in question 4 leads to the expected result: the same plot as in question 3, but colored by cluster.

yliu234 commented 4 years ago

Hi Kate,

Could you specify a bit more on what Problem 2 is asking for? Should we make an environment for the python script?

Thank you! Yuzhen

k8hertweck commented 4 years ago

> Using the non-scaled dataframe from question 3, the answer in question 4 looks fine: the same plot as question 3, but colored by cluster. However, if I use the scaled dataframe from question 3, the points in question 4 change positions, leading to a different plot entirely.
>
> It appears this is due to the part of question 4 where we add the cluster column to the dataframe before doing principal component analysis. Even keeping pca = PCA(n_components=10), it looks like it uses the clusters column and changes the plot. I think this is because n_components picks the 10 best dimensions, and in the scaled dataframe the clusters column contains numbers greater than 1, so it gets chosen, whereas in the unscaled dataframe I think it doesn't because the other values are much larger.
>
> Not adding the clusters column to the scaled dataframe in question 4 leads to the expected result: the same plot as in question 3, but colored by cluster.

@stephen-rettie I'll give the quick answer first, since it's relevant to everyone in class: the homework does not require you to proceed with the standardized dataset (in fact, it's easier for us to grade if you do not).

Now for the longer answer: if you compare plots of the k-means clustering results for the unscaled and scaled datasets in question 4, they should indeed appear similar. Without seeing your code, I'm not sure what the exact issue with your dataset is; you're correct that adding a cluster column to the dataset would influence the results of a PCA. I'm not quite sure what you're asking here, so let me know if this isn't clear.
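
To illustrate the point about the cluster column: PCA builds each component from all columns, weighted by their variance, so an extra label column matters far more once the other columns have been scaled to unit variance. The following is a minimal sketch, assuming scikit-learn and synthetic data (not the homework dataset); `pc1_shift` is a hypothetical helper written for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in (an assumption): two groups of 100 samples,
# 12 features whose raw values are in the hundreds.
group = np.repeat([0, 1], 100)
X = 100 * (rng.normal(size=(200, 12)) + 3 * group[:, None])
X_scaled = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)


def pc1_shift(features, cluster_labels):
    """Relative change in PC1 scores when the labels are appended as a column."""
    base = PCA(n_components=2).fit_transform(features)[:, 0]
    augmented = np.column_stack([features, cluster_labels])
    with_labels = PCA(n_components=2).fit_transform(augmented)[:, 0]
    # abs() ignores the arbitrary sign of a principal component.
    return np.max(np.abs(np.abs(base) - np.abs(with_labels))) / base.std()


shift_unscaled = pc1_shift(X, labels)
shift_scaled = pc1_shift(X_scaled, labels)
print(f"unscaled: {shift_unscaled:.2e}  scaled: {shift_scaled:.2e}")

# Keeping the labels out of the PCA input avoids the problem entirely:
# fit on the feature columns only, then use the labels just for coloring.
coords = PCA(n_components=2).fit_transform(X_scaled)
```

In a dataframe workflow, the equivalent is to pass only the feature columns to PCA and add the cluster column afterwards, or drop it before fitting.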

k8hertweck commented 4 years ago

> Could you specify a bit more on what Problem 2 is asking for? Should we make an environment for the python script?

@yliu234 This question is asking you to handle a common problem in computational biology: you need to run an analysis but aren't sure if the tools are available. First figure out what tools (software) are required to run the python script. Then discover whether this software is available on rhino, and how you could access it. Please note that this question doesn't require you to actually run the script. I hope this makes more sense!
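
One way to approach the first step is to list the modules a script imports and then check whether each can be found in the current environment. This is a minimal sketch using only the standard library. `top_level_imports` and `availability` are hypothetical helper names, and the embedded script text is made up for illustration. Running it on rhino would only describe whichever Python is currently loaded there.

```python
import ast
import importlib.util


def top_level_imports(source):
    """Return the sorted top-level module names imported by Python source text."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return sorted(names)


def availability(names):
    """Map each module name to True if it can be found without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Made-up script text; the parser only reads it and never executes it.
example_script = """
import os
import json.decoder
from collections import Counter
import hypothetical_missing_pkg_xyz
from . import sibling
"""

required = top_level_imports(example_script)
status = availability(required)
for name in required:
    print(f"{name}: {'found' if status[name] else 'not found'}")
```

For a real script, read its contents with `open(path).read()` and pass that in. Modules reported as not found are the ones to look for through whatever software system the cluster provides.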