Open k8hertweck opened 4 years ago
It looks like some files were mixed up on the GitHub Classroom version of the assignment - the readme and python notebook are from homework 5. The correct versions of those files are included on the TFCB Homework 7 GitHub page. Should we download those files and delete the incorrect ones from our repository?
@zyaffe Thanks for letting me know I massively screwed up the files in the GitHub Classroom repo! You are correct, the appropriate files can be found in this repository. More specifically, you should replace the README and ipython notebook files.
Problem 1 of the unix section asks for the "100th sequence in the file." Should we provide general code that works on any of the files, or is there one file in particular?
@stephen-rettie Thanks for asking for clarification! Any of the files in the dataset downloaded in Question 0 are acceptable (though I believe I'd intended for it to be the first file listed with ls).
I ran into an issue on question 4.
Using the non-scaled dataframe from question 3, the answer in question 4 looks fine: the same plot as question 3 but colored by cluster. However, if I use the scaled dataframe from question 3, the points in question 4 change positions, leading to a different plot entirely.
It appears that this is due to the part of question 4 where we add the cluster column to the dataframe before doing principal component analysis. Even keeping pca = PCA(n_components=10), it looks like the PCA uses the clusters column and changes the plot. I think this is because n_components picks the 10 best dimensions, and in the scaled dataframe the clusters column contains numbers greater than 1 so it gets chosen, whereas in the unscaled dataframe I think it doesn't because the other values are much larger.
Not adding the clusters column to the scaled dataframe in question 4 leads to the expected result of the same plot from 3 but colored by cluster.
Hi Kate,
Could you specify a bit more on what Problem 2 is asking for? Should we make an environment for the python script?
Thank you! Yuzhen
@stephen-rettie I'll give the quick answer first, which is relevant to everyone in class: the homework does not require you to proceed with the standardized dataset (in fact, it's easier for us to grade if you do not).
Now for the longer answer: if you compare plots between results from K-means clustering of unscaled and scaled datasets for question 4, they should indeed appear similar. Without seeing your code, I'm not sure what the exact issue with your dataset is; you're correct that the addition of a cluster column to the dataset would influence the results of a PCA. I'm not quite sure what you're asking here, so let me know if this isn't clear.
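To illustrate why an added cluster column can matter more after scaling, here is a minimal sketch using synthetic data, not the homework dataset. It relies on the fact that PCA finds directions of maximum variance rather than selecting individual columns, so a column's influence depends on its variance relative to the others. The feature scales, the six cluster labels, and the helper `first_pc_loadings` are all assumptions made for illustration; the PCA is computed directly with NumPy's SVD on centered data.

```python
import numpy as np


def first_pc_loadings(X):
    """Return the absolute loadings (one per column) of the first principal component."""
    centered = X - X.mean(axis=0)  # PCA centers each column before decomposing
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(vt[0])  # rows of vt are principal directions, largest variance first


rng = np.random.default_rng(0)
n = 500

# Hypothetical stand-in for the measurements: five features on large, unequal scales.
features = rng.normal(size=(n, 5)) * np.array([1000, 500, 200, 100, 50])

# Hypothetical stand-in for K-means labels: integers 0..5.
clusters = rng.integers(0, 6, size=n).astype(float)

# Standardize the features only (zero mean, unit variance per column).
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

# Append the (unscaled) cluster labels to each version of the data.
unscaled_with_clusters = np.column_stack([features, clusters])
scaled_with_clusters = np.column_stack([scaled, clusters])

# Weight of the cluster column (last column) in the first principal component.
w_unscaled = first_pc_loadings(unscaled_with_clusters)[-1]
w_scaled = first_pc_loadings(scaled_with_clusters)[-1]

print(f"cluster-column loading on PC1, unscaled features: {w_unscaled:.4f}")
print(f"cluster-column loading on PC1, scaled features:   {w_scaled:.4f}")
```

In this sketch the cluster column barely contributes when the features have large variances, but it dominates the first component once every feature has unit variance. Running the PCA only on the measurement columns and using the cluster labels solely for coloring avoids the effect in both cases.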
@yliu234 This question is asking you to handle a common problem in computational biology: you need to run an analysis but aren't sure if the tools are available. First figure out what tools (software) are required to run the python script. Then discover whether this software is available on rhino, and how you could access it. Please note that this question doesn't require you to actually run the script. I hope this makes more sense!
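As one possible way to approach the first step (figuring out what software a Python script needs), the sketch below lists the top-level modules a script imports and checks whether each is importable in the current environment. The helper names `required_modules` and `availability` and the example source are hypothetical; this does not cover how software is provided on rhino specifically, and it does not run the script.

```python
import ast
import importlib.util


def required_modules(source):
    """Return the sorted top-level module names imported by Python source code."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # "import a.b as c" -> top-level package "a"
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # "from a.b import c" -> top-level package "a" (relative imports skipped)
            names.add(node.module.split(".")[0])
    return sorted(names)


def availability(modules):
    """Map each module name to True if it can be found in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Hypothetical example source; in practice, read the text of the script instead.
example_source = """
import os
import numpy as np
from sklearn.decomposition import PCA
"""

modules = required_modules(example_source)
for name, found in availability(modules).items():
    print(f"{name}: {'available' if found else 'not found'}")
```

Parsing the imports with `ast` means the script itself is never executed, which matches the note that the question does not require running it.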
Please ask any questions about homework 7 here.