RobertsLab / resources

https://robertslab.github.io/resources/
19 stars 11 forks source link

Oyster Proteomics 2xTemp time series statistical analyses to try #387

Closed shellywanamaker closed 6 years ago

shellywanamaker commented 6 years ago

Oyster proteomics: larvae in two different rearing temperatures

this info is also entered in my lab notebook

    -experiment: 15 days of rearing; one group (silo 3) at 23C and one (silo 9) at 29C  
    -questions to address:  
        1. Does 23C proteome differ from 29C proteome over time?  
        2. Do the two 23C proteomes differ over time (silo 2 vs. silo 3)  
    -need to determine which proteins between the two groups are different (e.g. show different abundance trends over time)  
        -Try ACSA [Ref for ASCA](https://www.ncbi.nlm.nih.gov/pubmed/15890747) ; [Ref for ASCA in metabolomics time series](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4559107/)  
        -Try Random Forest (can refer to its use in [MetaboAnalyst](http://www.metaboanalyst.ca/MetaboAnalyst/faces/home.xhtml); [MetaboAnalystR](https://github.com/xia-lab/MetaboAnalystR)) [Random Forest Methods in metabolomics](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5133491/)  
kaitlynrm commented 6 years ago

I'm trying to run the kmeans clustering with both silos. However when I run coef.hclust(clust.avg) I recieve the error Error in coef.hclust(clust.avg) : !is.unsorted(ht) is not TRUE. This coefficient tells me how distinct and dissimilar the clusters are from one another (closer 1 = more distinct/dissimilar).

That's not a huge deal (although foreshadowing of future errors) so I continued on. I was able to plot a dendrogram however when I try to cut the dendrogram clust.class<-cutree(clust.avg, h=0.8) I receive Error in cutree(clust.avg, h = 0.8) : the 'height' component of 'tree' is not sorted (increasingly).

I tried to research this error, but was having some trouble since my understanding of hierarchical clustering isn't super high. This forum was the most helpful thing I found.

From what I understand, using a median or centroid method of clustering can result in clusters that reverse and are thus not totally hierarchical which makes it impossible to cut them. However, I used the average, so that shouldn't be my problem... Another possibility is to cut based on clusters (k), but this seems to defeat the point of the cutting the dendrogram since I am trying to determine the number of clusters. Using abline() to find the best cutoff value for clusters doesn't work because I only have the dendrogram plotted with the h value, and I don't think it's possible/know how to plot k values.

Here is my code and files for running kmeans clustering if anyone has any suggestions....

kubu4 commented 6 years ago

@kaitlynrm - Can you post an example of functional code that uses these libraries/functions in a similar fashion?

Clearly, there's a sorting error that seems be produced when running clust.avg<-hclust(nsaf.bray, method='average'). The documentation only seems to suggest that the tree heights won't get sorted when using method="centroid" or method="median"), so not sure why that's happening here.

For example, no error is produced when setting method=ward.D.

Any chance you can use a different method? Alternatively, you'll need to figure out how to sort the output of hclust(nsaf.bray, method='average').

kaitlynrm commented 6 years ago

Sure! Code and datasheet.

Correct, I'm using average so what the forum suggests is an error shouldn't be an error, however that's the only problem I found receiving a similar error.

I am using bray-curtis not eucledian distances. Eucledian distances are required for ward methods.

Yeah, I'm not sure why clust.avg isn't sorting in this case (it worked on individual silos), and how it can plot a dendrogram if it isn't sorted.

kubu4 commented 6 years ago

So, it worked on data from individual silos, but not on a combined data set? Weird.

How did you combine the two data sets (e.g. script we can view)?

kaitlynrm commented 6 years ago

I know! I literally pasted one underneath the other in excel and saved it. All the silo data is formatted exactly the same. I didn't merge (or rbind) them in R because I want each detected protein to be associated with both silos (eg. 3-ALBU_BOVIN and 9-ALBU_BOVIN).

kubu4 commented 6 years ago

It doesn't look like you just copied/pasted, as evidenced by the 3- or 9- notation in the silo3and9.csv. That notation is not present in the individual CSV files (sil3.csv and silo9.csv). How did the notation get appended?

kaitlynrm commented 6 years ago

Concantenate("3_", cell) adds the 3 or 9 in excel.

kubu4 commented 6 years ago

I have an aside unrelated to this issue.

Concantenate("3_", cell) adds the 3 or 9 in excel.

Best practice would be to perform all of these types of manipulations in a reproducible fashion (i.e. concatenate two data frames in R script, append string to column in R script). Otherwise, it just seems as though data appears out of thin air with no way of tracking how they've been manipulated.

In turn, without knowing how files were generated, it complicates troubleshooting any issues that might arise.

kubu4 commented 6 years ago

@emmats - Would you happen to have any insight as to why @kaitlynrm's getting the error mentioned earlier in this issue?

emmats commented 6 years ago

When you read in the file, you should add the header=T argument since you have a row of column headers. I'm generally confused by why you are loading 2 files at the beginning of your code. Shouldn't you just load one file with the row.names = 1 and header=T arguments? When I look at the silo3and9.csv it doesn't look like NSAF data.

Despite these quibbles, I don't see anything wrong with your further down. I'm having trouble downloading your csv file from github. Do you want to send it to me and see if I replicate your error?

kaitlynrm commented 6 years ago

Thanks @kubu4! That's how I was shown to do it quickly in my meeting with Steven. I figured it was easy in R but left it as is since we edited it then. Also, I would've combined the data in R instead of pasting it in R but I couldn't find a code to do that. Rbind and cbind didn't work and merge requires a similar column (the same as the other bind commands I believe) which doesnt produce the final dataframe I want, nor would it work since the detected proteins in each silo aren't identical. Those were the only codes I found to combine data but another code suggestion would be great for future use!

@emmats I loaded in silo#.detected so that I could make a data frame with 2 columns, first with proteins second with cluster assignment, after creating the cluster matrix. I couldn't figure out how to make header row names as a column. When we chatted, you mentioned you did it in excel so I wrote it that way to try to make it reproducible since I ran it multiple times. I'm sure it could be cleaned up but it's organized the data appropriately for the rest of the code so I left it as is.

The data is from the same data sheet that enrichment and NMDS charts have been made from. I only have 1 data sheet which I was under the impression that it is spectral counts from Abacus. I think you might have mentioned this before though and it brought up the question of where the original or intermediate files might be and we couldn't locate them.

I was trying to read up on bray-curtis vs eucledian and the methods for hclust but it didnt look like the other methods made sense for the type of data and analyses I wanted to do so i didnt try another method in the code. Could it be that using 14,000 proteins is too complex for hclust to sort?

I'll email the table to you too @emmats, thanks!

kubu4 commented 6 years ago

I'm having trouble downloading your csv file from github.

@emmats - Clone (or download) the entire repo - that's the easiest way. Go here and look for the green button towards the right side of the screen.

but another code suggestion would be great for future use

@kaitlynrm - Here's an example on how to concatenate two datasets (using R):

Pretend you're in your silo9 directory in your repo and then run the following command.

system("cat ../silo3/silo3.csv >> ../Silo3_and_9/silo3_9.csv") system("cat silo9.csv >> ../Silo3_and_9/silo3_9.csv")

Here's what's going on:

  1. system() allows you to use shell commands within R.

  2. cat is the shell command to print file contents .

  3. >> is a shell command to redirect output of the cat command and append to the second file. If the second file doesn't exist, it will be created (this is the case in the first system() command.

  4. silo3_9.csv now contains all of the orginal silo3.csv data and silo9.csv data.

Important note - If your silo9.csv file has a row of column names, you'll want to strip that off prior to running the second system() command.

emmats commented 6 years ago

It's a file format problem. There is something weird about the format of your input file. When I copied and pasted all the values to a new excel workbook and saved as a .csv everything worked fine. Also, I suggest the following changes to the beginning of your R script:

Load in spectral count data data

silo3_9 <- read.csv("silo3_9.csv", header=T)

remove contaminant proteins

proteins3_9<-subset(silo3_9, grepl(paste('CHOYP', collapse="|"), silo3_9$Protein))

rownames(proteins3_9)<-proteins3_9$Protein prot3_9<-proteins3_9[,2:9]

Your file for subsequence analyses will be prot3_9

kaitlynrm commented 6 years ago

Why is everything that isn't a "CHOYP" labelled protein considered a contaminant @emmats ?

emmats commented 6 years ago

I think that all the proteins from the oyster proteome are called "CHOYP.....". We add protein sequences of common lab contaminants into our database for searches so that we don't try to force oyster IDs on human keratin, etc. So I remove those non-oyster proteins from analyses.

kaitlynrm commented 6 years ago

Thanks @emmats ! I might need to rerun my other clusters then- it looks like there are 23 contaminant proteins. Also,header = TRUE is default for read.csv() so I didn't include it in the code.

@kubu4 is there a way to either add "3" and "9" into the silo3 and silo9 tables in the terminal, or is there a way to concatenate the two dataframes in R? I can't concatenate the two dataframes until the silos are delimited in front of the protein names, and I only know how to add those characters to the column in R. I don't want to start trying codes in the terminal since it is messing with my files directly (also since I don't understand how to do things in the terminal).

kubu4 commented 6 years ago

Here's how to use the terminal commands in your R script for reproducibility. Here's an example to append text to beginning of first field in your files and concatenate data to new file. You'll have to user your newly acquired understanding of relative paths to figure out how to get the data concatenated in to the same file. :)

system("awk '$1="3_"$1' silo3.csv >> silo3_9.csv") system("awk '$1="9_"$1' silo9.csv >> silo3_9.csv")

Here's what's happening:

  1. system() allows you to use shell commands within R.

  2. awk is a shell program for manipulating text files.

  3. '$1="3_"$1' are the arguments needed for awk to operate. This tells awk to append 3_ to the beginning of the first field ($1). Since awk operates on a line-by-line basis, this command is executed for each line in the input file.

  4. silo3.csv is the input file that awk should work on.

  5. >> silo3_9.csv appends the output of the awk command to a file called silo3_9.csv. If that file doesn't exist, it will be created.

I don't want to start trying codes in the terminal since it is messing with my files directly

Don't be afraid to mess with files in terminal! The actions you run don't actually modify your input files, unless you explicitly tell them to. Virtually all commands print the results to the screen. A notable exception to this rule is the rm command. This will immediately delete your file(s).

kaitlynrm commented 6 years ago

I tried to run system("awk '$1 = "3_"$1' silo3/silo3.csv >> Silo3_9/silo3_9.csv) when I was in my /kmeans directory, but I kept recieving the error unexpected numeric constant in "system("awk '$1 = "9" . I tried to modify it like this system("awk '{$1 = "3_" $1}1' silo3/silo3.csv >> Silo3_9/silo3_9.csv according to this source, but it gives me the same error.

kubu4 commented 6 years ago

Formatting, including spaces, are critical. Compare spaces in my code (or, lack of) with yours.

I'll provide more explicit details later today.

kubu4 commented 6 years ago

Here's the fix (sorry, I neglected to include a very important set of backslashes!). The commands below assume you're in your repo's root directory:

system("awk '$1=\"3_\"$1' analysis/kmeans/silo3/silo3.csv >> analysis/kmeans/Silo3_and_9/silo3_9.csv")

system("awk '$1=\"9_\"$1' analysis/kmeans/silo9/silo9.csv >> analysis/kmeans/Silo3_and_9/silo3_9.csv")

The two commands above will do what what you want.

Sorry, I didn't test them before (was on mobile)!

Anyway, adding the backslashes do something called "escaping" the character that follows the backslash. They are needed to tell the computer that the quotation marks should not be interpreted as part of the code.

In my previous example, since the quotations weren't escaped, the computer saw the command as:

system("awk '$1="

This is because the computer saw the first quotation mark (before awk) and the closing quotation mark (after the =). This tells the computer that the command between those quotation marks is the command to run. However, that results in an incomplete awk command, which generated the error.

Sorry about that!

kaitlynrm commented 6 years ago

No worries @ kubu4! I also had to specify the absolute path or else I got the error sh: 1: cannot create /Silo3_and_9/silo3_9.csv: Directory nonexistent (or if I did the absolute path of the first file but not the second I would get awk: cannot open /silo3/silo3.csv (No such file or directory)).

I modified the code to create silo3_9 then removed contaminant proteins. However, I got the same error as before Error: !is.unsorted(ht) is not TRUE.

I copied and saved silo3_9-edited.csv to a new spreadsheet and still received the same error.

kubu4 commented 6 years ago

I also had to specify the absolute path or else I got the error sh: 1: cannot create /Silo3_and_9/silo3_9.csv: Directory nonexistent

FYI: /Silo3_and_9/silo3_9.csv is specifying an absolute path to a file.

An absolute path is any path that begins with a /. This tells the computer to start looking for folders/files from the root directory of the computer (i.e. the very "top" of the file system).

So, these two paths are not the same:

I believe you wanted the former. Alternatively, you could've used the fixed code I provided above without modification, as long as you were in the top directory of your repo (also referred to as the root directory of your repo).

sr320 commented 6 years ago

@kaitlynrm Class is at 3pm :)

kubu4 commented 6 years ago

When I was running your updated code, I received the following error when I ran:

rownames(proteins3_9)<-proteins3_9$Protein:

Did you get this error? I can't get beyond this line to even get to the point of testing coef.hclust(clust.avg).

kaitlynrm commented 6 years ago

No I did/do not get that error. Did you clone the repo and/or get the silo3_9 tables I created? If so, the rows are duplicating in the silo3_9 file from the system(awk...) code. You will have to delete those tables first or modify the system(awk..) code to create a new file before you can continue or else the table you are running through the code has the data from silo3 and silo9 twice.

@sr320 I wish I could go! I would have to come in late, around 3:30...

kubu4 commented 6 years ago

Yes, I cloned your repo, and deleted all CSV files from your Silo3_and_9 directory so that it would be a fresh start.

Then, I went through the code.

kaitlynrm commented 6 years ago

I just re-ran it the same way without any CSV files and did not get any error there.

kubu4 commented 6 years ago

@kaitlynrm helped resolve discrepancy between my cloned repo and what was on GitHub. For some reason, my git pull seemed to get an outdated version of here repo...

No more rownames() error.

kaitlynrm commented 6 years ago

@kubu4 Those files have never had more than the ~7500 proteins in them so pulling from an outdated version shouldn't have affected the code. It looked like something happened that produced a duplicate of the contents of silo3.csv and silo9.csv.

kubu4 commented 6 years ago

The problem was related to me making changes to my git repo during my testing from last week and not getting rid of those changes. Thus, the reason that git pull functioned normally, but still exhibited discrepancies between GitHub and my local version of the repo. I should've run git status and I would've seen the problem.

kubu4 commented 6 years ago

@emmats - Can you please elaborate on exactly what you did here:

It's a file format problem. There is something weird about the format of your input file. When I copied and pasted all the values to a new excel workbook and saved as a .csv everything worked fine.

Copied/pasted which specific files?

Did you rename columns/rows in while in Excel or did you do this part in R?

kaitlynrm commented 6 years ago

@emmats What's the rationale for using Bray-curtis over euclidean distances (and the average method for hclust)?

emmats commented 6 years ago

I use bray-curtis because I had a conversation with Julian years ago and it is what we decided to use for these types of data sets. I don't remember the specific details as to why. The average method seemed like a good choice (here's a brief description - https://www.saedsayad.com/clustering_hierarchical.htm). You can always test the other methods if you think one of them is a better fit for the data. I have found that there are not huge differences between most of them.

kubu4 commented 6 years ago

@emmats - Can you please elaborate on exactly what you did here:

It's a file format problem. There is something weird about the format of your input file. When I copied and pasted all the values to a new excel workbook and saved as a .csv everything worked fine.

Copied/pasted which specific files?

Did you rename columns/rows in while in Excel or did you do this part in R?

emmats commented 6 years ago

I renamed the file columns in excel because having numbers as file headers is a little weird to me, but I don't think it really matters. When I was playing around with Kaitlyn's file of "spec counts" - the main input file for her analysis - I noticed that it wasn't really saved as a .csv file that I am used to. I'm not sure exactly what it was. I copied all the cells with values in them from the original input file and pasted into a new excel workbook, which I saved as a .csv. When I read in this new file, I did not get any errors in the script. I think her original file format coded the values as something other than the numerical values they were, but I'm not sure exactly what was going on.

kubu4 commented 6 years ago

Can you please provide specific file names? In this particular issue, she is referring to three different files, so I'm not sure which file you're referring to here:

the main input file for her analysis

Also, can you clarify (i.e. what tipped you off) this statement:

I noticed that it wasn't really saved as a .csv file that I am used to

emmats commented 6 years ago

The silo3_9.csv. I'm still not clear as to why there were multiple input files...It is the file with days as column header, proteins as row names, and some kind of value that we are assuming are averaged spec counts in the cells. Originally, with Kaitlyn's file, I was getting the same error she was. I was playing with the file in excel, chose save as when I changed something, and noticed that excel wanted to save it as some weird file format. So I decided to try to start fresh and save it as a .csv and it worked.

kubu4 commented 6 years ago

@kaitlynrm -

I'm interested in reproducing @emmats file changes, but your repo doesn't have a silo3_9.csv file that dates back further than 9/28/2015 (@emmats' initial comment about identifying formatting issues is from 9/26/2018; see screen cap of repo history below).

Am I missing something? It also appears that you may have ended up emailing a file to her:

I'll email the table to you too

Can you please forward me the same table (just forward me the email you sent to her)?

selection_073

--edited to add date for Emma's initial comment---

kaitlynrm commented 6 years ago

I removed that file from my repo and recreated it in R @kubu4. I added it back in now though!

However, after talking with @yaaminiv and reviewing the multivariate stats class material (FISH 560), I think I should use euclidean instead of bray-curtis. Bray-curtis is an asymmetrical distance matrix so the double zero values in my data will be excluded, but that is valuable information since 0 values are part of the pattern of abundances I am attempting to analyze and cluster. Euclidean is a symmetrical distance matrix so it will include those values in the matrix. @yaaminiv said that Julien told her to use Euclidean distances for her data as well.