Error merging biological replicates with Skyline data

yaaminiv commented 6 years ago

From #19

@laurahspencer: Error merging biological data to abundance data; possibly b/c the column name in biol. data set isn't exactly "Sample.Number"

After changing that column's name to "Sample.Number" I executed the merge, but dataframe is empty; I think it's b/c your X db doesn't include sample numbers w/o replicates:

Tried on my computer, worked just fine. Could be an issue with different computers? @sr320 and @grace-ac will test.
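
An empty result from merge() usually means the key column exists in both data frames but none of its values match. A minimal sketch of how one might check that before merging; the two data frames and their values below are made up for illustration (hypothetical stand-ins, not the project's data):

```r
# Hypothetical stand-ins: the abundance data uses plain sample numbers,
# while the biological data uses numbers with a replicate suffix.
abundance <- data.frame(Sample.Number = c("1", "2"),
                        Area = c(10.5, 20.1),
                        stringsAsFactors = FALSE)
replicates <- data.frame(Sample.Number = c("1-1", "2-1"),
                         Site = c("A", "B"),
                         stringsAsFactors = FALSE)

# Is the key column present under exactly this name in both data frames?
keyPresent <- "Sample.Number" %in% names(abundance) &&
  "Sample.Number" %in% names(replicates)

# Which key values appear on both sides? None here, so the merge is empty.
sharedKeys <- intersect(abundance$Sample.Number, replicates$Sample.Number)

merged <- merge(x = abundance, y = replicates, by = "Sample.Number")
print(keyPresent)
print(length(sharedKeys))
print(nrow(merged))
```

If the key column is present but the number of shared keys is zero, the problem is in the key values rather than in the merge call itself.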

kubu4 commented 6 years ago

For merge to function properly, don't you have to specify the by portion of the function like so:

by.x="column_name", by.y="column_name"

Not sure if you need both a by.x and a by.y, so you could maybe just use one or the other?

yaaminiv commented 6 years ago

I do specify the by portion! It just got cut off in Laura's screenshots.

Here's my code (found in this R script):

masterSRMDataBiologicalReplicates <- merge(x = masterSRMData, y = biologicalReplicates, by = "Sample.Number")

Where "Sample.Number" refers to the column name.

kubu4 commented 6 years ago

My understanding is that your usage is incorrect. It should be:

masterSRMDataBiologicalReplicates <- merge(x = masterSRMData, y = biologicalReplicates, by.y = "Sample.Number")

yaaminiv commented 6 years ago

Here's my understanding:

From R Documentation:

By default the data frames are merged on the columns with names they both have, but separate specifications of the columns can be given by by.x and by.y. The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each. For the precise meaning of ‘match’, see match.

So by would be the general case, and would still work (that's how I was taught in R class!)

Either way, I don't think this argument is what's causing Laura's issues?
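
A quick way to compare the two spellings is to run both on a toy example. This is only a sketch: the data frames below are invented stand-ins for masterSRMData and biologicalReplicates, and assume the key column has the same name on both sides:

```r
# Hypothetical stand-ins for masterSRMData and biologicalReplicates.
abundance <- data.frame(Sample.Number = c(1, 2, 3),
                        Area = c(10.5, 20.1, 30.7))
replicates <- data.frame(Sample.Number = c(1, 2, 3),
                         Site = c("A", "B", "C"))

# Shared key name passed once via `by`.
mergedBy <- merge(x = abundance, y = replicates, by = "Sample.Number")

# Same key spelled out for each side via `by.x` and `by.y`.
mergedByXY <- merge(x = abundance, y = replicates,
                    by.x = "Sample.Number", by.y = "Sample.Number")

# Compare the two results and count the merged rows.
print(identical(mergedBy, mergedByXY))
print(nrow(mergedBy))
```

The separate by.x and by.y arguments matter mainly when the key column is named differently in the two data frames.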

kubu4 commented 6 years ago

OK, there's something weird going on here. When I follow @yaaminiv's link to the script that she provided above, it's not the same script that @laurahspencer's using.

For example, see Line 25 of @yaaminiv's script (quick screen cap below) and compare that to line 25 in @laurahspencer's screenshots above - not the same!

Am I missing something or is there confusion on which script (or version of the script) is being evaluated here?

selection_100

laurahspencer commented 6 years ago

I played around with the script (on my local computer) when working through her protocol to view data frames and fix an error I got so I could move forward with the work flow.

laurahspencer commented 6 years ago

But my changes were basically View(), head(), etc.

kubu4 commented 6 years ago

I ran through @yaaminiv's script to just past where @laurahspencer got her error, and I get no errors.

RStudio Version 1.0.143.

yaaminiv commented 6 years ago

I used the same RStudio version, with R version 3.4.0 (2017-04-21) -- "You Stupid Darkness". Is there a potential version issue between our computers and Woodpecker (what Laura's using to reproduce my analyses)?

laurahspencer commented 6 years ago

It must be a Windows thing. Same R version (3.4.0, 2017-04-21) & RStudio version:

I just tried re-downloading all materials & rerunning code on the Windows computer, same error: snip20171016_1

I'll also try running my own code on the Windows machine and see if I encounter errors.

kubu4 commented 6 years ago

Adding support to @laurahspencer's Windows experience.

I also get this error when running on Windows 7 (R v3.4.2; RStudio 1.1.383).

kubu4 commented 6 years ago

I thought I had this figured out, but no dice. However, here's some insight into what's causing the issue. @laurahspencer actually alluded to this in her screen caps, but I'm not sure if she was highlighting the actual problem or just the column name. Anyway, the cause of the issue is a weird set of characters inserted in the "Sample.Name" column in the 2017-09-06-Biological-Replicate-Information.csv file:

20171016_weird_symbol

Additionally, when I try to preview that file using the head command using Git Bash, it only displays the very last line of the file:

1071016_last_line_only

When I view this in the text editor that I use (Notepad++), with "view all characters", I don't see any weird characters or anything, but I did notice that the last line of the file does NOT have a carriage return after it:

20171016_no_cr

I think a "valid" text file has to end with a newline (which might not be the same as a carriage return?), so maybe this is the issue? Will investigate a bit more.
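
Since a text editor may hide some bytes, one option is to look at the raw bytes of the file from R. The sketch below is an illustration only: it builds a small temporary file (not the real CSV) that starts with a UTF-8 byte order mark and has no final newline, then checks for both. A line feed is byte 0x0A and a carriage return is byte 0x0D, so the two can be told apart this way:

```r
# Build a small stand-in file: UTF-8 byte order mark first, no final newline.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
csvPath <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("Sample.Number,Site\n1,A\n2,B")), csvPath)

# Read the whole file back as raw bytes.
fileBytes <- readBin(csvPath, what = "raw", n = file.info(csvPath)$size)

# Does the file start with the three BOM bytes?
hasBOM <- length(fileBytes) >= 3 && identical(fileBytes[1:3], bom)

# Does the file end with a line feed (0x0A)?
endsWithNewline <- fileBytes[length(fileBytes)] == as.raw(0x0A)

print(c(hasBOM = hasBOM, endsWithNewline = endsWithNewline))
```

To try this on the real file, point csvPath at it and skip the writeBin() step.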

kubu4 commented 6 years ago

OK, here's the immediate fix to this specific issue. Line 22 should be:

biologicalReplicates <- read.csv("2017-09-06-Biological-Replicate-Information.csv", na.strings = "N/A", fileEncoding="UTF-8-BOM")

Specifying the file encoding as UTF-8-BOM is needed for this particular file.
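
A small sketch of what that argument does, assuming the stray characters are a UTF-8 byte order mark at the start of the file. The temporary file and its contents are invented for illustration. R's file() documentation describes "UTF-8-BOM" as an encoding accepted for reading that removes the byte order mark if one is present:

```r
# Build a small stand-in CSV that begins with a UTF-8 byte order mark.
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
csvPath <- tempfile(fileext = ".csv")
writeBin(c(bom, charToRaw("Sample.Number,Site\n1,A\n2,N/A\n")), csvPath)

# Read it the same way as the fixed line 22, stripping the BOM on the way in.
biologicalReplicates <- read.csv(csvPath, na.strings = "N/A",
                                 fileEncoding = "UTF-8-BOM")
print(names(biologicalReplicates))

# Guard: stop early if the key column did not come through under its exact name.
stopifnot("Sample.Number" %in% names(biologicalReplicates))
```

A guard like the last line makes a mangled column name fail at the read step instead of later at the merge.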

However, there is a bigger issue here; how did this file get this way? The answer most likely lies in the CSV being generated by Excel for Mac. I think you have to make sure that the Format selected when saving as CSV is "Windows Comma Separated (.csv)". This should ensure cross-platform functionality.

@yaaminiv please test this when you have the chance and report back.

yaaminiv commented 6 years ago

I'll make the edit to the code ASAP. I'm not working on a Windows machine right now, but I think @laurahspencer is and could check whether this fixes our issues?

And good to know about the Windows Comma Separated .csv! I'll make adjustments.

kubu4 commented 6 years ago

To clarify, I know the code change fixes the issue - I tested it on Windows.

We'll need you to test out the change in file saving procedure and see if the "old" code works with the file when saved using the "Windows Comma Separated (.csv)" option.

yaaminiv commented 6 years ago

Oh, gotcha! I was able to run through my entire script with no errors.

RobertsLab / project-oyster-oa

Error merging biological replicates with Skyline data #23
