NanoStringAssay class constructor fails on exprs() when geneid is not unique

RGLab / MAST

Tools and methods for analysis of single cell assay data in R

226 stars 57 forks source link

NanoStringAssay class constructor fails on exprs() when geneid is not unique #7

Closed gfinak closed 11 years ago

gfinak commented 11 years ago

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

raphg commented 11 years ago

Well, I think it's ok to have duplicates (if the primer are different), as long as the primers are common to all samples. This is basically how probe based gene expression data are stored (e.g. affymetrix). Multiple probes may correspond to the same gene.

Am I missing something?

Greg Finak mailto:notifications@github.com February 22, 2013 8:39 PM

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7.

Raphael Gottardo, Associate Member www: http://www.rglab.org http://www.rglab.org/ Phone: 206-667-4076 Fred Hutchinson Cancer Research Center Vaccine and Infectious Disease Division Public Health Sciences Division

gfinak commented 11 years ago

I agree duplicates are fine. But we need a unique id for each observation.

Greg Finak greg.finak@gmail.com

On Feb 22, 2013, at 8:50 PM, raphg notifications@github.com wrote:

Well, I think it's ok to have duplicates (if the primer are different), as long as the primers are common to all samples. This is basically how probe based gene expression data are stored (e.g. affymetrix). Multiple probes may correspond to the same gene.

Am I missing something?

Greg Finak mailto:notifications@github.com February 22, 2013 8:39 PM

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7.

Raphael Gottardo, Associate Member www: http://www.rglab.org http://www.rglab.org/ Phone: 206-667-4076 Fred Hutchinson Cancer Research Center Vaccine and Infectious Disease Division Public Health Sciences Division — Reply to this email directly or view it on GitHub.

raphg commented 11 years ago

We could just add a .1, .2, .3 to the gene names if needed.

Greg Finak mailto:notifications@github.com February 22, 2013 8:56 PM I agree duplicates are fine. But we need a unique id for each observation.

Greg Finak greg.finak@gmail.com

On Feb 22, 2013, at 8:50 PM, raphg notifications@github.com wrote:

Well, I think it's ok to have duplicates (if the primer are different), as long as the primers are common to all samples. This is basically how probe based gene expression data are stored (e.g. affymetrix). Multiple probes may correspond to the same gene.

Am I missing something?

Greg Finak mailto:notifications@github.com February 22, 2013 8:39 PM

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7.

Raphael Gottardo, Associate Member www: http://www.rglab.org http://www.rglab.org/ Phone: 206-667-4076 Fred Hutchinson Cancer Research Center Vaccine and Infectious Disease Division Public Health Sciences Division — Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-13985150.

Greg Finak mailto:notifications@github.com February 22, 2013 8:39 PM

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7.

gfinak commented 11 years ago

Yes, I think there's even a function for that: make.unique().

Greg Finak greg.finak@gmail.com

On Feb 22, 2013, at 8:58 PM, raphg notifications@github.com wrote:

We could just add a .1, .2, .3 to the gene names if needed.

Greg Finak mailto:notifications@github.com February 22, 2013 8:56 PM I agree duplicates are fine. But we need a unique id for each observation.

Greg Finak greg.finak@gmail.com

On Feb 22, 2013, at 8:50 PM, raphg notifications@github.com wrote:

Well, I think it's ok to have duplicates (if the primer are different), as long as the primers are common to all samples. This is basically how probe based gene expression data are stored (e.g. affymetrix). Multiple probes may correspond to the same gene.

Am I missing something?

Greg Finak mailto:notifications@github.com February 22, 2013 8:39 PM

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7.

Raphael Gottardo, Associate Member www: http://www.rglab.org http://www.rglab.org/ Phone: 206-667-4076 Fred Hutchinson Cancer Research Center Vaccine and Infectious Disease Division Public Health Sciences Division — Reply to this email directly or view it on GitHub.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-13985150.

Greg Finak mailto:notifications@github.com February 22, 2013 8:39 PM

Lucas mentioned that some genes were not unique and should be summed to get a count within the lane. Calling exprs() on a NanoStringAssay object tries to construct a matrix of columns = number of genes (or maybe rows). This does not agree with the unique number of genes and errors out.

We should either sum over the replicates before constructing the object, or define a unique primer id , or change the way exprs() gets the number of rows to take replicate genes into account.

— Reply to this email directly or view it on GitHub https://github.com/RGLab/SingleCellAssay/issues/7.

— Reply to this email directly or view it on GitHub.

amcdavid commented 11 years ago

2b7408f breaks "split"

amcdavid commented 11 years ago

75f7c20 we throw an error now at construction. In general, I don't think we can guess what's wrong with the data.frame that's provided (fix in 2b7408f assumes the ordering of the 'extra' primerids is always the same), so I'm reluctant to try to automatically try to fix things.

I suggest we add a helper method specific to Nanostring (or its constructor) that handles this situation, if there's a consistent structure there.

In any case, to use the fix in 2b7408f, we can't just assume that primerid is either set or NULL (the formal argument in SingleCellAssay constructor), because it can be set in the Mapping, which is what happens when split is called.

gfinak commented 11 years ago

This is not a fix, Andrew. If primerid has duplicates, we should deduplicate. The real issue is that you are calling unique on the primerid mapping in exprs() for SingleCellAssay.

I'm going to work on this in another branch and merge to the master when I'm done. Any consequent problems related to this should be opened as another issue.

I can handle the case of split that you've described.

amcdavid commented 11 years ago

I am not encountering the problem with non-unique primer IDs, because Name (has form pBB:XXX) is unique.

nsa <- NanoStringAssay(files, paste(pathToExp, 'h9key.csv', sep=''), idvars=c('ID', 'CartridgeID'), primerid='Name', measurement='Count', geneid='Name', featurevars='GeneID', cellvars=c('BindingDensity', 'Date', 'FovCount', 'FovCounted') ) works just fine.

I believe our original semantics was that primerid would be unique, while geneid can be any identifier, so we already sort of thought of this issue.

Rather than adding suffixes, I think we should add functionality to allow comparing the quality of primers within a gene, and maybe averaging if they all appear of acceptable quality.

I am also confused about how you would know which pBB:XXX corresponds to which suffix you add onto the GeneID without knowing pBB:XXX, since the merge happens per code file.

I made a bunch of changes to NanostringRCCReader late last night which I'm about to push, and I'll be in in about an hour.

-Andrew

2013/2/26 Greg Finak notifications@github.com

This is not a fix, Andrew. If primerid has duplicates, we should deduplicate. The real issue is that you are calling unique on the primerid mapping in exprs() for SingleCellAssay.

I'm going to work on this in another branch and merge to the master when I'm done. Any consequent problems related to this should be opened as another issue.

I can handle the case of split that you've described.

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-14119565 .

gfinak commented 11 years ago

Yes, I agree, but if someone puts in the geneid (non unique) as the primerid (should be unique) then we should deduplicate. Summarizing over genes is still possible and should be done using the geneid or primerid. In any case we only deduplicate if the geneid==primerid

I've just about got this fixed and passing all the tests. On Feb 26, 2013 11:45 AM, "Andrew McDavid" notifications@github.com wrote:

I am not encountering the problem with non-unique primer IDs, because Name (has form pBB:XXX) is unique.

nsa <- NanoStringAssay(files, paste(pathToExp, 'h9key.csv', sep=''), idvars=c('ID', 'CartridgeID'), primerid='Name', measurement='Count', geneid='Name', featurevars='GeneID', cellvars=c('BindingDensity', 'Date', 'FovCount', 'FovCounted') ) works just fine.

I believe our original semantics was that primerid would be unique, while geneid can be any identifier, so we already sort of thought of this issue.

Rather than adding suffixes, I think we should add functionality to allow comparing the quality of primers within a gene, and maybe averaging if they all appear of acceptable quality.

I am also confused about how you would know which pBB:XXX corresponds to which suffix you add onto the GeneID without knowing pBB:XXX, since the merge happens per code file.

I made a bunch of changes to NanostringRCCReader late last night which I'm about to push, and I'll be in in about an hour.

-Andrew

2013/2/26 Greg Finak notifications@github.com

This is not a fix, Andrew. If primerid has duplicates, we should deduplicate. The real issue is that you are calling unique on the primerid mapping in exprs() for SingleCellAssay.

I'm going to work on this in another branch and merge to the master when I'm done. Any consequent problems related to this should be opened as another issue.

I can handle the case of split that you've described.

— Reply to this email directly or view it on GitHub< https://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-14119565> .

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-14124956 .

gfinak commented 11 years ago

I've pushed a branch primerfix. Check it out and test it. On my end it passes all tests. Let me know if it works for you, I'll merge it.

amcdavid commented 11 years ago

Tested the branch and it doesn't break one of my bigger Sweave scripts. Thanks for adding the unit tests, we should stay in the habit when fixing bugs to make sure we don't introduce regressions.

I suggest we throw a warning if there are duplicates, because making the primerids unique still assumes that within each id.var the duplicated geneids always appear in the same order in the dataframe. If they aren't then we'll end up with meaningless results, so the user should know this rather than things silently being scrambled.

2013/2/26 Greg Finak notifications@github.com

I've pushed a branch primerfix. Check it out and test it. On my end it passes all tests. Let me know if it works for you, I'll merge it.

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-14128910 .

gfinak commented 11 years ago

They should appear in the same order, I think ddply sorts things.. but we could sort as well. I'll add that change and merge. Greg Finak greg.finak@gmail.com

On Feb 26, 2013, at 1:31 PM, Andrew McDavid notifications@github.com wrote:

Tested the branch and it doesn't break one of my bigger Sweave scripts. Thanks for adding the unit tests, we should stay in the habit when fixing bugs to make sure we don't introduce regressions.

I suggest we throw a warning if there are duplicates, because making the primerids unique still assumes that within each id.var the duplicated geneids always appear in the same order in the dataframe. If they aren't then we'll end up with meaningless results, so the user should know this rather than things silently being scrambled.

2013/2/26 Greg Finak notifications@github.com

I've pushed a branch primerfix. Check it out and test it. On my end it passes all tests. Let me know if it works for you, I'll merge it.

— Reply to this email directly or view it on GitHubhttps://github.com/RGLab/SingleCellAssay/issues/7#issuecomment-14128910 .

— Reply to this email directly or view it on GitHub.

gfinak commented 11 years ago

Fixed.