SomaLogic / SomaDataIO

The SomaDataIO package loads and exports 'SomaScan' data via the 'SomaLogic Operating Co., Inc.' proprietary data file, called an ADAT ('*.adat'). The package also exports auxiliary functions for manipulating, wrangling, and extracting relevant information from an ADAT object once in memory.
https://somalogic.github.io/SomaDataIO/

Error when using dplyr::group_by() #66

Closed laurapyle closed 10 months ago

laurapyle commented 11 months ago

Hello,

I am rerunning some code that uses SomaDataIO that I haven't run in several months and it does not seem to work as before. I think the issue has to do with dplyr verbs. When I use dplyr::group_by() followed by filter() on a soma_adat, I get the following message:

The object is not a 'soma_adat' class objects: 'grouped_df', 'tbl_df', 'tbl', 'data.frame'
Error in filter.soma_adat(., row_number() == 1): is_intact_attr(.data) is not TRUE.

I figured out a work-around for that issue, but when I then applied a custom function to log transform the proteins, I got the same message repeated over and over: "Attributes has only 3 entries: 'names', 'row.names', 'class'."

I have searched closed issues on GitHub and found some similar issues, but most of them were quite old and marked as complete. I tried reverting to dplyr version 1.0.6 per #15 but got an error when trying to load SomaDataIO, which needed at least version 1.0.10 of dplyr.

I am not sure why this has happened suddenly - thanks in advance for any help!

Laura Pyle

stufield commented 11 months ago

Hi @laurapyle

Firstly, can you tell me which version of SomaDataIO you are using? CRAN v6.0.0?

Second, would it be possible to generate a small example that reproduces the error you're experiencing? It doesn't have to be the exact example you're working with, perhaps using the example_data10 dataset that comes with SomaDataIO and grouping (Sex?) with dplyr::group_by().

All of the dplyr S3 methods do a check under the hood using is_intact_attr(), which is where the error is coming from, to ensure that the attributes of the soma_adat object are not corrupted by the dplyr verb. It appears they are corrupted here; I'm just not sure exactly when/where this happens. It's possible that a recent change in dplyr has introduced this behavior.

Additionally, the filter() method is being invoked somewhere (your error), which is a bit confusing. All the more reason a reprex would be helpful.

Side note: there are Math group-generic S3 methods for the soma_adat class, so log transforms can be performed easily (as long as the class is maintained). I would not recommend applying your own custom log-transform function, since the generics exist for this purpose and have built-in checks for edge cases and robustness.

Thank you for submitting an issue ... hopefully we can get this resolved (and fixed if a bug exists).

laurapyle commented 11 months ago

Thanks for your quick reply! I am using SomaDataIO 6.0.0. Here is a reproducible example:

library(dplyr)
library(SomaDataIO)

test <- example_data %>% arrange(SampleId) 
test <- test %>% dplyr::group_by(SampleId) 
test <- test %>% dplyr::filter(row_number()==1)

How would I use the generic S3 methods to log transform the soma_adat object?

stufield commented 11 months ago

Thanks for the example ...

log-transform is simple with the use of our Math generics:

apt <- "seq.3381.24"    # chosen at random
median(example_data[[apt]])
new <- log10(example_data)
median(new[[apt]])

See also: https://somalogic.github.io/SomaDataIO/reference/groupGenerics.html

stufield commented 11 months ago

In the meantime ... can you shed some light on what you are trying to do? I'm thinking there is likely a workaround that doesn't involve dplyr::group_by().

For example, grouping by SampleId is a little unusual ... typically this is most useful for SampleType, but I realize this may just be your dummy example.

Thank you for your reprex ... I was able to reproduce your error.

stufield commented 11 months ago

Note for dev ...

the actual bug is here:

gr_df <- dplyr::group_by(example_data, SampleType) 
class(gr_df)
class(gr_df[, -1L])

Behavior is coming from [.soma_adat(), which isn't acting as expected on a "grouped_df" object. Direct call is coming from rn2col(), which uses the [.soma_adat() extraction method.

laurapyle commented 11 months ago

I have a dataset with 2 samples per person at different study visits, and I want to select the first visit for each person, so I arrange by ID and date drawn and then take the first visit. There probably is a way that I can work around this, but I have quite a few separate analyses that use similar logic which would all need to be modified, so I was trying to understand what caused this change in behavior.
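The first-visit selection described above can also be written without group_by(), by sorting and then dropping repeated IDs. This is a minimal base-R sketch on a toy data frame. The column names SubjectId and DrawDate and all values are hypothetical, and it has not been checked here that soma_adat attributes survive the same steps on a real ADAT.

```r
# Toy stand-in for an ADAT: two visits per subject (hypothetical columns/values)
visits <- data.frame(
  SubjectId   = c("A", "B", "A", "B"),
  DrawDate    = as.Date(c("2021-06-01", "2021-01-15", "2021-01-10", "2021-07-20")),
  seq.1234.56 = c(10.5, 20.1, 11.2, 19.8)
)

# Sort by subject, then by draw date within subject
sorted <- visits[order(visits$SubjectId, visits$DrawDate), ]

# duplicated() flags every row after the first for each SubjectId,
# so negating it keeps only the earliest visit per subject
first_visit <- sorted[!duplicated(sorted$SubjectId), ]
first_visit
```

Because this only uses row subsetting, it avoids the grouped_df class entirely, which is the part implicated in this thread.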

stufield commented 11 months ago

Interesting ... I'm not sure there has been a "change" in behavior. The relevant code hasn't changed since well before SomaDataIO was released on CRAN, quite a while actually. The main offender is is_intact_attr(), which is called inside filter.soma_adat(), but that's actually a red herring; the real problem is a few steps above, where the rownames are preserved. However, I do think this needs attention either way, since a group_by() |> filter() workflow is fairly common.

laurapyle commented 11 months ago

That is very interesting! I've been running this code repeatedly without any errors for over a year, so I assumed something had changed. I am not sure why I started getting an error about a week ago. Although it does explain why I couldn't fix the problem by reverting to older package versions.

stufield commented 11 months ago

Hmmm. The issue is with the grouped_df methods for the dplyr verbs. They are called under the hood indirectly by NextMethod() in any of the verb methods. It's possible those methods have changed and cascaded into our code base that way. Though I still cannot explain why you're suddenly seeing it now unless you were using a different version of dplyr maybe?
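To make the NextMethod() point concrete, here is a small base-R sketch of how an intermediate class's method can change the result that an outer method gets back. The generic first_row() and the classes toy_adat and toy_grouped are hypothetical and exist only for illustration. This is not how dplyr or SomaDataIO are actually implemented.

```r
# Hypothetical generic, used only for this illustration
first_row <- function(x, ...) UseMethod("first_row")

# Base method: keep the first row; `[.data.frame` preserves the class vector
first_row.data.frame <- function(x, ...) {
  x[1L, , drop = FALSE]
}

# Intermediate class: delegates, then rebuilds the class vector,
# silently dropping any class that sat in front of it
first_row.toy_grouped <- function(x, ...) {
  out <- NextMethod()
  class(out) <- c("toy_grouped", "data.frame")
  out
}

# Outer class: delegates via NextMethod(), then checks its class survived
first_row.toy_adat <- function(x, ...) {
  out <- NextMethod()
  if (!inherits(out, "toy_adat")) {
    stop("class 'toy_adat' was lost along the way")
  }
  out
}

toy     <- data.frame(SampleType = c("Sample", "QC"), seq.1234.56 = c(10.5, 20.1))
plain   <- structure(toy, class = c("toy_adat", "data.frame"))
grouped <- structure(toy, class = c("toy_adat", "toy_grouped", "data.frame"))

plain_result   <- first_row(plain)                                  # class kept
grouped_result <- tryCatch(first_row(grouped), error = function(e) e)  # check fails
```

The outer method's own code is identical in both calls. Only the method reached through NextMethod() differs, which is how a change in another package's methods could surface as an error in ours.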

laurapyle commented 11 months ago

I don't think that I was using a different version of dplyr although I could be mistaken. I also tried reverting to earlier versions of dplyr and that didn't fix the problem.

laurapyle commented 11 months ago

Actually, I was able to check the history of updates for dplyr and I believe I was using a different version previously. I also don't think that I had successfully reverted to a prior version of dplyr - it couldn't be unloaded because it had been imported by tidyr. So, it's possible that the change in dplyr version is what caused the issue.
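When a downgrade does not seem to take effect, it can help to compare the version installed on disk with the version of the namespace loaded in the current session. A rough sketch follows. The helper pkg_versions() is hypothetical, and "stats" is used only so the example runs anywhere (substitute "dplyr" in practice). Whether the two values differ after reinstalling without a restart is an assumption worth verifying locally.

```r
# Hypothetical helper: report installed vs. currently loaded version of a package
pkg_versions <- function(pkg) {
  loaded <- isNamespaceLoaded(pkg)
  list(
    # version recorded in the package's installed DESCRIPTION
    installed = as.character(utils::packageVersion(pkg)),
    # version of the namespace already loaded in this session, if any
    loaded    = if (loaded) unname(getNamespaceVersion(pkg)) else NA_character_
  )
}

v <- pkg_versions("stats")
v
```

If the two values disagree, restarting R before calling library() is usually the simplest way to pick up the newly installed version.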

stufield commented 11 months ago

Either way, it's worth fixing so that future dplyr changes don't break our class. So thank you for bringing it to my attention.