Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
26 stars 20 forks source link

REF ALT QUAL FILTER fields can no longer be modified via rowRanges() #6

Closed d-cameron closed 6 years ago

d-cameron commented 6 years ago

Assigning values to one of these field should updated the value in the vcf object. It does not.

Example code:

vcf <- VCF(rowRanges=GRanges(seqnames="chr1", ranges=IRanges(start=1, width=1)), fixed = DataFrame(QUAL=1))
rowRanges(vcf)$QUAL <- 7
all(rowRanges(vcf)$QUAL == 7)

Version: VariantAnnotation_1.24.5 Expected behaviour: QUAL field is updated to 7 for all rows. Actual behaviour: nothing happens.

Previous versions of VariantAnnotation behaved as expected. The change log indicates:

'rowRanges<-' and 'mcols<-' on VCF class behave as they do on RangedSummarizedExperiment

I suspect this bug is a regression associated with that change. Note that it is only the fields in the fixed slot that have become immutable via the rowRanges() accessor. Other fields behave as expected.

vobencha commented 6 years ago

Hi, This change was intentional. The man page for VCF class was updated when the behavior changed to clarify that fixed fields should be modified via fixed<-.

       \code{mcols(rowRanges(x)) <- value}. This method does not manage the
       fixed fields, 'REF', 'ALT', 'QUAL' or 'FILTER'. To modify those
       columns use \code{fixed<-}.

Maybe this isn't clear enough. I've just modified the VCF class man page in devel to explicitly state that rowRanges<- won't work.

    \item \code{rowRanges}: A \link{GRanges}-class instance defining the
           variant ranges and associated metadata columns of REF, ALT, QUAL 
           and FILTER. While the REF, ALT, QUAL and FILTER fields can
           be displayed as metadata columns they cannot be modified with
           \code{rowRanges<-}. To modify these fields use \code{fixed<-}. 
d-cameron commented 6 years ago

Thanks for the response. Would it be possible to raise an error when an attempt to assign to one of these fields is made via rowRanges()? All existing code that uses assignment via rowRanges() is now silently broken and no longer performing as intended. I feel it make much more sense to actually throw an error and force a change, than to silently return (from the users' perspective) the incorrect result.

Out of curiosity, what was the reasoning behind disallowing updates via rowRanges(), but still including those fields in the rowRanges() output?

d-cameron commented 6 years ago

In my case, I know that I have some code that updates arbitrary rowRanges columns via string accessors so it's very difficult to find all the locations that use fixed fields.

'rowRanges<-' and 'mcols<-' on VCF class behave as they do on RangedSummarizedExperiment

Operations such as mcols(rowRanges(vcf))[[colname]] on VCF files behave quite differently to RangedSummarizedExperiment as I need to add special case code when when colname %in% c("REF", "ALT", "QUAL" or "FILTER").

If I can't convince you to revert this change, would I be able to simulate the existing behaviour by intercepting the rowRanges() assignment with my own function, and either passing though to rowRanges() or dispatching to fixed() based on the column being assigned to, or is this not possible because the rowRanges() and DataFrame assignments are different operators?

vobencha commented 6 years ago

Hi Daniel,

I'm sorry to hear this has caused problems for you. I wanted to give a little more background on why the change happened and when.

Why the change was made:

The fixed fields in the VCF class are actually stored in a separate slot called 'fixed' and not the 'rowRanges' slot. They are, however, optionally displayed as part of the mcols in the rowRanges(). This was not a good design decision because this gives the impression that the fixed fields are indeed part of what's extracted with rowRanges() and can be modified with the usual operators that modify a GRanges object. We went through several iterations of modifying rowRanges<- and mcols<- to manage these fixed fields but the code remained fragile and buggy. Trying to manage one slot through another slot's accessors was a big problem. All of this meant we could not reuse the standard behavior of rowRanges<- and mcols<- and when upstream changes were made to those functions it became a maintenance nightmare for the VCF-specific methods in VariantAnnotation.

Your example above demonstrates exactly this point. You are trying to modify a fixed field with accessors for a GRanges. The fixed fields are not part of the GRanges object in the rowRanges slot but are instead held in a DataFrame in a fixed slot. The are only appended to the GRanges mcols at the last minute in the show() method.

When the behavior changed:

This change happened in July 2017:

https://github.com/Bioconductor/VariantAnnotation/commit/da29aaa484cc30c685f3ded7be5808e0857e4b5a

At that time, an error was thrown if you tried to replace any of the fixed fields with the rowRanges<- or mcols<- replacement methods on a VCF object.

I don't remember the details of the discussion but I believe the hard stop (error) was considered too disruptive and the error message was removed in Nov 2017:

https://github.com/Bioconductor/VariantAnnotation/commit/f3fa4217a74b5fe2004368d72e9195ecf9e23d9b

Any code using devel VariantAnnotation between July-Oct 2017 would have thrown an error; the same for release and devel VariantAnnotation between Oct-Nov 2017. After Nov 2017 neither release or devel threw the error.

I'm not quite sure how you escaped the error messages and just recently became aware of this change. Given the history, I'm not planning to revert this change. I did not formally deprecate/defunct this aspect of the behavior as I maybe should have. Given that, I'm willing to put the message back (warning this time instead of error) to continue to alert users to the change for the remainder of this devel cycle. The warning would be removed after the April release.

Hopefully you can understand why we made the change and how separating management of the fixed fields and rowRanges makes the code base more robust going forward. Let me know if you think the added warning would be helpful for your current situation or not.

As for 'simulating the existing behavior', you would need to go back to VariantAnnotation <=1.22.* to see that behavior. The replacement methods were set on the VCF class so overriding them with a method of the same name on the same object is not a good idea. If you are set on doing this it would be better to name the function something different and take the source code from an old version of VariantAnnotation. This still requires a global search and replace. And remember, the code for those setters from old versions of VariantAnnotation was buggy and did not behave appropriately in all circumstances.

Valerie

d-cameron commented 6 years ago

Let me know if you think the added warning would be helpful for your current situation or not. Given that, I'm willing to put the message back (warning this time instead of error) to continue to alert users to the change for the remainder of this devel cycle. The warning would be removed after the April release.

Sorry for the extremely delayed response and thank you very much for your comprehensive response. The warning would have be useful but not as much as an error. I've already gone through my codebase so it wouldn't be useful to me personally and it's already April anyway.

I don't remember the details of the discussion but I believe the hard stop (error) was considered too disruptive and the error message was removed in Nov 2017:

The change to disallow assignment of fixed fields through rowRanges is itself the disruptive change and I would argue that the resultant error message is a natural consequence of this decision. If code is now broken then I would have thought it made sense to actually break it. As it stands now, all code that does this now silently returns the incorrect result instead of throw an error. Is the discussion leading up to this decision available anywhere publicly? I'm quite curious as to why silent data corruption was considered preferable to raising an error indicating that the code needs to change to function as intended.

vobencha commented 6 years ago

Unfortunately not all conversations regarding software changes are captured publicly. We do our best to track and document these decisions on the Support Site, bioc-devel, Slack Channels or GitHub issues (depending on where the question was raised). We also get a good deal of email sent to our personal accounts in which case we encourage posting to one of the public forums.

d-cameron commented 6 years ago

Fair enough. Thanks again for your in-depth replies.