Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
26 stars 20 forks source link

CollapsedVRanges? #2

Closed hubentu closed 6 years ago

hubentu commented 6 years ago

@lawremi @vobencha @hpages Hi, I am writing a package and want to use a data class very similar to VRanges but needs some modification. Is it possible to add a CollapsedVRanges/ExpandedVRanges, with slots of REF/ALT in the DNAStringSet/DNAStringSetList class? Such a class could be a simple and useful data structure for variants from other formats, such as maf/bed, instead of VCF. Thanks! Qiang

lawremi commented 6 years ago

Sure, it's possible, ideally via pull request ;)

hpages commented 6 years ago

Hi @hubentu, @lawremi, @vobencha, @Liubuntu, Isn't this a situation where extending VRanges might be easier than trying to modify the current VRanges? @hubentu You could implement this extension in your package for now and make the case for inclusion in VariantAnnotation later. H.

hubentu commented 6 years ago

Hi @hpages @lawremi @vobencha @Liubuntu , Sure. I can try to extend VRanges, but the ref/alt slots has been set as "character". To extend means to inherit it, so it can't be changed to XStringSet, right? Maybe I will just define a new class by extending GRanges? Thanks, Qiang

hpages commented 6 years ago

Oh I see. Then a pull request as @lawremi suggested is needed. It looks like union classes character_OR_DNAStringSet and/or character_OR_DNAStringSetList would need to be defined in VariantAnnotation, then the types of the ref/alt slots modified in VRanges class def.

hubentu commented 6 years ago

Thanks @hpages . I think the idea of defining union classes of character_OR_DNAStringSet and character_OR_DNAStringSetList for the ref/alt slots is great and will be more convenient in extending VRanges when needed and inherit the methods.

For now I will look into how to define the union class, and study the scripts of collapsedVCF/expandedVCF and try to write the collapsedVRanges/expandedVRanges.

hpages commented 6 years ago

Providing a little more details here after talking more with @Liubuntu about this.

Using a class union is kind of the usual trick to allow a slot to be of one type or the other. That's because the union class is the parent of the classes that belong to the union. For example if character_OR_XStringSet is defined as the union of character and XStringSet then it's the parent of both. Similarly if character_OR_XStringSet_OR_XStringSetList is defined as the union of character_OR_XStringSet and XStringSetList, then it's the grand-parent of character and XStringSet and the parent of XStringSetList. This allows ExpandedVRanges to specify alt="character_OR_XStringSet" and CollapsedVRanges to specify alt="XStringSetList" if VRanges has alt="character_OR_XStringSet_OR_XStringSetList". VRanges could then become a virtual class with 2 concrete subclasses ExpandedVRanges and CollapsedVRanges with ExpandedVRanges now playing the role of the current VRanges class.

This feels like a lot of contortions though. Another maybe cleaner design would be to remove the ref and alt slots from VRanges and make it a virtual class, and then have ExpandedVRanges and CollapsedVRanges add these slots with the desired type.

But now I see that VariantAnnotation defines a VRangesList class. Couldn't this be used for what you have in mind for CollapsedVRanges?

I'm adding @rcastelo to the discussion.

lawremi commented 6 years ago

Support for XString is not needed to support the storage of multiple alt alleles. While supporting XString might be a future direction, it's orthogonal to this problem, except that it would touch the same slots. There's no need to touch the @ref. Just the @alt needs to become "character_or_CharacterList".

In my experience, dealing with collapsed alts during an analysis is a pain. The head of this issue mentions MAF. If there is a desire to export to MAF in collapsed form, we could solve that problem without fundamentally changing the data structure. I guess we should consider the use cases.

rcastelo commented 6 years ago

Hi,

I do not work with cancer data where MAF/BED formats are used for storing somatic variants but here's my opinion as a user of VRanges with germline variation.

The definition of the VRanges class already suggests in VariantAnnotation/R/AllClasses.R#373 that alleles could be stored as a XStringSet object. This could be a good idea because of the more compact storage and I expect the number of changes involved in the packages using this class would be minimal.

However, considering a DNAStringSetList to store alleles in a VRanges would break, in my opinion, a design decision made on the VRanges class, but I guess @lawremi is the one who can ultimately shed light on this.

As a user of VRanges, my understanding of this class is that it stores only one allele per range on purpose to facilitate the semantics of storing the number of reads supporting an alternative allele (altDepth) and filtering variants with the FilterRules class. Using a DNAStringSetList object to store alleles, as in the CollapsedVCF class would at least complicate the use this semantics within VRanges. So, I guess that a virtual VRanges should not have the slots ref, alt, altDepth, softFilterMatrix and hardFilters and these slots would move to the ExpendedVRanges. The current functionality using the FilterRules class on VRanges objects would also move to the ExpandedVRanges.

However, looking at the TCGA MAF format, I do not see that this format requires storing more than one allele per range in a single column, as it happens with VCF. In fact, the SomaticSignatures package also works with MAF data and seems to be happily using the VRanges class as it is defined now. As far as I now, the BED format does not contemplate either storing more than one single value per row and column. From this perspective, maybe just extending VRanges as it is defined now, to better accomodate MAF data, could be a way forward. Talking to other developers with packages dealing with MAF data, such as SomaticSignatures, could also be an advantage in the long run.

hubentu commented 6 years ago

Hi @rcastelo ,

Thank you for your input. I totally agree with you that ExpandedVRanges is a good alternative for current VRanges, and MAF/bed usually don't store variants with multiple alleles.

However, my colleagues often use the simple bed format or plain text for collapsed alleles. One of my project was to deal with indels/SNPs with multiple alternative alleles and multi-nucleotide variants(MNV) (not from VCF), which requires a method to convert variants between collapsed and expanded. MNV even need a conversion method for neighbor regions. As a user, the CollapsedVRanges/ExpandedVRanges classes are exactly what I would expect.

Thanks and happy new year!

Best, Qiang

lawremi commented 6 years ago

I think it would be a lot simpler for you to just implement a importer/exporter for your bespoke "BED" representation of variants that operates on VRanges as it exists now. That's what we do with VCF now.

Liubuntu commented 6 years ago

Closed for no further response from the request.