NESCent / r-popgen-hackathon

Population Genetics Hackathon, to be held at NESCent on March 16-20, 2015

Project idea: Create package with master classes for population genetic data #8

Open zkamvar opened 9 years ago

zkamvar commented 9 years ago

As mentioned in #4, we have a lot of different classes for handling population genetic data, and they are all useful in their own right. One thing to think about is that while the representation of the actual genetic data might differ (frequencies vs. values vs. bitwise), some basic forms of metadata are common to all of them: population assignment, individual assignment, etc.

Thinking along the same lines as modular synthesizers, I propose to create a package that defines a class or set of classes and methods that will formally define metadata that is often needed for population genetic analysis. This would allow for easier construction of conversion functions between future classes and result in a more consistent workflow between packages.
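
To make the modular idea concrete, here is a minimal sketch using only base R's `methods` package. All class and function names (`PopMeta`, `FreqData`, `BitData`, `popSizes`) are hypothetical and chosen for illustration; they are not taken from any existing package. The sketch assumes the shared metadata is just individual names and a population factor.

```r
library(methods)

# Hypothetical virtual class holding metadata shared by all data classes.
setClass("PopMeta",
         contains = "VIRTUAL",
         slots = c(ind = "character",  # individual names
                   pop = "factor"))    # population assignment

# Two toy data classes with different genetic representations.
setClass("FreqData", contains = "PopMeta", slots = c(tab = "matrix"))
setClass("BitData",  contains = "PopMeta", slots = c(bits = "list"))

# One accessor, written once against the metadata class.
setGeneric("popSizes", function(x) standardGeneric("popSizes"))
setMethod("popSizes", "PopMeta", function(x) table(x@pop))

f <- new("FreqData",
         ind = c("i1", "i2", "i3"),
         pop = factor(c("A", "A", "B")),
         tab = matrix(c(0.5, 1, 0, 0.5, 0, 1), nrow = 3))
b <- new("BitData",
         ind = c("i1", "i2"),
         pop = factor(c("A", "B")),
         bits = list(as.raw(1), as.raw(2)))

popSizes(f)  # works on the frequency representation
popSizes(b)  # and on the bitwise one, with no extra code
```

Because both data classes contain the same metadata class, anything written against `PopMeta` works for either representation, which is what would make conversion functions between classes easier to write.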

I realize that this might fall under the problem of proliferation of standards, but I believe that if we design these to be modular, it should not be an issue.

smhoban commented 9 years ago

Are you thinking of something like a conversion engine, along the lines of PGDSpider (http://www.cmpg.unibe.ch/software/PGDSpider/#Introduction) or Create (https://bcrc.bio.umass.edu/pedigreesoftware/node/2), but of course within R? I think that would be great and I think others would be interested too. Allan Strand and I have talked a bit about this before.

zkamvar commented 9 years ago

I was thinking more along the lines of the data representation within R as opposed to flat-file formats (that's not to say different file format handlers in R are not needed). Generally along the lines of what Bioconductor emphasizes for future package development:

Re-use existing S4 classes and generics where possible.

By creating a core set of classes that can be built upon, future developers can ensure interoperability within R. For example, I have created a class that contains the genind object from adegenet. These are still valid genind objects, so all of the methods associated with genind objects are also associated with the genclone objects and I didn't have to re-invent the wheel in terms of creating new methods to compute things like expected heterozygosity.
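
As a rough illustration of that pattern, the sketch below uses base R's `methods` package with toy stand-ins. `ToyGenind`, `ToyGenclone`, `nInd`, and the `mlg` slot are hypothetical names for this example only; they are not the real adegenet or poppr definitions.

```r
library(methods)

# Toy stand-in for an existing class (NOT adegenet's real genind definition).
setClass("ToyGenind", slots = c(tab = "matrix", pop = "factor"))

# A method written only for the parent class.
setGeneric("nInd", function(x) standardGeneric("nInd"))
setMethod("nInd", "ToyGenind", function(x) nrow(x@tab))

# Child class: everything in ToyGenind plus one extra (hypothetical) slot.
setClass("ToyGenclone", contains = "ToyGenind",
         slots = c(mlg = "integer"))

x <- new("ToyGenclone",
         tab = matrix(c(1, 0, 1, 0, 1, 0), nrow = 3),
         pop = factor(c("A", "A", "B")),
         mlg = c(1L, 2L, 1L))

is(x, "ToyGenind")  # TRUE: the child is still a valid parent object
nInd(x)             # 3: the parent's method is inherited, not rewritten
```

The `contains` argument is what lets methods defined for the parent apply to the child without being rewritten.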

Additionally, Bioconductor has a long presentation discussing the use of S4 classes and methods.

thibautjombart commented 9 years ago

I think it is indeed essential to reuse existing classes wherever possible. S4 is nice because inheritance is possible, which also makes classes easier to change. Zhian, in your example, I am not sure adding a level of hierarchy to @pop justifies what is effectively a new class. If this problem is general enough (feedback from the community will be useful here), the simplest course of action would be adding a new slot to the genind class.

Some classes have been around and used for a while, meaning they roughly do the job. I think we can build upon them, update what is necessary and favour interoperability. Metadata can be anything and everything, and will to some extent depend on the type of data - DNA sequences, allele frequencies, phylogenetic trees will have their peculiarities.

On Mon, Mar 9, 2015 at 8:23 PM, Zhian N. Kamvar notifications@github.com wrote:

I was thinking more along the lines of the data representation within R as opposed to flat file format (That's not to say different file format handlers in R are not needed). Generally along the lines of what Bioconductor emphasizes (http://www.bioconductor.org/developers/package-guidelines/) for future package development:

Re-use existing S4 classes and generics where possible.

By creating a core set of classes that can be built upon, future developers can ensure interoperability within R. For example, I have created a class that contains the genind object from adegenet (https://github.com/grunwaldlab/poppr/blob/master/R/classes.r#L95-L99). These are still valid genind objects, so all of the methods associated with genind objects are also associated with the genclone objects and I didn't have to re-invent the wheel in terms of creating new methods to compute things like expected heterozygosity.

Additionally, Bioconductor has a long presentation discussing the use of S4 classes and methods (http://www.bioconductor.org/help/course-materials/2010/AdvancedR/S4InBioconductor.pdf).


zkamvar commented 9 years ago

My example actually does what you suggest and adds new slots onto the genind class (the @hierarchy slot is used to set up the data and feed it into the @pop slot). Admittedly, my initial suggestion is a bit of a lofty goal and, pragmatically, building off of the existing classes would be the thing to do (besides, adegenet already contains the modular virtual classes: gen, popinfo, and indinfo).

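Perhaps an alternative would be to construct a short tutorial for future developers that outlines the following:

  • a list of the different classes and the type of data they are good for
  • why data classes are necessary and useful
  • examples of utilizing inheritance to add new functionality
    • in S4 classes
    • in S3 classes
    • from S3 to S4 classes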

The goal for either direction is to encourage future developers to contribute while maintaining interoperability between the packages and lowering the activation energy needed to do so.
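
For a flavour of what such a tutorial might show, here is a small sketch of two inheritance cases: inheritance between S3 classes, and using an S3 class from S4 code. It relies only on base R; every name in it (`toy_geno`, `toy_child`, `n_loci`, `Wrapper`, `nLoci`) is made up for illustration.

```r
library(methods)

# S3: inheritance is just a class vector, most specific class first.
make_toy <- function(tab) structure(list(tab = tab), class = "toy_geno")
n_loci <- function(x, ...) UseMethod("n_loci")
n_loci.toy_geno <- function(x, ...) ncol(x$tab)

make_child <- function(tab, note) {
  obj <- make_toy(tab)
  obj$note <- note
  class(obj) <- c("toy_child", class(obj))  # child first, parent second
  obj
}

s3 <- make_child(matrix(0, nrow = 2, ncol = 4), note = "demo")
n_loci(s3)  # falls through to n_loci.toy_geno via the class vector

# S3 to S4: register the S3 classes so S4 code can refer to them.
setOldClass(c("toy_child", "toy_geno"))

# The registered S3 class can now be used as a slot type ...
setClass("Wrapper", slots = c(data = "toy_geno", source = "character"))
w <- new("Wrapper", data = s3, source = "simulated")

# ... and in S4 method signatures.
setGeneric("nLoci", function(x) standardGeneric("nLoci"))
setMethod("nLoci", "toy_geno", function(x) n_loci(x))
nLoci(s3)
```

`setOldClass()` is the piece that bridges the two systems: without it, S4 has no record that `toy_child` extends `toy_geno`.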

Thoughts?

warnes commented 9 years ago

Hi Everyone,

I feel it is absolutely essential to provide a 'reference' R object class to store raw genetics data and annotations, along with appropriate tools to import/transform/export between this and common data formats.
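
As a sketch of what the import/export side could look like (not how GeneticsBase does it), the example below defines a hypothetical `RefGeno` class and registers conversions to and from a flat `data.frame` with `setAs()`. The column names `id` and `pop` and the one-column-per-locus layout are assumptions made for this example.

```r
library(methods)

# Hypothetical reference class: raw genotypes plus per-sample annotations.
setClass("RefGeno",
         slots = c(geno = "matrix",          # samples x loci, allele strings
                   samples = "data.frame"))  # one row per sample

# Import: build a RefGeno from a flat table with one column per locus.
setAs("data.frame", "RefGeno", function(from) {
  meta_cols <- intersect(c("id", "pop"), names(from))  # assumed metadata columns
  loci <- setdiff(names(from), meta_cols)
  new("RefGeno",
      geno = as.matrix(from[loci]),
      samples = from[meta_cols])
})

# Export: flatten back to a data.frame, annotations first.
setAs("RefGeno", "data.frame", function(from) {
  cbind(from@samples, as.data.frame(from@geno, stringsAsFactors = FALSE))
})

flat <- data.frame(id = c("s1", "s2"), pop = c("A", "B"),
                   loc1 = c("A/A", "A/T"), loc2 = c("G/G", "G/C"),
                   stringsAsFactors = FALSE)
ref  <- as(flat, "RefGeno")     # import
back <- as(ref, "data.frame")   # export
```

With one such hub class, each external format needs only a pair of converters to and from the hub rather than one per pair of formats.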

In 2004, the other members of the R-Genetics project (https://sourceforge.net/projects/r-genetics/) and I developed the GeneticsBase package for Bioconductor for this purpose. The basic code is now somewhat dated, but it should serve as a good foundation for a modern update/reimplementation.

One of my desires for the Hackathon is to revive, update, and extend GeneticsBase and the other R-Genetics project packages, building appropriate tools to integrate with the current crop of genetics tools, both R and stand-alone.

The source code for all of the R-Genetics packages is available in the SourceForge CVS repository at http://r-genetics.cvs.sourceforge.net/viewvc/r-genetics/.

-Greg



thibautjombart commented 9 years ago

Hi there, most classes are documented already in their respective packages, but a document providing an outline of the different classes, their structures and accessors would surely be useful. I think this was Emmanuel's idea as well. While useful for everyone, I think the emphasis should be more on the users than on the developers though. Usually, contributors / package developers seem to be OK figuring out class contents.

On Tue, Mar 10, 2015 at 5:11 PM, Zhian N. Kamvar notifications@github.com wrote:

My example actually does what you suggest and adds new slots onto the genind class (the @hierarchy slot is used to set up the data and feed it into the @pop slot). Admittedly, my initial suggestion is a bit of a lofty goal and, pragmatically, building off of the existing classes would be the thing to do (besides, adegenet already contains the modular virtual classes: gen, popinfo, and indinfo).

Perhaps an alternative would be to construct a short tutorial for future developers that outlines the following:

  • a list of the different classes and the type of data they are good for
  • why data classes are necessary and useful
  • examples of utilizing inheritance to add new functionality
    • in S4 classes
    • in S3 classes
    • from S3 to S4 classes

The goal for either direction is to encourage future developers to contribute while maintaining interoperability between the packages and lowering the activation energy needed to do so.

Thoughts?


hlapp commented 9 years ago

The source code for all of the R-Genetics packages is available in the SourceForge CVS repository at http://r-genetics.cvs.sourceforge.net/viewvc/r-genetics/.

Would it be worth converting that to Git?

emmanuelparadis commented 9 years ago

Hi, all these discussions make me think that we already have a lot of good stuff for population genetics in R. So yes, I agree that a friendly "synthesis" of the available information would be great.

grunwald commented 9 years ago

I agree with all the great posts. A synthesis of the available tools in a primer/wiki and a move to GitHub would be great.

peterdfields commented 9 years ago

+1!

warnes commented 9 years ago

Yes, absolutely.

I'm very time-constrained this week because of a family health issue, so it would be a great help if someone could assist with doing this.

Actually, the most recent code for these packages is probably in the Bioconductor svn tree.

