Conte-Ecology / conteStreamTemperature

Package for cleaning and analyzing daily stream temperature
MIT License

package location #5

Closed djhocking closed 9 years ago

djhocking commented 10 years ago

We are accumulating many repos on GitHub in Conte-Ecology, and GitHub doesn't allow for hierarchies of repos. There is some concern that this will create a mess: when we come back to projects after time away, or as new people are added, it will be overly confusing to determine which repo to use.

This is particularly the case for the temperatureProject and streamTemperature repos. streamTemperature is the R package that will hold the functions, documentation, test data, and vignettes. temperatureProject is the repository for conducting analyses (RMarkdown files) and writing reports and manuscripts. temperatureProject Rmd scripts install and load the streamTemperature package. Given the similarity in the names, this could be quite unclear in the future.

There are a few options that I can see going forward but I'm unsure of the pros and cons of all and different people's preferences.

  1. Keep things separate and independent but each person can keep their cloned folders in directory structures that make sense to them. For example, my folder structure might be ~/Research/Stream_Climate_Change/temperatureProject for the analysis, reports, manuscripts, etc. and ~/Packages/streamTemperature for the streamTemperature package and other packages that I work on.
  2. Change the names of repos to reflect whether they are packages or not (e.g. packageStreamTemperature). I think this would work, and the package itself could still be named streamTemperature.
  3. Put streamTemperature as a package within temperatureProject. I'm not sure if this would cause problems with the R package build using RStudio, devtools, and roxygen2 or with git and GitHub. Although potentially complicated to have git nested within git it might be taken care of by adding streamTemperature/ to the .gitignore file within temperatureProject. This could also make sense if we wanted to host reports on GitHub using the gh-pages option. Then everything is in 1 repo. I'm just not sure if this would all work or if it would cause more headaches than it's worth.
  4. Put streamTemperature and all other packages as subfolders within a single repository called packages or contePackages. The downside of this (and maybe option 3) would be that commits, reverts, and such could get REALLY messy with git/GitHub.

Any opinions or other options?

walkerjeffd commented 10 years ago

I think this can be managed with just a little organization and a clearly defined workflow. Here are my thoughts:

  1. Each package should definitely have its own repo. This is important so that the package can easily be installed by (e.g. for streamTemperature):

    library(devtools)
    install_github("Conte-Ecology/streamTemperature")
    library(streamTemperature)
  2. The current repo names are indeed confusing. I was actually going to open an issue about this. At first glance, it's definitely unclear what the difference between streamTemperature and temperatureProject is.
  3. Start making better use of README files for each repo. This should clearly define what is in the repo, how to install it (if it's a package), or how to run the analyses (if it's Rmd files). For example, that little code snippet I put above in item 1. should be shown in the streamTemperature/README.md file.
  4. I think we should use gh-pages to create a homepage for the organization (more for internal use than external use). This would provide a list of the repos (maybe grouped by package, analysis, and other categories). Just a single page that lists what each repo is and what it's for should be enough. This would be done by creating a repo called Conte-Ecology.github.io and just putting simple HTML files in there (or we could use Jekyll, but that's more involved). I can set this up if you want.
  5. As we develop the packages and create analyses using them, it will be important to keep track of which version of the package was used for each analysis. So two things to address this problem are 1) using git tags to specify version numbers (just run git tag v0.0.1 to assign a tag to the current commit, then git push --tags to push the tags to github (note that tags are not pushed by default)), 2) in each analysis Rmd document, run print(sessionInfo()) at the very bottom of the document, which will write out the versions of R itself and all packages that were loaded when the document was compiled (so if someone needs to go back and re-run the analysis, they'll know what version of streamTemperature package was used).
  6. For analyses, this could be done a few different ways. It depends on what the analyses are, how often new ones are created, and how many people are doing the analyses (one or many). So maybe a good place to start is to list out what the expected analyses will be/look like. Will there be just one large analysis that would become a report, or lots of smaller ones? In any case, if all the analyses are done as Rmd files then they can be compiled to html. The repo for the analysis could then have a gh-pages branch where the output is stored and updated. This could also include an index.html file that lists the analyses and provides some description of what they are for.

anarosner commented 10 years ago

Hi, folks. I'm going to try to respond to a few different emails in this one, which I know isn't ideal. But, there have been so many, and I was really trying to focus on things to get done by today, that I haven't kept up with them.

First, though, there has been a lot of discussion of big picture organizational/structural stuff, and I'm finding it difficult to make sense of it all via email. I'd like to propose to have a meeting (with Jeff either in person or by phone). I had some conversations about it w/ Ben and Kyle this morning, and some with Kyle and Dan a couple days ago... seems it would be better to have everyone in the conversation at once. Maybe I'll send out a doodle poll?

With that said, here are some of my thoughts:

-Package organization-

I agree with Jeff that each package should have its own repo; this seems really crucial for installing using install_github(). So, I would vote that each package should be its own repo, and depending on how independent different groups of functions/processes are, maybe what are now big clusters of scripts should be broken into multiple packages and maybe not. Then the analysis would be a repo that is not a package; again, some analyses might be grouped under one repo, and some on their own. In my opinion, indicating this organization can probably be handled in the naming. Though, at the same time, I'm wary of packages that have long/complicated/difficult to remember names.

I've thought a lot about naming and organization for my packages and analysis repos, and tried a few different things. (If you've noticed me creating empty repos and then changing them or deleting them, etc, you might know that I've been trying a few different things.) I'm not sure if this is right for everyone or if it's even what I'll settle on finally; but, I think it might be worth it at an upcoming meeting to walk folks through the organization, explain my thinking, and get your input. I have some quick notes about the organization that maybe I'll put... not sure where... in the readme under my repo?

-Versions-

Keeping track of versions of packages seems important. I've tried out adding a git tag, but I'm not sure how it works. I really like the idea of printing sessionInfo() in each analysis script. When I do that, I get the version of the package set in the DESCRIPTION file, but not from the git tag. Anyhow, this is something we can get into later.

-github pages-

I'm on board with using github pages. I think they can be useful as a sort of table of contents and overview of each modeling project, as a place to host the analyses that are knit into markdown/html, and as a place for the demonstrations of packages that are knit from rmd into html. (These knit html files need to be on gh-pages branches, because when you view one of these files in your master branch repo on github.com, it is not rendered html, just code. This was a surprise to me a couple weeks ago... an annoying one.) Based on my initial foray into gh-pages, there are a number of kinks to work out, but I'll respond to that separately.

-Other things-

From some of the emails bouncing around the past few weeks, there were mentions of projectTemplates and of vignettes. Are these still part of the discussion? It seems we're agreed that packages have a lot of advantages over projectTemplates, and I'm not sure if there's a use for projectTemplate in addition to packages. Also, I'm not sure what advantage vignettes for the analysis (not package demos) have over knit markdowns. I haven't looked much into either of these issues, but just wanted to check to see if they were still

Also, when I started turning my scripts into packages, I thought I was going to be the group guinea pig, and have been putting together some notes to turn into a reference for making packages, documenting, etc. I know that others have started the packagizing process (my new word) already, but maybe these notes will still be helpful. I could probably send out next week, if there's interest. I also have some thoughts about general approaches to make scripts packages that I'm jotting down notes about and hopefully get some responses to.

Thanks, Ana

walkerjeffd commented 10 years ago

Trying to make this brief...

Package Organization: Agreed. Each package is one repo. Analyses are not bundled as packages.

Versions: You'll have to manually keep the git tag in sync with the version in the DESCRIPTION file. So when you update the version of the package to 0.2.0 and set that in DESCRIPTION, and you have committed all the updates, then just run git tag v0.2.0. There are also Releases on GitHub, but tags should be fine for now.
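
One way to reduce the manual step is to read the version straight out of DESCRIPTION when tagging. The following is only a sketch: the function names `description_version` and `tag_from_description` are made up for illustration, and it assumes the remote is called `origin` and that DESCRIPTION has a single `Version:` line.

```shell
#!/bin/sh
# Print the Version field of an R package DESCRIPTION file.
# Argument: path to the DESCRIPTION file (defaults to ./DESCRIPTION).
description_version() {
    sed -n 's/^Version:[[:space:]]*//p' "${1:-DESCRIPTION}"
}

# Tag the current commit as v<Version> and push that one tag.
# Hypothetical helper; assumes the remote is named "origin".
tag_from_description() {
    version="$(description_version "${1:-DESCRIPTION}")"
    if [ -z "$version" ]; then
        echo "No Version field found" >&2
        return 1
    fi
    git tag "v$version" && git push origin "v$version"
}
```

After committing a DESCRIPTION that says `Version: 0.2.0`, running `tag_from_description` from the package root would create and push the tag `v0.2.0`, so the two can't drift apart by a typo.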

Github Pages: Dan and I think we found a way to transfer files from master to gh-pages and keep them in sync with just a couple commands. It involves rebasing, which sounds a little scary if you haven't done it, but is a useful thing to learn anyway. Looks like Dan found a way to set this up so it happens automatically whenever you commit to master.

If you want to view the output of an Rmd file on github, there are two options.

  1. Configure RStudio to save the intermediate .md file (somewhere in the project settings). GitHub will render a .md document (though it won't look the same as the *.html file, especially if you use themes).
  2. Use http://rawgit.com/

Example: html file on rawgit, md file on github, and Rmd source file.

Analyses: Whether we go with ProjectTemplate or not for analyses, we should probably come up with a consistent file structure for the analyses. There is no real difference between a vignette and a knitr markdown these days; a vignette is just a knitr markdown that lives in a package and is part of the package's vignettes. So we can just call the analyses markdown documents. And each package should have one or more vignettes (as markdown documents) that give an example analysis showing how to use the package functions.

Notes: We definitely need to compile all of our notes somewhere, and also write out how we decide to do everything. There are quite a few instructions that we could write up (like how to develop a package, how to use git tags, how to sync master and gh-pages, etc.). I'm thinking we should have a repo that could serve this purpose, and we could use the wiki. There is already Conte-Ecology/getting-started, but maybe we integrate that info into a new repo called Conte-Ecology/group-notes or something (there's probably a term that describes what we're trying to do here, project management? workflows?). This could also serve as a centralized location for all these issues, which are kind of getting scattered across the various repos.

I'm going to add you all to the walkerjeffd/conte-web-app repo, which Chris, Andrew and I are using to develop the web db and app. That way you can see how we're using the wiki, which is really handy, and how we're using pull-requests and feature branching to develop this system. Note: you'll probably want to turn off notifications for this repo or you'll be getting multiple emails per day about it.

anarosner commented 10 years ago

In response to the info about gh-pages: the solution that you guys (Jeff and Dan) came up with sounds like what I'm looking for. Can you point me to some references on it? I'm just now at that stage where I'd like to put some knit files up.

The other options you mention I've thought about, but I opted not to use them. I couldn't get rawgit working; I thought it might have been a temporary thing because their server was down (other people's examples from blogs, etc weren't working either), but that just made me more inclined not to set up something that was dependent on their server.

I find the md output of r scripts hard to read (too many boxes around code and output snippets), and so I hacked together a css file to make it what I think is more readable. But, I need to render html in order for that to work.

So, send what you can about rebasing and I'll give it a go.

Thanks, Ana

p.s. Which gh-pages, or which repo should I look at to find the example Dan has working, that pulls from commits to master branch?

djhocking commented 10 years ago

I haven't responded yet because I haven't had time to try to get it working in full yet. I was hoping to do that today but I've gotten stuck on temperature model stuff longer than planned. If you want to be the one to really try it out with your CSS, here's the basic info:

http://lea.verou.me/2011/10/easily-keep-gh-pages-in-sync-with-master/

    git add .
    git status                  # see what changes are going to be committed
    git commit -m 'Some descriptive commit message'
    git push origin master
    git checkout gh-pages       # go to the gh-pages branch
    git rebase master           # bring gh-pages up to date with master
    git push origin gh-pages    # push the changes
    git checkout master         # return to the master branch

The second section can be substituted with a post-commit hook:

http://oli.jp/2011/github-pages-workflow/

"Paul Irish contributed this post-commit hook snippet for automating Lea’s workflow (save as .git/hooks/post-commit in your Git repo):"

    #!/bin/sh
    git checkout gh-pages
    git rebase master
    git checkout master

"This lets you replace the last five steps of Lea’s workflow with just git push --all. Nice!"

I will probably play around with it tonight, but again it might get delayed because I scheduled a call with a student to provide stats/study-design advice.
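
In case it helps whoever gets there first, here is a rough sketch of installing the post-commit hook quoted above. The function name `install_sync_hook` is just something made up for this example, and it assumes the target is an ordinary clone whose `.git/hooks` directory already exists. The `chmod` matters because git skips hook files that are not executable.

```shell
#!/bin/sh
# Write the post-commit hook quoted above into a repository and make it
# executable. Hypothetical helper.
# Argument: path to the repository root (defaults to the current directory).
install_sync_hook() {
    hook="${1:-.}/.git/hooks/post-commit"
    cat > "$hook" <<'EOF'
#!/bin/sh
git checkout gh-pages
git rebase master
git checkout master
EOF
    chmod +x "$hook"
}
```

Running `install_sync_hook` from the repo root and then committing on master should leave gh-pages rebased onto master, ready for `git push --all`. Hooks live in `.git/` and are not pushed or cloned, so each person would need to install it in their own copy.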

anarosner commented 10 years ago

Thanks for that, Dan. Yeah, I'm not exactly sure when I'll be able to get to this, so I guess whichever of us gets there first will slog through it. (Whoever figures it out and can show everyone else gets a cupcake?)

After a quick look at the links and code you sent, this seems to differ from references I've seen before that told me to do a git checkout --orphan gh-pages. Also, I'm not clear on whether the gh-pages branch would a) include everything from master, or b) include just the files (i.e. md and html) that you specify. Stuff I read previously suggested b, but it seems like these references would suggest a. It shouldn't really matter, though, right? If gh-pages uses links from the index.html page, and you link to things like knit documents, I figure it shouldn't matter that there's also a bunch of code on that branch.

Thanks, Ana