danmaclean / gee_fu

An extensible Ruby on Rails web-service application and database for visualising HTGS data
18 stars 5 forks

GitHub feature #27

Closed mrship closed 11 years ago

mrship commented 11 years ago

@danmaclean I've started looking at adding the GitHub feature and realised that I don't have a clear idea of how you want to implement it.

  1. Do you want all GFFs that are uploaded to be stored to GitHub automatically as well?
  2. Do you want a simple UI to choose from the GFFs in the database to upload to the Github repo?
  3. Do you want these to go to danmaclean/gee_fu, or do you want to be able to specify that in some config setup?
  4. (assuming the latter for 3) Do you envisage the repos that GFFs are posted to would be private? If so we need to consider username/password etc.

Any other thoughts?

Let me know and I'll review further.

danmaclean commented 11 years ago
  1. Yes, mostly. The stuff that goes into the database should come down into the GitHub repo when it is new, with the extra information on who and when.
  2. No, this should be entire and automatic.
  3. The app needs to write to a user specifiable repo (especially if this is going to be a downloadable app and not just a single instance). And to recreate a specified directory structure starting at a specified point.
  4. No, I would think these would be public.

The data from the db needs to come into this sort of structure for GitHub. I'm not sure whether this is possible, so let me know if this is not how it works.

    ├── org_1                        # a separate folder for each organism
    │   ├── genome_1                 # with sub folders for each new genome
    │   │   ├── experiment_1         # and sub folders for each new experiment
    │   │   │   ├── experiment.info  # a summary file with the experiment metadata
    │   │   │   ├── features.gff     # the current contents of the features table - for that experiment - including extra audit info
    │   │   │   └── predecessors.gff # the predecessors for the features in this experiment
    │   │   ├── experiment_2         # some other experiment
    │   │   ├── genome.info          # metadata for this genome
    │   │   └── sequence.fa          # the sequence of the references for this genome in multi fasta format
    │   ├── genome_2                 # some other genome
    │   └── org.info                 # metadata for this organism
    └── org_2                        # some other organism

Hope this makes sense..
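For illustration, a minimal sketch of an exporter that would lay the hierarchy out on disk. The nested-hash data shape and method name here are assumptions for the example, not GeeFU's actual models:

```ruby
require "fileutils"
require "yaml"

# Hypothetical sketch: walk organisms -> genomes -> experiments and
# mirror the proposed hierarchy on disk. All keys/names are assumed.
def export_tree(root, organisms)
  organisms.each_with_index do |org, i|
    org_dir = File.join(root, "org_#{i + 1}")
    FileUtils.mkdir_p(org_dir)
    File.write(File.join(org_dir, "org.info"), org[:info].to_yaml)
    org[:genomes].each_with_index do |genome, j|
      gen_dir = File.join(org_dir, "genome_#{j + 1}")
      FileUtils.mkdir_p(gen_dir)
      File.write(File.join(gen_dir, "genome.info"), genome[:info].to_yaml)
      File.write(File.join(gen_dir, "sequence.fa"), genome[:fasta])
      genome[:experiments].each_with_index do |exp, k|
        exp_dir = File.join(gen_dir, "experiment_#{k + 1}")
        FileUtils.mkdir_p(exp_dir)
        File.write(File.join(exp_dir, "experiment.info"), exp[:info].to_yaml)
        File.write(File.join(exp_dir, "features.gff"), exp[:features_gff])
        File.write(File.join(exp_dir, "predecessors.gff"), exp[:predecessors_gff])
      end
    end
  end
end
```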

mrship commented 11 years ago

I'm making progress with this, but question what the name of the folder of the organism should be - presumably the genus of the organism? If so, we'd need to make the genus unique across all organisms in the database. Not a problem, technically, but is it an issue with the organisms you need to create? Would a combination of genus/species be a better unique key?

danmaclean commented 11 years ago

The genus and species could conceivably not be unique, as strains or ecotypes within species could be used. (This wasn't an issue in the old design.)

So we could have something like 'ncbi_taxid'_'genus'_'species'_'strain'_'db_id', where the bits in quotes get replaced for each entry, or replaced with e.g. 'no_ncbi_taxid' when not available.

I know this seems long-winded, but it's sort of important for the readability of the paths that get created, for the eventual lab user.
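A sketch of that fallback naming scheme (the method name, key list, and whitespace handling are assumptions for illustration):

```ruby
# Hypothetical sketch: substitute each attribute into the folder name,
# or use a "no_<attribute>" placeholder when the value is missing/blank.
def organism_folder_name(attrs)
  %i[ncbi_taxid genus species strain db_id].map { |key|
    value = attrs[key]
    if value.nil? || value.to_s.strip.empty?
      "no_#{key}"
    else
      value.to_s.strip.gsub(/\s+/, "_")  # keep paths readable: no spaces
    end
  }.join("_")
end
```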

Thanks Dan


mrship commented 11 years ago

Ok, I'm working on that basis and making progress. I should have something to show tomorrow.

mrship commented 11 years ago

In reviewing the data for Organisms, where you've asked for:

└── org.info # metadata for this organism

What metadata are you looking for? There's no meta.yml associated with an Organism (at this stage), so is this just the organism attributes written out as YAML?

Also, I'm going to take a step back on this to link the Genomes to a specific Organism (per your outline of the data model) so that the hierarchy exists to parse to create the folder structure, i.e. remove belongs_to :assembly and add belongs_to :organism.

danmaclean commented 11 years ago

So in the sample data, you are right: there isn't any metadata, because earlier versions had one organism only, so it would have been just the stuff in the table. It's worth using that now, still, and if they provide metadata in that YAML, bolting that on.

I may have confused the issue of metadata in my description above. Really, the metadata here is the database data (it's just that its contents get called metadata in our other repos that look like this one, e.g. https://github.com/ash-dieback-crowdsource/data/). Apologies for this.

mrship commented 11 years ago

It's worth using that now, still, and if they provide metadata in that YAML, bolting that on.

What do you mean? There is no YAML associated with an Organism. You only upload the meta.yml file under a genome - does that need to change?

Or are you just wanting it to output the organism attributes as a YAML file? i.e.:

organism:
  genus: "Arabidopsis"
  species: "thaliana"
  strain: "Col 0"
  pathovar: "A"
  taxid: "3702"
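If it is just the attributes, a small sketch of the idea (the attribute whitelist comes from the example above; the method itself is hypothetical, not GeeFU's actual code):

```ruby
require "yaml"

# Sketch of the proposed output: nest a whitelist of organism attributes
# under a top-level "organism" key, dropping internal columns like id.
def organism_yaml(attributes)
  wanted = %w[genus species strain pathovar taxid]
  { "organism" => attributes.select { |k, _| wanted.include?(k) } }.to_yaml
end
```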
danmaclean commented 11 years ago

Of course: no YAML uploaded, no meta. My mistake, sorry for the confusion. I think the organism attributes should be extended a little to make sure we get enough data; there should be an extra attribute, 'local name', which should be unique to the organism. Is that doable and easy enough?

mrship commented 11 years ago

I can add 'local name' easily enough, sure. To confirm, that would be unique across all organisms in the database? If so, how do you want that to fit into the folder naming (when I get back to that)?

Also, re: the YAML output for organisms, is what I proposed correct? (subject to adding in the new local name attribute). i.e.

organism:
  local name: "Everybody's favourite"
  genus: "Arabidopsis"
  species: "thaliana"
  strain: "Col 0"
  pathovar: "A"
  taxid: "3702"
danmaclean commented 11 years ago

This YAML looks great!

mrship commented 11 years ago

OK, so going back to the folder name - can we use the local_name for the folder and have that YAML as a file within it? Rather than construct the longer folder name from all the attributes?

danmaclean commented 11 years ago

Great idea!

mrship commented 11 years ago

OK, the default YAML for an organism is:

--- !ruby/object:Organism
attributes:
  id: 1
  genus: Arabidopsis
  species: thaliana
  strain: Col 0
  pv: A
  taxid: 3702
  created_at: 2013-04-19 13:33:52.728075000 Z
  local_name: My favourite organism

I can change it to what we agreed (above) but are any of these additional attributes useful?

mrship commented 11 years ago

As an update, the data hierarchy stuff obviously took away focus from the Github feature but I'm back on that now and deep in the weeds working through getting the correct folder structure. I'll let you know when there's something to show.

mrship commented 11 years ago

@danmaclean can you comment on the yaml format 2 comments up please?

danmaclean commented 11 years ago

That's great. Is it possible to add created_by from PaperTrail?

mrship commented 11 years ago

Well, the "whodunnit" in PaperTrail indicates who has changed an element, so you'd only get that from someone making a change to an Organism, rather than the initial create. If you want to know who created the Organism (and other models) in the first place then I'll have to add that in separately. Is that a requirement?
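For reference, a sketch of how that distinction looks with stock PaperTrail (an untested fragment; it assumes versioning is already enabled on Organism):

```ruby
# Fragment, assuming stock PaperTrail on a versioned model. "whodunnit"
# is stored per version, so the closest thing to a creator is the
# whodunnit on the "create" event; later edits get their own versions.
organism = Organism.find(1)
creator  = organism.versions.find_by_event("create").try(:whodunnit)
editors  = organism.versions.find_all_by_event("update").map(&:whodunnit)
```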

danmaclean commented 11 years ago

I see. I guess just popping the email address of the person onto the organism (and genome and experiment, but not features, because they should be created with the experiment) is a fairly simple thing and just needs a small migration. If so, it'd be good to have on there.

I'm conscious of coming up with too many little things that get in the way of the headline goal, so do let me know if I suggest something that is getting in the way.

mrship commented 11 years ago
  1. OK, I can add the user_id into the organism easily enough, but do you want it on the other models as well? i.e. who created a specific Genome or Experiment?
  2. Also, do you have anything that writes a Fasta file from the References? Looking at the bio gem it appears that it only reads Fasta files? You mention above that you wanted the Fasta file for the genome in the git repo but you don't save the input file anywhere, so it would appear that it needs re-constructing from the data in the database? That would seem complicated - can you advise?
danmaclean commented 11 years ago
  1. Yes please.
  2. assuming Reference.sequence.sequence returns a String object then this should work to get the thing into a fasta format

https://gist.github.com/danmaclean/5425505
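Assuming Reference.sequence.sequence does return a plain String, the formatting itself is simple even without BioRuby. A pure-Ruby sketch of the same idea (the Gist above is the actual proposal; the method name and signature here are illustrative):

```ruby
# Pure-Ruby sketch: wrap a sequence string at a fixed column width
# under a FASTA header line. Names and defaults are assumptions.
def to_fasta(name, sequence, width = 60)
  ">#{name}\n" + sequence.scan(/.{1,#{width}}/).join("\n") + "\n"
end
```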

mrship commented 11 years ago
  1. OK, I'll look at putting that in.
  2. Per that Gist it looks easy enough. I'll let you know how I get on with that.
  3. Per your directory structure outline, as I reach Experiments what's the best way to construct the Feature and Predecessor GFF?
danmaclean commented 11 years ago

@mrship 3. The Feature and Predecessor objects both have to_gff methods that return a GFF string.

danmaclean commented 11 years ago

@mrship actually, the Feature#to_gff method rewritten like this https://gist.github.com/danmaclean/5426143 will probably work better, as it uses the proper GFF3 class in BioRuby. The method will allow you to pass extra values such as created_by or whatever as a hash, though it seems like the keys of the hash must be strings; the to_a method appears to be throwing out symbols.
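As a rough illustration of the output such a to_gff produces, here is a hand-rolled sketch (not the Gist's BioRuby-based version) that merges extra audit values, keyed by strings as noted above, into column 9:

```ruby
# Hypothetical sketch of one GFF3 line. Column 9 holds key=value pairs
# separated by ";"; extra audit attributes are merged in alongside the
# feature's own attributes. All names here are assumptions.
def feature_to_gff(seqid:, source:, type:, start:, stop:, score: ".",
                   strand: ".", phase: ".", attributes: {}, extra: {})
  attrs = attributes.merge(extra).map { |k, v| "#{k}=#{v}" }.join(";")
  [seqid, source, type, start, stop, score, strand, phase, attrs].join("\t")
end
```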

danmaclean commented 11 years ago

@mrship and this slight elaboration for the Predecessor#to_gff https://gist.github.com/danmaclean/5426171

mrship commented 11 years ago

With PR #31 we have the initial folder structure being exported to a configurable folder (currently /repository). I'll continue to work on that to get the correct YAML and GFF/FASTA files output. However, questions now are on how this folder structure works with Git/Github:

  1. Do you want me to code a feature to push the repo automatically? Or would it be enough to just commit the changes manually and push?
  2. Assuming this needs to be automatic, do you want a branch creating for the repository changes so that they can be tracked separately and not necessarily merged into the master branch?
  3. Is it OK for the folder to remain part of the GeeFU repository? You'd need to think carefully about cloning the repo to run your own copy of the code as the repo would be included automatically - unless you have repository changes under a separate branch.

Also, with respect to creating the repository folder structure:

  1. Is it sufficient to have a rake task that is run manually?
  2. If it is to be automatic we need to consider the best way to run a job to create the folder. For example, it might be costly to run it after every change to the database. Perhaps it makes more sense to run a nightly job instead?

I've included SideKiq so we can have a job that runs separately to the main site to create the repository (and perhaps sync to GitHub automatically, subject to Q1-Q3) but let me know your thoughts on when the job should run.
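By way of illustration, the export job could look something like this (a sketch only; the worker and exporter class names are invented, not part of GeeFU, and the nightly trigger would still need cron or a scheduler to enqueue it):

```ruby
# Hypothetical Sidekiq worker; RepositoryExporter and the repository
# path are invented names for illustration only.
class RepositoryExportWorker
  include Sidekiq::Worker
  sidekiq_options queue: :repository

  def perform
    RepositoryExporter.new(Rails.root.join("repository")).run
  end
end
```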

danmaclean commented 11 years ago
  1. Ideally, yes. I think we want the overhead of managing the app to be as low as possible.
  2. As you point out, the best payoff is that we have no 'mess', and the biological data can be kept in an entirely separate repo to GeeFu. So we do need them to be tracked separately: not on the master branch, and ideally not in any other branches to do with the code.
  3. Not really; this would get messy if we had different instances of the app, as one set of data would need taking out and then there could be problems.

I was wondering if all this could be achieved with Git submodules. I've read about these (perhaps my reading is out of date: http://git-scm.com/book/en/Git-Tools-Submodules) but never used them. It seems to me that setting the data repo up within the code repo, so that it can change independently, should be possible. Is it imperative that it is either a branch or in the code repo?

  1. Probably not, no. We need to keep this as admin-free as possible.
  2. A nightly job will be fine, I think. Unless SideKiq doesn't use lots of resources exporting.

Let me know if this answers enough.

danmaclean commented 11 years ago

@mrship As a side note, I'm not averse to this whole repo-depositing thing being a completely separate server-side script running outside the app. If this gets horrendous, I'm happy for the repository dir to go into .gitignore, and to have a script (maybe even running under cron or similar) that just reads that folder and pushes to an arbitrary repo. I'm all for simplicity in this and can see that this might get complicated. Let me know your views on this; I know it seems a cop-out, but I don't think we need to be so stuck on having the app implement this feature, so long as we have a piece of software that does.

mrship commented 11 years ago

For now, I'll implement created_by for Organism, Genome and Experiment, plus FASTA export and GFF export, and then we can review the best way to store the folder structure. All of the above have pros and cons, not least in the time taken to develop!

mrship commented 11 years ago

OK, see #33 for a (corrected) PR to add 'whodunnit' to the YML. I'll look at the FASTA/GFF export next.

mrship commented 11 years ago

Fasta support per #35. GFF support next. Then we can review how to store this folder structure.

mrship commented 11 years ago

See #37 for GFF support. Can you please run rake repo:export on your end to test and let me know if you're happy with the YAML formats that I've put in place. Question:

A. Do you want the email address of the person making a change to any of the 3 models, as well as their name? i.e.

---
experiment:
  Name: TAIR9 GFF
  Description: first one
  Last updated by: Andy Shipman (andy@example.com)
  Last updated on: 23 April 2013

B. Now we have the folder structure and data output, we need to decide how to share this folder structure. Have you decided if you want to leave it to the user, or automate it with git sub-modules/other repo? Obviously the simplest approach (for me) is for there to be documentation to provide instruction on how to push to a separate git repo rather than automating it, but it's really your decision how to proceed at this point!

danmaclean commented 11 years ago

@mrship Can't at the moment. rake repo:export gives this (failing because of a reference without an id): https://gist.github.com/danmaclean/5444205

Starting the app to reload data fails with foreman s because of SideKiq, apparently: https://gist.github.com/danmaclean/5444177 Any tips?

danmaclean commented 11 years ago

But wrt the questions: A. Yes, please. B. I'd need it to be automated, somehow, but will take your advice on the difficulties with the different approaches. Git submodules sounds nice but seems hard. Can we have the app push to a different GitHub repo? If not, a script in a separate process that runs the rake task and pushes to a defined repo will suffice. Any estimates on time?

mrship commented 11 years ago

Ah. I forgot that for SideKiq you need redis installed. You can either:

  1. Install that (do you use homebrew? If so, brew install redis)
  2. Remove the worker line in the Procfile
  3. Or I can remove SideKiq for now. See PR #38.
mrship commented 11 years ago

Regarding the issue of the missing reference_id, I'm not sure how best to address it. Basically, what happened (prior to PR #36) was that the reference_id didn't get set when a Feature was updated (and therefore a new Feature created), i.e. this was the code:

    @feature.reference_id = Reference.find(:first, :conditions => {:genome_id => genome_id, :name => @feature.seqid} ) || old_feature.ref_id #note missing id on the Reference find

whereas now the code is:

    @feature.reference_id = Reference.find(:first, :conditions => {:genome_id => genome_id, :name => @feature.seqid} ).id || old_feature.ref_id

So, what I could do is run over all Predecessors to find their reference_id and map that back to the Features that got updated. I can create a migration for that if you want - or can we ignore old (corrupted) datasets?
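If the migration route is wanted, a rough backfill sketch could look like this (a fragment only; it assumes Predecessor keeps a usable reference_id and a feature_id linking back to its Feature, which are guesses at the schema):

```ruby
# Hypothetical data migration: copy the reference_id recorded on each
# Predecessor back onto any Feature that lost its reference_id under
# the old (pre-#36) update code. Column names are assumptions.
class BackfillFeatureReferenceIds < ActiveRecord::Migration
  def up
    Predecessor.find_each do |pred|
      feature = Feature.find_by_id(pred.feature_id)
      next unless feature && feature.reference_id.nil?
      feature.update_column(:reference_id, pred.reference_id)
    end
  end

  def down
    # data backfill; nothing sensible to undo
  end
end
```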

mrship commented 11 years ago

Back to the other questions: A. Done, see #39. B. I'll look into it more and suggest an automated solution.

danmaclean commented 11 years ago

@mrship I do use homebrew and getting redis installed was fine; like a numpty, though, I had already merged the PR. I won't revert, as I'll break something. Fine to put it back in, as the app now works locally.

Regarding the broken data, I thought that I'd be able to delete the data through the web interface and re-enter it to get round the issue. But that raised another issue: https://gist.github.com/danmaclean/5444903 . This aside, is there a reason just replacing the data wouldn't work?

mrship commented 11 years ago

It seems my testing of adding a new experiment didn't include a YAML file (i.e. I was using the one attached to the Genome). I'll fix.

As for replacing the data - do you mean locally on your dev box and then re-running the rake repo:export? If so, that's exactly what I did and it works fine thereafter.

mrship commented 11 years ago

OK, for a more automated solution to pushing the data, I suggest that we have the data as a different repository within the code repo. It will, however, be totally separate and .gitignore'd from the main repo (see #40). The reason for this thinking is that git sub-modules are really for code you want in the main repo rather than, in this case, data that we don't want in the repo.

As an example, see the repo I've created under @mrship

Here are the setup steps to create a separate GitHub repo for the data.

I foresee that we create a rake task that both builds the repo contents with repo:export and then runs:

git add .
git commit -am 'Update GeeFU data'
git push origin master

That can then be automated with cron as required.

That should be automated enough, IMO. What do you think?
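Those steps could be wrapped in a rake task along these lines (a sketch only; the task name, the :export prerequisite, and the repository path are assumptions):

```ruby
# Hypothetical rake task: run the existing repo:export, then shell out
# to git exactly as in the manual steps above. Path is assumed.
namespace :repo do
  desc "Export the data repository and push it to GitHub"
  task :push => :export do
    Dir.chdir(Rails.root.join("repository")) do
      sh "git add ."
      sh "git commit -am 'Update GeeFU data'"
      sh "git push origin master"
    end
  end
end
```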

danmaclean commented 11 years ago

As for replacing the data - do you mean locally on your dev box and then re-running the rake repo:export? If so, that's exactly what I did and it works fine thereafter.

Brilliant! I'll get on that, then.

This seems automated enough, yes. How would this be implemented? Through old-style cron, requiring a user to set up a specific cron task, or can this be done with something like whenever (http://rubydoc.info/gems/whenever/0.8.2/frames)? I don't know if this is any good.
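For what it's worth, with whenever the schedule would be a small config/schedule.rb fragment along these lines (a sketch; it assumes a rake task named repo:export exists):

```ruby
# config/schedule.rb sketch for the whenever gem: run the export nightly.
# `whenever --update-crontab` then writes the matching cron entry.
every 1.day, at: "2:00 am" do
  rake "repo:export"
end
```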

mrship commented 11 years ago

See #41 for a fix to uploading a YAML file with an experiment.

I'll look at whenever to automate the git push next.

mrship commented 11 years ago

See #43 for an automated solution. It runs a separate process, from the Procfile, that creates a SideKiq job to create the data repo and then push it to GitHub. Obviously it'll only work if you've set up the data git repo.

mrship commented 11 years ago

Of note, this is pretty basic as it just shells out to git commands. If you need it wrapping up so that it can cope without being set up first, then I'll have to review further.

mrship commented 11 years ago

Actually, we do need a check in place as otherwise the data would get added to the main repo! See #44 for a fix.

mrship commented 11 years ago

I believe this is now complete. Re-open if I've missed something.