The data from the db needs to come into this sort of data structure for GitHub. I'm not sure whether this is possible, so let me know if this is not how it works.
├── org_1                        #a separate folder for each organism
│   ├── genome_1                 #with sub folders for each new genome
│   │   ├── experiment_1         #and sub folders for each new experiment
│   │   │   ├── experiment.info  #a summary file with the experiment metadata
│   │   │   ├── features.gff     #the current contents of the features table - for that experiment - including extra audit info
│   │   │   └── predecessors.gff #the predecessors for the features in this experiment
│   │   ├── experiment_2         #some other experiment
│   │   ├── genome.info          #metadata for this genome
│   │   └── sequence.fa          #the sequence of the references for this genome in multi-FASTA format
│   ├── genome_2                 #some other genome
│   └── org.info                 #metadata for this organism
└── org_2                        #some other organism
Hope this makes sense..
I'm making progress with this, but question what the name of the folder of the organism should be - presumably the genus of the organism? If so, we'd need to make the genus unique across all organisms in the database. Not a problem, technically, but is it an issue with the organisms you need to create? Would a combination of genus/species be a better unique key?
The genus and species could conceivably not be unique, as strains or ecotypes within species could be used. (This wasn't an issue in the old design.)
So could we have something like 'ncbi_taxid'_'genus'_'species'_'strain'_'db_id', where the bits in quotes get replaced for each entry, or replaced with e.g. 'no_ncbi_taxid' when a value isn't available?
I know this seems long-winded, but it's sort of important for the readability of the paths that get created for the eventual lab user.
Thanks Dan
Ok, I'm working on that basis and making progress. I should have something to show tomorrow.
In reviewing the data for Organisms, where you've asked for:
└── org.info #metadata for this organism
What meta data are you looking for? There's no meta.yml associated with an Organism (at this stage), so is this just the organism attributes written out as YAML?
Also, I'm going to take a step back on this to link the Genomes to a specific Organism (per your outline of the data model) so that the hierarchy exists to parse to create the folder structure, i.e. remove belongs_to :assembly and add belongs_to :organism.
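For illustration, a minimal sketch of that association change (model and association names are assumed from the hierarchy above; the has_many sides are implied rather than confirmed):

# app/models/genome.rb - sketch only
class Genome < ActiveRecord::Base
  belongs_to :organism            # replaces the previous belongs_to :assembly
  has_many :experiments           # assumed, per the organism > genome > experiment hierarchy
end

# app/models/organism.rb - sketch only
class Organism < ActiveRecord::Base
  has_many :genomes               # gives the hierarchy needed to build the folder structure
end

With the organism at the top of the chain, the exporter can walk organism.genomes and genome.experiments to lay out the folders in the tree shown above.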
So in the sample data you are right, there isn't any metadata, because earlier versions had one organism only, so it would have been just the stuff in the table. It's worth using that for now, and if they provide metadata in that YAML, bolting it on.
I may have confused the issue of metadata in my description above; really the metadata here is the database data. (It's just that its contents get called metadata in our other repos that look like this one, e.g. https://github.com/ash-dieback-crowdsource/data/.) Apologies for this.
It's worth using that for now, and if they provide metadata in that YAML, bolting it on.
What do you mean? There is no YAML associated with an Organism. You only upload the meta.yml file under a genome - does that need to change?
Or are you just wanting it to output the organism attributes as a YAML file? i.e.:
organism:
  genus: "Arabidopsis"
  species: "thaliana"
  strain: "Col 0"
  pathovar: "A"
  taxid: "3702"
Of course, no YAML uploaded, no metadata; my mistake. Sorry for the confusion. I think the organism attributes should be extended a little to make sure we get enough data; there should be an extra attribute, 'local name', which should be unique to the organism. Is that doable and easy enough?
I can add 'local name' easily enough, sure. To confirm, that would be unique across all organisms in the database? If so, how do you want that to fit into the folder naming (when I get back to that)?
Also, re: the YAML output for organisms, is what I proposed correct? (subject to adding in the new local name attribute). i.e.
organism:
  local name: "Everybody's favourite"
  genus: "Arabidopsis"
  species: "thaliana"
  strain: "Col 0"
  pathovar: "A"
  taxid: "3702"
This YAML looks great!
OK, so going back to the folder name - can we use the local_name for the folder and have that YAML as a file within it? Rather than construct the longer folder name from all the attributes?
Great idea!
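A minimal migration sketch for the 'local name' attribute agreed above (table and column names are assumptions; the unique index enforces the one-per-organism requirement):

# sketch: add a unique local_name to organisms
class AddLocalNameToOrganisms < ActiveRecord::Migration
  def change
    add_column :organisms, :local_name, :string
    add_index  :organisms, :local_name, :unique => true   # unique across all organisms
  end
end

# and in the model, something like:
# validates :local_name, :presence => true, :uniqueness => true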
OK, the default YAML for an organism is:
--- !ruby/object:Organism
attributes:
  id: 1
  genus: Arabidopsis
  species: thaliana
  strain: Col 0
  pv: A
  taxid: 3702
  created_at: 2013-04-19 13:33:52.728075000 Z
  local_name: My favourite organism
I can change it to what we agreed (above) but are any of these additional attributes useful?
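For comparison, a sketch of writing the agreed org.info format by hand rather than the default !ruby/object dump; the column names (including pv for pathovar) are taken from the dump above, and the helper name is made up:

# sketch: build the agreed org.info YAML from an Organism record
def organism_info_yaml(organism)
  { 'organism' => {
      'local name' => organism.local_name,
      'genus'      => organism.genus,
      'species'    => organism.species,
      'strain'     => organism.strain,
      'pathovar'   => organism.pv,      # column is pv; key in the agreed YAML is pathovar
      'taxid'      => organism.taxid
  } }.to_yaml
end

Using string keys keeps the output looking like the agreed YAML above rather than a Ruby object dump.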
As an update, the data hierarchy stuff obviously took away focus from the Github feature but I'm back on that now and deep in the weeds working through getting the correct folder structure. I'll let you know when there's something to show.
@danmaclean can you comment on the YAML format two comments up, please?
That's great; is it possible to add created_by from PaperTrail?
Well, the "whodunnit" in PaperTrail indicates who has changed an element, so you'd only get that from someone making a change to an Organism, rather than the initial create. If you want to know who created the Organism (and other models) in the first place then I'll have to add that in separately. Is that a requirement?
I see. I guess just popping the email address of the person onto the organism (and genome and experiment, but not features, because they should be created with the experiment) is a fairly simple thing that just needs a small migration. If so, it'd be good to have on there.
I'm conscious of coming up with too many little things that get in the way of the headline goal, so do let me know if I suggest something that gets in the way.
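The small migration mentioned above could look something like this (the created_by column name and storing the raw email are assumptions):

# sketch: record the creator's email on the three models; Features pick it up from their Experiment
class AddCreatedByToOrganismsGenomesAndExperiments < ActiveRecord::Migration
  def change
    add_column :organisms,   :created_by, :string
    add_column :genomes,     :created_by, :string
    add_column :experiments, :created_by, :string
  end
end

# set at create time in the controller, e.g.:
# @organism.created_by = current_user.email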
@mrship 3. The Feature and Predecessor objects both have to_gff methods that return a GFF string.
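For context, a hand-rolled sketch of what such a to_gff can look like (the column and attribute names here are guesses; the BioRuby-based versions in the gists below are the ones actually proposed):

# sketch: Feature#to_gff building one tab-separated GFF3 line;
# 'extra' lets callers append audit attributes such as created_by
def to_gff(extra = {})
  # column names (seqid, source, feature, start, end, score, strand, phase) are assumed
  attrs = { 'ID' => id }.merge(extra).map { |k, v| "#{k}=#{v}" }.join(';')
  [seqid, source, feature, start, self.end, score, strand, phase, attrs].join("\t") + "\n"
end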
@mrship actually, the Feature#to_gff method rewritten like this https://gist.github.com/danmaclean/5426143 will probably work better, as it uses the proper GFF3 class in BioRuby. The method will allow you to pass extra values such as created_by or whatever as a hash, though it seems like the keys of the hash must be strings; the to_a method appears to be throwing out symbols.
@mrship and this slight elaboration for the Predecessor#to_gff https://gist.github.com/danmaclean/5426171
With PR #31 we have the initial folder structure being exported to a configurable folder (currently /repository). I'll continue to work on that to get the correct YAML and GFF/FASTA files output. However, questions now are on how this folder structure works with Git/GitHub:
Also, with respect to creating the repository folder structure:
I've included SideKiq so we can have a job that runs separately from the main site to create the repository (and perhaps sync to GitHub automatically, subject to Q1-Q3), but let me know your thoughts on when the job should run.
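For what it's worth, the SideKiq side of this can stay very small; a sketch, with the worker name and the call into the export code both made up:

# sketch of a SideKiq worker for the export job
class RepositoryExportWorker
  include Sidekiq::Worker

  def perform
    # writes the organism/genome/experiment folder structure to the
    # configured repository folder (currently /repository)
    Repository.export   # hypothetical entry point; could equally shell out to rake repo:export
  end
end

# queued with RepositoryExportWorker.perform_async, e.g. after an experiment upload
# or on a schedule, depending on when the job should run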
I was wondering if all this could be achieved with Git submodules. I've read about these (and perhaps my reading is out of date: http://git-scm.com/book/en/Git-Tools-Submodules) but never used them. It seems to me that setting the data repo up within the code repo so it can change independently should be possible. Is it imperative that it is either a branch or in the code repo?
Let me know if this answers enough.
@mrship As a side note, I'm not averse to this whole repo-depositing thing being a completely separate server-side script running outside the app. If this gets horrendous I'm happy for the repository dir to go into .gitignore, and to have a script (maybe even running under cron or similar) that just reads that folder and pushes to an arbitrary repo. I'm all for simplicity in this and can see that this might get complicated. Let me know your views on this; I know it seems a cop-out, but I don't think we need to be so stuck on having the app implement this feature, so long as we have a piece of software that does.
For now, I'll implement created_by for Organism, Genome and Experiment, FASTA export and GFF export, and then we can review the best way to store the folder structure. All of the above have pros and cons, not least in time taken to develop!
OK, see #33 for a (corrected) PR to add 'whodunnit' to the YML. I'll look at the FASTA/GFF export next.
FASTA support per #35. GFF support next. Then we can review how to store this folder structure.
See #37 for GFF support. Can you please run rake repo:export on your end to test and let me know if you're happy with the YAML formats that I've put in place? Questions:
A. Do you want the email address of the person making a change to any of the 3 models as well as their name?, i.e
---
experiment:
  Name: TAIR9 GFF
  Description: first one
  Last updated by: Andy Shipman (andy@example.com)
  Last updated on: 23 April 2013
B. Now that we have the folder structure and data output, we need to decide how to share this folder structure. Have you decided if you want to leave it to the user, or automate it with git submodules/another repo? Obviously the simplest approach (for me) is for there to be documentation providing instructions on how to push to a separate git repo rather than automating it, but it's really your decision how to proceed at this point!
@mrship Can't at the moment. rake repo:export gives this (failing because of a reference without an id): https://gist.github.com/danmaclean/5444205
Starting the app to reload data fails with foreman s because of SideKiq, apparently: https://gist.github.com/danmaclean/5444177 Any tips?
But wrt the questions: A. Yes, please. B. I'd need it to be automated, somehow, but I will take your advice on the difficulties of the different approaches. Git submodules sound nice but seem hard. Can we have the app push to a different GitHub repo? If not, a script in a separate process that runs the rake task and pushes to a defined repo will suffice. Any estimates on time?
Ah, I forgot that for SideKiq you need Redis installing. You can either brew install redis or remove the worker line in the Procfile.
Regarding the issue of the missing reference_id, I'm not sure how best to address it. Basically, what happened (prior to PR #36) was that the reference_id didn't get set when a Feature was updated (and therefore a new Feature created), i.e. this was the code:
@feature.reference_id = Reference.find(:first, :conditions => {:genome_id => genome_id, :name => @feature.seqid} ) || old_feature.ref_id #note missing id on the Reference find
whereas now the code is:
@feature.reference_id = Reference.find(:first, :conditions => {:genome_id => genome_id, :name => @feature.seqid} ).id || old_feature.ref_id
So what I could do is run over all Predecessors to find their reference_id and map that back to the Features that got updated. I can create a migration for that if you want, or can we ignore old (corrupted) datasets?
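A rough sketch of that repair, for discussion only; the association between a Feature and its Predecessors and the exact column names are guesses:

# one-off repair: copy reference_id from a feature's predecessor back onto the feature
Feature.where(:reference_id => nil).find_each do |feature|
  predecessor = feature.predecessors.first          # assumed association name
  next unless predecessor && predecessor.reference_id
  feature.update_attribute(:reference_id, predecessor.reference_id)
end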
Back to the other questions: A. Done, see #39. B. I'll look into it more and suggest an automated solution.
@mrship I do use Homebrew and getting Redis installed was fine; like a numpty, though, I had already merged the PR. I won't revert, as I'll break something. Fine to put back in as the app now works locally.
Regarding the broken data, I thought that I'd be able to delete the data through the web interface and re-enter it to get round the issue, but that raised another issue: https://gist.github.com/danmaclean/5444903 . This aside, is there a reason just replacing the data wouldn't work?
It seems my testing of adding a new experiment didn't include a YAML file (i.e. I was using the one attached to the Genome). I'll fix.
As for replacing the data - do you mean locally on your dev box and then re-running the rake repo:export? If so, that's exactly what I did and it works fine thereafter.
OK, for a more automated solution to pushing the data, I suggest that we have the data as a different repository within the code repo. It will, however, be totally separate and .gitignore'd from the main repo (see #40). The reason for this thinking is that git sub-modules are really for code you want in the main repo rather than, in this case, data that we don't want in the repo.
As an example, see the repo I've created under @mrship
Here are the setup steps to create a separate GitHub repo for the data.
mkdir gee_fu_data
cd gee_fu_data
echo .DS_Store >> .gitignore
git add .
git commit -am 'Setup GeeFU data repository'
git remote add origin git@github.com:[your_github_user]/gee_fu_data.git
git pull origin master
rake repo:export
git add .
git commit -am 'Add GeeFU data'
git push origin master
I foresee then that we create a rake task that runs both the creation of the repo with repo:export and then:
git add .
git commit -am 'Update GeeFU data'
git push origin master
That can then be automated with cron as required.
That should be automated enough, IMO. What do you think?
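A sketch of what that rake task could look like (the task name, and the assumption that the data repo lives in repository/ and is already set up as above, are mine):

# lib/tasks/repo.rake - sketch
namespace :repo do
  desc 'Export the data and push it to the separate data repository'
  task :push => 'repo:export' do
    Dir.chdir(Rails.root.join('repository').to_s) do
      system "git add ."
      system "git commit -am 'Update GeeFU data'"
      system "git push origin master"
    end
  end
end

rake repo:push can then go into the crontab (or a whenever schedule) as required.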
As for replacing the data - do you mean locally on your dev box and then re-running the rake repo:export? If so, that's exactly what I did and it works fine thereafter.
Brilliant! I'll get on that, then.
This seems automated enough, yes. How would this be implemented? Through old-style cron, requiring a user to set up a specific cron task, or can this be done with something like whenever (http://rubydoc.info/gems/whenever/0.8.2/frames)? I don't know if this is any good.
See #41 for a fix to uploading a YAML file with an experiment.
I'll look at whenever to automate the git push next.
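If whenever turns out to be suitable, the schedule is just a small Ruby file; a sketch, assuming a nightly run and a repo:push-style task like the one sketched earlier:

# config/schedule.rb - whenever gem sketch
set :output, 'log/cron.log'

every 1.day, :at => '2:00 am' do
  rake 'repo:export'    # rebuild the folder structure
  rake 'repo:push'      # hypothetical task that commits and pushes, as sketched above
end

Running whenever --update-crontab then writes the corresponding cron entries, so it is still cron underneath, just versioned with the app.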
See #43 for an automated solution. It runs a separate process - from the Procfile - that creates a SideKiq job to create the data repo and then push it to GitHub. Obviously it'll only work if you've set up the data git repo.
Of note, this is pretty basic as it just shells out to git commands. If you need it wrapping up so that it can cope without being set up first, then I'll have to review further.
Actually, we do need a check in place as otherwise the data would get added to the main repo! See #44 for a fix.
I believe this is now complete. Re-open if I've missed something.
@danmaclean I've started looking at adding the GitHub feature and realised that I don't have a clear idea of how you want to implement it.
Any other thoughts?
Let me know and I'll review further.