holtgrewe closed this 10 months ago
It's a good idea to split them, though I am not sure whether we need a separate pip package; that may confuse some people.
I think you can just make as many releases as you want - we could maybe label them by date for simplicity.
You are right, one can probably just check out a given tree-ish (git hash, branch, tag) of the tools directory.
As for releases, I guess it makes sense to think about different streams of releases:
One could think of "latest NCBI", but I think having 2-4 releases a year that match VEP is better.
On second thought, maybe just doing a release that matches the latest VEP/Ensembl release would be sufficient. There, attach one file per genome release for each of RefSeq and Ensembl, matching the annotation release used by VEP. Also, attach files that have "as many transcripts as possible". One would thus just track VEP releases, which would make things more predictable.
Another idea is to track the versions of cdot and release as ${VEP_VERSION}+${CDOT_VERSION}.
What do you think?
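To make the `${VEP_VERSION}+${CDOT_VERSION}` idea concrete, here is a minimal Python sketch of how such a combined release tag could be built and split apart again. The function names and the example version numbers are hypothetical and only for illustration; nothing here is existing cdot code.

```python
from typing import NamedTuple


class ReleaseTag(NamedTuple):
    """The two components of a combined release tag (hypothetical helper)."""
    vep_version: str
    cdot_version: str


def make_release_tag(vep_version: str, cdot_version: str) -> str:
    """Join the two versions with '+', e.g. '110' and '0.2.21' -> '110+0.2.21'."""
    return f"{vep_version}+{cdot_version}"


def parse_release_tag(tag: str) -> ReleaseTag:
    """Split a combined tag at the first '+'; reject tags missing either part."""
    vep_version, sep, cdot_version = tag.partition("+")
    if not sep or not vep_version or not cdot_version:
        raise ValueError(f"Not a VEP+cdot release tag: {tag!r}")
    return ReleaseTag(vep_version, cdot_version)


# Example version numbers are made up for illustration.
tag = make_release_tag("110", "0.2.21")
parsed = parse_release_tag(tag)
```

Keeping the VEP version first means release lists would sort by VEP release, with the cdot version distinguishing rebuilds of the same VEP data.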
There are a lot of different concepts mushed together into a "cdot version"
For (2) - I think we should move to a new repo, that would fix the setup issue too in pull request 5 - just have a requirements.txt - we don't need to give it a pip package
For (3) I think we can just do it like you said. I think we should keep the data hosted as releases on this repo (cdot client), even though they are generated by cdot data and will have that version on them - there will be a note saying you can download any data that has the same major/minor version.
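The "same major/minor" rule could be checked roughly like this. It is only a sketch: it assumes versions are plain `major.minor.patch` strings, and the function names are hypothetical rather than part of cdot.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse a 'major.minor.patch' string into three integers."""
    parts = version.split(".")
    if len(parts) != 3:
        raise ValueError(f"Expected 'major.minor.patch', got {version!r}")
    major, minor, patch = (int(part) for part in parts)
    return major, minor, patch


def is_compatible(client_version: str, data_version: str) -> bool:
    """Data is usable by a client when major and minor match; patch may differ."""
    return parse_version(client_version)[:2] == parse_version(data_version)[:2]
```

With this rule, a client at 0.2.5 could load data built by 0.2.1, but not data built by 0.3.0.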
I looked and there is very minor code sharing between client/data
Thinking on it a bit more, splitting repos would lead to people raising issues in wrong places, and the repo is already pretty small in the grand scheme of things.
I will decouple code as much as possible from client/generation
I think we just need a requirements.txt and a separate JSON schema version
We can always split into 2 repos later if we decide it's best.
To summarise:
generate_transcript_data.json_schema_version.JSON_SCHEMA_VERSION
Will see how this goes; we can change it later if we come up with a better way.
It might be worth considering creating a pip-installable cdot package that has the reusable Python code for GFF parsing etc. and would continue to live in one repository. The data builds could then go to a separate directory and one could have two series of releases: one using the identical data to VEP releases and one that aggregates all historical releases.