SACGF / cdot

Transcript versions for HGVS libraries
MIT License

Use Snakemake to build transcripts #70

Open davmlaw opened 8 months ago

davmlaw commented 8 months ago

At the moment we rely on file-existence tests instead of proper dependency management.
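To make the distinction concrete, here is a minimal Python sketch of what a dependency-aware check adds over a plain existence test. It is not code from the cdot repository; `needs_rebuild` is a hypothetical helper, and it only compares modification times, in the style of Make.

```python
from pathlib import Path
from typing import Iterable


def needs_rebuild(target: Path, inputs: Iterable[Path]) -> bool:
    """Return True if target is missing or older than any of its inputs."""
    if not target.exists():
        # This is the only case a plain file-existence test catches.
        return True
    target_mtime = target.stat().st_mtime
    # The step dependency tracking adds: a newer input invalidates the target.
    return any(inp.stat().st_mtime > target_mtime for inp in inputs)
```

With an existence test alone, a re-downloaded input would leave a stale output in place, because the output file already exists.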

davmlaw commented 8 months ago

It would be good to automate uploading releases, as this is pretty tedious. We could do:

gh release create <tag> --title "<release title>" --notes "<release notes>"
gh release upload <tag> <path/to/your/files/*>
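
As a rough illustration of how the two commands above could be scripted, the following Python sketch builds them as argument lists and either prints or runs them. It assumes the `gh` CLI is installed and authenticated; `release_commands` and `publish` are hypothetical helpers, and the tag and file name in the example call are placeholders.

```python
import shlex
import subprocess
from typing import List, Sequence


def release_commands(tag: str, title: str, notes: str,
                     files: Sequence[str]) -> List[List[str]]:
    """Build the two gh invocations quoted above as argument lists."""
    create = ["gh", "release", "create", tag, "--title", title, "--notes", notes]
    upload = ["gh", "release", "upload", tag, *files]
    return [create, upload]


def publish(tag: str, title: str, notes: str, files: Sequence[str],
            dry_run: bool = True) -> None:
    """Print the commands, or run them when dry_run is False."""
    for cmd in release_commands(tag, title, notes, files):
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops before the upload if creating the release fails.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical tag and file name, for illustration only.
    publish("data_vX.Y.Z", "Example release", "Example notes",
            ["example_transcripts.json.gz"])
```

Passing argument lists rather than a shell string avoids quoting problems when the title or notes contain spaces.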

davmlaw commented 7 months ago

Made a script, "generate_transcript_data/github_release_upload.sh", which makes creating a release easier.

davmlaw commented 2 months ago

Looking at the bash scripts, a lot of the complexity comes from looping over URLs and dealing with RefSeq URLs that have identical file names, e.g.:

"https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20190906/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"
"https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20201022/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"
"https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20220307/GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"

So it's not as easy as just downloading a file and carrying on. I think with Snakemake we should explicitly list everything in YAML files, and use that config to run a pipeline common to everything.

We could make the URLs a dictionary, with the "nice name" for each one as its key. That would let us move code into config, which would be a lot nicer.
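
A minimal Python sketch of the dictionary idea, using the three RefSeq URLs quoted above. The key names, the `downloads` directory, and the helpers `remote_basename` and `local_path` are assumptions made for illustration, not the layout actually used in cdot_transcripts.yaml.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

BASE = "https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606"
GFF = "GCF_000001405.25_GRCh37.p13/GCF_000001405.25_GRCh37.p13_genomic.gff.gz"

# Hypothetical "nice names" as keys; the values are the URLs from the thread.
REFSEQ_GRCH37_URLS = {
    "RefSeq_GRCh37_105.20190906": f"{BASE}/105.20190906/{GFF}",
    "RefSeq_GRCh37_105.20201022": f"{BASE}/105.20201022/{GFF}",
    "RefSeq_GRCh37_105.20220307": f"{BASE}/105.20220307/{GFF}",
}


def remote_basename(url: str) -> str:
    """File name as it appears on the server (identical for all three URLs)."""
    return PurePosixPath(urlparse(url).path).name


def local_path(name: str, download_dir: str = "downloads",
               extension: str = ".gff.gz") -> str:
    """Local file name derived from the dictionary key, so it is unique."""
    return f"{download_dir}/{name}{extension}"


if __name__ == "__main__":
    for name, url in REFSEQ_GRCH37_URLS.items():
        print(remote_basename(url), "->", local_path(name))
```

Because the local name comes from the key rather than the URL, the three downloads no longer overwrite each other, and adding a new annotation release becomes a one-line config change.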

davmlaw commented 2 months ago

OK, I have started on this (in generate_transcript_data).

I wanted to run the code with different config files, but couldn't work out a way to do it; Snakemake seems to want only one config file. I therefore combined everything in "config/*.yaml" into "cdot_transcripts.yaml".

I'm having an issue at the moment with ambiguous rules for downloading files.
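
The thread does not say which rules clashed, so the following is only an illustration of how this kind of ambiguity typically arises. Snakemake wildcards match the regular expression `.+` unless constrained, so two output patterns can both claim the same file. The sketch below imitates that matching in plain Python; the rule names, patterns, and target file are hypothetical.

```python
import re
from typing import Dict, List, Optional


def pattern_to_regex(pattern: str,
                     constraints: Optional[Dict[str, str]] = None) -> re.Pattern:
    """Turn an output pattern such as 'downloads/{name}.gz' into a regex."""
    constraints = constraints or {}
    # With one capture group, re.split alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    pieces = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            pieces.append(re.escape(part))
        else:
            # Unconstrained wildcards match '.+', as in Snakemake's default.
            pieces.append(f"(?P<{part}>{constraints.get(part, '.+')})")
    return re.compile("".join(pieces))


def matching_rules(target: str, rules: Dict[str, str],
                   constraints: Optional[Dict[str, str]] = None) -> List[str]:
    """Names of all rules whose output pattern matches the target file."""
    return [name for name, pattern in rules.items()
            if pattern_to_regex(pattern, constraints).fullmatch(target)]


# Hypothetical rules: both can produce a file ending in '.gff.gz'.
RULES = {
    "download_gff": "downloads/{name}.gff.gz",
    "download_any": "downloads/{name}.gz",
}

if __name__ == "__main__":
    target = "downloads/refseq_grch37.gff.gz"
    print(matching_rules(target, RULES))
    print(matching_rules(target, RULES, {"name": r"[^.]+"}))
```

In Snakemake itself, the usual ways out are a `wildcard_constraints` entry (the equivalent of the `[^.]+` constraint here), a `ruleorder` directive, or output patterns that cannot overlap.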

davmlaw commented 2 months ago

@tedil @holtgrewe - I've finished v1 of the Snakemake pipeline. Could you check it out, as it's the first one I've ever written:

https://github.com/SACGF/cdot/blob/main/generate_transcript_data/Snakefile
https://github.com/SACGF/cdot/blob/main/generate_transcript_data/cdot_transcripts.yaml

Happy to hear feedback, e.g. whether I should have structured it differently.

tedil commented 2 months ago

Great, thank you! I will have a look when I am back from vacation

davmlaw commented 2 months ago

Sure, no hurry, enjoy your time off