Suggestions to improve genome download script

StevenWingett commented 1 year ago

Just finished my rescheduled meeting. The format we agreed is:

The existing species_releases folder will have per species folders in it. When you're ready to start downloading you'll move the existing ones into old_data, but leave nextflow there till you've rewritten code.

Inside each species folder, the structure will look like this:

Ensembl
GRCh38
    Release 103
        BED
        FASTA
        GTF
        INDEXES
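
A minimal sketch of how the download script might create this layout. The nesting order (species, then Ensembl, then assembly, then release), the function name `make_release_folders`, and the `Release_103` spelling (underscore rather than space) are all assumptions for illustration, not something agreed in the meeting.

```python
from pathlib import Path

# Sub-folders listed in the agreed layout.
SUBFOLDERS = ("BED", "FASTA", "GTF", "INDEXES")


def make_release_folders(base, species, source, assembly, release):
    """Create <base>/<species>/<source>/<assembly>/Release_<release>/ and its sub-folders.

    The nesting order and the 'Release_<n>' naming are assumptions.
    """
    release_dir = Path(base) / species / source / assembly / f"Release_{release}"
    for name in SUBFOLDERS:
        # parents=True builds the whole chain; exist_ok=True makes re-runs harmless.
        (release_dir / name).mkdir(parents=True, exist_ok=True)
    return release_dir
```

For example, `make_release_folders("species_releases", "Homo_sapiens", "Ensembl", "GRCh38", 103)` would create the four sub-folders under `species_releases/Homo_sapiens/Ensembl/GRCh38/Release_103`.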

In the FASTA folder you will keep the original file names, but also provide the genome file under a standardised name for nextflow. You'll add .genome or similar to it to indicate what this file is. You'll also download the cDNA FASTA to this folder.

For indexes, the possibilities will be Bowtie, Bowtie2, Hisat2, STAR (both the version in the genomics/soft/bin and the nextflow version; folder names should indicate which version they were made with), Hi-CUP, 10X, PARSE. Not all will be made for all species, and perhaps not for all releases? Can your code be release specific?
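
One way the script could be made release specific is a small lookup table keyed on species and release, with folder names built from the aligner and its version. This is only a sketch: the `INDEX_PLAN` table, the function names, the `soft_bin` / `nextflow` labels and every version number below are made-up placeholders, not the versions actually installed.

```python
from pathlib import Path

# Hypothetical plan: which indexes to build for which (species, release).
# Each entry is (aligner, version, origin); origin is optional and only used
# to tell apart two builds of the same aligner. Versions are placeholders.
INDEX_PLAN = {
    ("Homo_sapiens", 103): [
        ("Bowtie2", "2.4.2", None),
        ("STAR", "2.7.3a", "soft_bin"),
        ("STAR", "2.7.10a", "nextflow"),
    ],
    ("Mus_musculus", 103): [
        ("Bowtie2", "2.4.2", None),
    ],
}


def index_folder_name(aligner, version, origin=None):
    """Build a folder name that records the aligner and the version used."""
    parts = [aligner, version]
    if origin:
        parts.append(origin)
    return "_".join(parts)


def planned_index_folders(release_dir, species, release, plan=INDEX_PLAN):
    """Return the INDEXES sub-folders planned for one species and release.

    Species/release pairs missing from the plan get no indexes at all.
    """
    entries = plan.get((species, release), [])
    return [Path(release_dir) / "INDEXES" / index_folder_name(*entry) for entry in entries]
```

With this table, `planned_index_folders(release_dir, "Mus_musculus", 103)` yields a single `Bowtie2_2.4.2` folder, while an unlisted release yields an empty list.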

It would be good to add the Human T2T assembly as an option as well.

StevenWingett commented 1 year ago

Put aligner version in folder name for indices
Split genome / cdna
Genome size file in genome folder
BED - just make empty folder for now
GFF3 file
Describe how it works in documentation and remember version used
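
For the genome size file, a rough sketch of one possible approach: read the genome FASTA and write one tab-separated line per sequence (name, length). The function names and the two-column format are assumptions, and the sketch reads uncompressed FASTA only, so a gzipped download would need decompressing first.

```python
def fasta_sequence_lengths(fasta_path):
    """Return {sequence_name: length} for an uncompressed FASTA file.

    The name is the first whitespace-delimited word of each header line.
    """
    lengths = {}
    name = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                # Sequence line: add its length to the current record.
                lengths[name] += len(line)
    return lengths


def write_genome_size_file(fasta_path, out_path):
    """Write 'name<TAB>length' lines, in FASTA order, and return the lengths."""
    lengths = fasta_sequence_lengths(fasta_path)
    with open(out_path, "w") as out:
        for name, length in lengths.items():
            out.write(f"{name}\t{length}\n")
    return lengths
```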

StevenWingett commented 1 year ago

Let's not duplicate the FASTA files for different releases - maybe make symbolic links to the original downloaded FASTA files
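
A small sketch of the symbolic link idea, assuming the standardised name lives next to (or instead of) a copy and points back at the originally downloaded file. The function name `link_to_original` and any file names used with it are hypothetical.

```python
from pathlib import Path


def link_to_original(original, link_path):
    """Create (or refresh) a symbolic link at link_path pointing at original.

    An existing symlink is replaced; an existing regular file is left alone
    and reported as an error rather than overwritten.
    """
    original = Path(original)
    link_path = Path(link_path)
    if not original.exists():
        raise FileNotFoundError(f"Original file not found: {original}")
    link_path.parent.mkdir(parents=True, exist_ok=True)
    if link_path.is_symlink():
        link_path.unlink()
    elif link_path.exists():
        raise FileExistsError(f"Refusing to replace a real file: {link_path}")
    # Link to the absolute path so the link still works if it is read
    # from a different working directory.
    link_path.symlink_to(original.resolve())
    return link_path
```

The link takes almost no disk space, and tools that open the standardised name read the original download transparently.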

StevenWingett / lmb-nextflow

Suggestions to improve genome download script #4