StevenWingett / lmb-nextflow

Configuration files and other information regarding the Nextflow setup at the LMB

Suggestions to improve genome download script #4

Open StevenWingett opened 1 year ago

StevenWingett commented 1 year ago

Just finished my rescheduled meeting. The format we agreed on is:

The existing species_releases folder will contain one folder per species. When you're ready to start downloading, you'll move the existing ones into old_data, but leave nextflow there until you've rewritten the code.

Inside each species folder, the structure will look like this:

Ensembl
    GRCh38
        Release 103
            BED
            FASTA
            GTF
            INDEXES
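
A minimal sketch of how the layout above could be created, in case it helps when rewriting the code. The function name, the `Homo_sapiens` species folder and the `Release_103` spelling (underscore instead of a space) are assumptions for illustration, not something we agreed.

```python
from pathlib import Path

# Sub-folders agreed for every release.
SUBFOLDERS = ("BED", "FASTA", "GTF", "INDEXES")


def make_release_folders(base, species, source, assembly, release):
    """Create <base>/<species>/<source>/<assembly>/Release_<release>/ and its sub-folders.

    The 'Release_<n>' spelling is an assumption; adjust it to the agreed name.
    """
    release_dir = Path(base) / species / source / assembly / f"Release_{release}"
    for name in SUBFOLDERS:
        # exist_ok=True makes the call safe to repeat for an existing release.
        (release_dir / name).mkdir(parents=True, exist_ok=True)
    return release_dir
```

For example, `make_release_folders("species_releases", "Homo_sapiens", "Ensembl", "GRCh38", 103)` would create the four folders under `species_releases/Homo_sapiens/Ensembl/GRCh38/Release_103`.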

In the FASTA folder you will keep the original file names, but duplicate the genome file under a standard naming scheme for nextflow. You'll add .genome or similar to the name to indicate what this file is. You'll also download the cDNA FASTA to this folder.

The possible indexes are Bowtie, Bowtie2, Hisat2, STAR (both the version in genomics/soft/bin and the nextflow version; folder names should indicate which version they were made with), Hi-CUP, 10X and PARSE. Not all will be made for all species, or perhaps for all releases: can your code be release specific?
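
One way the release-specific part might look, as a rough sketch only. The `INDEX_PLAN` table, the function names and the `<aligner>_v<version>` folder naming are all hypothetical, and the version strings are placeholders rather than the versions actually installed here.

```python
from pathlib import Path

# Hypothetical table: which indexes to build for a given (assembly, release).
# The aligner versions are placeholders, not the versions installed at the LMB.
INDEX_PLAN = {
    ("GRCh38", 103): [("Bowtie2", "2.4.2"), ("STAR", "2.7.9a")],
    ("GRCh38", 104): [("STAR", "2.7.9a")],
}


def index_folder_name(aligner, version):
    """Folder name recording the aligner and the version that built the index."""
    return f"{aligner}_v{version}"


def planned_index_dirs(indexes_dir, assembly, release):
    """Return the index folders to build for one release (empty list if none planned)."""
    plan = INDEX_PLAN.get((assembly, release), [])
    return [Path(indexes_dir) / index_folder_name(a, v) for a, v in plan]
```

Keeping the plan in one table means a release with no entry simply gets no indexes, and the two STAR builds could be listed as separate entries with their own versions.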

It would be good to add the Human T2T assembly as an option as well.

StevenWingett commented 1 year ago

  1. Put aligner version in folder name for indices
  2. Split genome / cDNA
  3. Genome size file in genome folder
  4. BED - just make empty folder for now
  5. GFF3 file
  6. Describe how it works in documentation and remember version used
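
For item 3, a possible sketch of generating the genome size file from the downloaded FASTA. It assumes the file should be tab-separated with one `name<TAB>length` line per sequence (the layout commonly used for chromosome size files); the function names and output file name are made up for illustration.

```python
import gzip


def fasta_lengths(fasta_path):
    """Return {sequence_name: length} for a plain or gzipped FASTA file."""
    opener = gzip.open if str(fasta_path).endswith(".gz") else open
    lengths = {}
    name = None
    with opener(fasta_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                fields = line[1:].split()
                name = fields[0] if fields else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def write_genome_size_file(fasta_path, out_path):
    """Write one 'name<TAB>length' line per sequence, in FASTA order."""
    lengths = fasta_lengths(fasta_path)
    with open(out_path, "w") as out:
        for name, length in lengths.items():
            out.write(f"{name}\t{length}\n")
    return lengths
```

This reads the FASTA line by line, so it should not need to hold a whole genome in memory.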

StevenWingett commented 1 year ago

Let's not duplicate the FASTA files for different releases. Maybe make symbolic links to the original downloaded FASTA files instead.
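
A small sketch of the symbolic link idea, assuming relative links are acceptable so the folder tree can be moved as a whole. The function name and the example file names are hypothetical.

```python
import os
from pathlib import Path


def link_fasta(original, link_path):
    """Create link_path as a relative symbolic link to the original FASTA file."""
    original = Path(original)
    link_path = Path(link_path)
    link_path.parent.mkdir(parents=True, exist_ok=True)
    # Replace a stale link, but never delete a real file by accident.
    if link_path.is_symlink():
        link_path.unlink()
    # A relative target keeps the link valid if the whole tree is relocated.
    target = os.path.relpath(original, start=link_path.parent)
    link_path.symlink_to(target)
    return link_path
```

For example, `link_fasta("FASTA/Homo_sapiens.GRCh38.dna.primary_assembly.fa", "FASTA/Homo_sapiens.GRCh38.genome.fa")` would give nextflow a standard name without copying the file. If a regular file already exists at the link path, `symlink_to` raises `FileExistsError` rather than overwriting it.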