EI-CoreBioinformatics / reat

Robust Eukaryotic Annotation Toolkit
https://reat.readthedocs.io/en/latest/
MIT License

Reat Homology - Runtime, hardware requirement estimation, verbose, log #29

Closed dadrasarmin closed 2 years ago

dadrasarmin commented 2 years ago

Dear developers,

I am trying to run a reat homology job on an HPC. I have two questions:

  1. Could you please provide a rough estimate of the runtime for a given hardware configuration, genome size, and set of inputs (number of protein sequences, genomes, and GFF files)? Of course, if you can share more than one example it would be easier to work out the relationship between input size, hardware, and runtime.
  2. Is there a way to find out which steps of the pipeline have finished at any point in time (while the task is running, or after it finishes or fails due to the time limit)? It would also be useful to know how many hours each step took and what the peak CPU and memory usage was. Is this information available via built-in utilities of reat?

Best, Armin

swarbred commented 2 years ago

Hi @dadrasarmin

On your first point I can give you a bit of guidance based on one recent fish run. As the inputs to the homology pipeline are generally small, the resources needed are not large, and the default compute inputs file I use (shown below) is going to be overkill for most projects.

This was a 1 Gb genome, and the inputs were the GFF and assembly for 9 species (where possible we provide the GFF and genome, and the pipeline extracts the proteins for alignment, as this allows additional metrics to be calculated). There were ~35-60K proteins for each species, ~420K in total.

This is the command used to run reat homology

srun reat \
  --jar_cromwell Inputs/Configs/cromwell.jar \
  --runtime_configuration Inputs/Configs/cromwell_noserver_slurm.conf \
  --workflow_options_file ./options.json \
  --computational_resources compute_inputs.json \
  homology \
  --genome Inputs/Reference/O_niloticus_abbassa_EIv1.0.fasta \
  --annotations_csv sample_inputs.csv \
  --annotation_filters aa_len exon_len \
  --alignment_species oreonilo \
  --filter_max_intron 200000 \
  --filter_min_exon 10 \
  --alignment_filters aa_len internal_stop intron_len exon_len splicing \
  --alignment_min_coverage 90 \
  --junction_f1_filter 40 \
  --mikado_config config_reat_hom_mammalian.yaml \
  --mikado_scoring scoring_reat_hom_mammalian_alt1.yaml \
  --junctions Inputs/Homology/portcullis.pass.merged.bed \
  --utrs Inputs/Homology/mikado_all.loci.run2.gff3
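The options.json passed to --workflow_options_file is a standard Cromwell workflow options file. Its actual contents are not shown here; as a minimal sketch, the keys below are standard Cromwell options and the values are placeholders rather than what was used in this run:

{
  "final_workflow_outputs_dir": "reat_homology_outputs",
  "final_workflow_log_dir": "reat_homology_logs",
  "use_relative_output_paths": true
}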

Below is my default compute inputs file

cat compute_inputs.json
{
  "ei_homology.index_attr": {
    "cpu_cores": 16
  },
  "ei_homology.aln_attr": {
    "cpu_cores": 24,
    "max_retries": 2
  },
  "ei_homology.score_attr": {
    "cpu_cores": 16,
    "mem_gb": 120
  },
  "ei_homology.mikado_attr": {
    "cpu_cores": 24,
    "mem_gb": 80
  }
}
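For a smaller project these values can be reduced considerably. As a purely hypothetical illustration (the numbers below are assumptions, not tested recommendations), something like this should still be comfortable given the per-step peak memory figures listed further down:

{
  "ei_homology.index_attr": { "cpu_cores": 8 },
  "ei_homology.aln_attr": { "cpu_cores": 16, "max_retries": 2 },
  "ei_homology.score_attr": { "cpu_cores": 8, "mem_gb": 32 },
  "ei_homology.mikado_attr": { "cpu_cores": 16, "mem_gb": 32 }
}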

For the steps I’ve just added the memory actually used:

call-IndexGenome – 4GB
call-PrepareAnnotations – 224MB
call-AlignProteins – 3.5GB
call-PrepareAlignments – 316MB
call-ScoreAlignments – 6.5GB
call-CombineResults – 1.3GB
call-ScoreSummary – 1.5MB
call-CombineXspecies – 1.3GB
call-Mikado – 3.2GB
call-MikadoPick – 10GB
call-MikadoSummaryStats

I have attached the Mikado config and scoring files used in this run; note the scoring uses the extra metrics that the pipeline provides as attributes in the GFF files it creates (e.g. attributes.avg_jf1). For species with smaller introns / more compact genomes, the intron/UTR metrics should be updated.

scoring_reat_hom_mammalian_alt1.yaml.txt config_reat_hom_mammalian.yaml.txt
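To illustrate what "the intron/UTR metrics should be updated" means in practice, here is a minimal sketch in the standard Mikado scoring-file syntax; max_intron_length, five_utr_length and three_utr_length are regular Mikado metrics, but the thresholds below are placeholder values for a compact genome, not the ones used in the attached files:

scoring:
  # penalise models whose longest intron exceeds a genome-appropriate limit
  max_intron_length:
    rescaling: min
    filter:
      operator: le
      value: 20000
  # prefer UTR lengths typical of a compact genome
  five_utr_length:
    rescaling: target
    value: 150
    filter:
      operator: le
      value: 1500
  three_utr_length:
    rescaling: target
    value: 300
    filter:
      operator: le
      value: 2500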

swarbred commented 2 years ago

On point 2, we are not doing anything to gather info on compute resources; you would need to query your job scheduler to see memory usage. The output log should provide something useful to help pinpoint issues if errors occur.
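For example, since the run above is submitted through SLURM, something along these lines would report elapsed time and peak memory per job, and listing the rc (return code) files that Cromwell writes shows which calls have completed. The paths and workflow name here are assumptions based on the default Cromwell execution layout, not something reat provides itself:

# per-call wall time and peak memory from the SLURM accounting database
# (<jobid> is the SLURM job ID of the step you want to inspect)
sacct -j <jobid> --format=JobID,JobName%30,Elapsed,MaxRSS,TotalCPU,State

# which pipeline calls have finished: Cromwell writes an "rc" file in each
# call's execution directory once the task completes
find cromwell-executions/ei_homology -name rc | xargs grep -H .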

swarbred commented 2 years ago

Example Mikado config and scoring files for a reat homology run with a more compact genome / smaller intron size:

scoring_reat_hom_small_intron.yaml.txt config_reat_hom_small_intron.yaml.txt

dadrasarmin commented 2 years ago

Hi @swarbred,

Thank you very much for your detailed answer and for providing the scoring and config files. They helped me to fix the problem of jobs running endlessly on our machine.

Best regards, Armin