Hi @dadrasarmin
On your first point I can give you a bit of guidance based on one recent fish run. As the inputs to the homology pipeline are generally small, the resource requirements are not large, and the default compute inputs file I use (shown below) will be overkill for most projects.
This was a 1 Gb genome, and the inputs were the GFFs and assemblies for 9 species (where possible we provide the GFF and genome, and the pipeline extracts the proteins for alignment, as this allows additional metrics to be calculated). There were ~35-60K proteins per species, ~420K in total.
This is the command used to run reat homology:

srun reat --jar_cromwell Inputs/Configs/cromwell.jar \
    --runtime_configuration Inputs/Configs/cromwell_noserver_slurm.conf \
    --workflow_options_file ./options.json \
    --computational_resources compute_inputs.json \
    homology \
    --genome Inputs/Reference/O_niloticus_abbassa_EIv1.0.fasta \
    --annotations_csv sample_inputs.csv \
    --annotation_filters aa_len exon_len \
    --alignment_species oreonilo \
    --filter_max_intron 200000 \
    --filter_min_exon 10 \
    --alignment_filters aa_len internal_stop intron_len exon_len splicing \
    --alignment_min_coverage 90 \
    --junction_f1_filter 40 \
    --mikado_config config_reat_hom_mammalian.yaml \
    --mikado_scoring scoring_reat_hom_mammalian_alt1.yaml \
    --junctions Inputs/Homology/portcullis.pass.merged.bed \
    --utrs Inputs/Homology/mikado_all.loci.run2.gff3
Below is my default compute inputs file:

cat compute_inputs.json
{
    "ei_homology.index_attr": { "cpu_cores": 16 },
    "ei_homology.aln_attr": { "cpu_cores": 24, "max_retries": 2 },
    "ei_homology.score_attr": { "cpu_cores": 16, "mem_gb": 120 },
    "ei_homology.mikado_attr": { "cpu_cores": 24, "mem_gb": 80 }
}
For the individual steps, the memory actually used was:

- call-IndexGenome – 4GB
- call-PrepareAnnotations – 224MB
- call-AlignProteins – 3.5GB
- call-PrepareAlignments – 316MB
- call-ScoreAlignments – 6.5GB
- call-CombineResults – 1.3GB
- call-ScoreSummary – 1.5MB
- call-CombineXspecies – 1.3GB
- call-Mikado – 3.2GB
- call-MikadoPick – 10GB
- call-MikadoSummaryStats
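Given those peak figures, a trimmed-down compute inputs file would be plenty for a project of this size. This is only an illustrative sketch: the cpu_cores and mem_gb values below are my guesses at comfortable headroom over the usage listed above, not settings taken from the run.

```json
{
  "ei_homology.index_attr": { "cpu_cores": 8 },
  "ei_homology.aln_attr": { "cpu_cores": 16, "max_retries": 2 },
  "ei_homology.score_attr": { "cpu_cores": 8, "mem_gb": 16 },
  "ei_homology.mikado_attr": { "cpu_cores": 16, "mem_gb": 32 }
}
```

If a task is then killed for exceeding memory, raising only that task's mem_gb is usually enough.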
I have attached the Mikado config and scoring files used in this run; note the scoring has the extra metrics provided as attributes in the GFF files that the pipeline creates (e.g. attributes.avg_jf1). For species with smaller introns / more compact genomes, the intron/UTR metrics should be updated (a rough sketch of that kind of change follows the attached file names below).
scoring_reat_hom_mammalian_alt1.yaml.txt config_reat_hom_mammalian.yaml.txt
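As a rough illustration of the kind of change meant for a compact genome, the fragment below follows the standard Mikado scoring layout; the metric names are standard Mikado metrics plus the attributes.avg_jf1 metric mentioned above, but the thresholds and targets are assumptions for this example, not values copied from the attached files.

```yaml
scoring:
  # assumed intron ceiling for a compact genome; the mammalian scoring allows much larger introns
  max_intron_length:
    rescaling: min
    filter: {operator: le, value: 20000}
  # assumed shorter UTR target than in the mammalian scoring
  five_utr_length:
    rescaling: target
    value: 100
    filter: {operator: le, value: 1500}
  # extra metric the homology pipeline writes as a GFF attribute
  attributes.avg_jf1:
    rescaling: max
```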
On point 2, we are not doing anything to gather info on compute resources; you would need to query your job scheduler to see memory usage. The output log should provide something useful to help pinpoint issues if errors occur.
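For example, on a SLURM cluster (which the cromwell_noserver_slurm.conf above suggests you are running on) something along these lines reports peak memory per completed task; the job ID is just a placeholder.

```bash
# Peak memory (MaxRSS), runtime and state for a finished job and its steps;
# replace 1234567 with the job ID SLURM reported for the Cromwell-submitted task.
sacct -j 1234567 --format=JobID,JobName%30,Elapsed,MaxRSS,State

# If the seff utility is installed at your site, it gives a one-page summary:
seff 1234567
```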
Example Mikado config and scoring files for a reat homology run with a more compact genome / smaller intron size:
scoring_reat_hom_small_intron.yaml.txt config_reat_hom_small_intron.yaml.txt
Hi @swarbred,
Thank you very much for your detailed answer and for providing the scoring and config files. They helped me to fix the problem of jobs running endlessly on our machine.
Best regards, Armin
Dear developers,
I am trying to run a reat homology job on an HPC. I have two questions:
Best, Armin