Open dadrasarmin opened 2 years ago
@dadrasarmin
Thanks for the feedback the documentation has lagged behind the development and there are a few holes that need filling
ABINITIO_PREDICTION GlimmerHMM 1
ABINITIO_PREDICTION SNAP 1
ABINITIO_PREDICTION AUGUSTUS_RUN_ABINITIO 1
ABINITIO_PREDICTION CodingQuarry_v2.0 1
PROTEIN hq_protein 4
PROTEIN lq_protein 1
TRANSCRIPT hq_assembly 4
TRANSCRIPT lq_assembly 1
OTHER_PREDICTION AUGUSTUS_RUN1 10
OTHER_PREDICTION AUGUSTUS_RUN2 10
OTHER_PREDICTION AUGUSTUS_RUN3 10
OTHER_PREDICTION homology_models 8
OTHER_PREDICTION transcriptome_models 10
--augustus_runs
input files, these files are named augustus_run#
where #
corresponds to the position of the file in the command-line argument list of run files."
is describing how the output files are named underoutputs/GenePredictors/Augustus/
augustus_abinitio.gff augustus_run1.gff augustus_run2.gff augustus_run3.gff
As described in https://github.com/EI-CoreBioinformatics/reat/blob/3852eee72e8aa2dd51c77e50b907745d849a54ad/docs/modules/prediction/index.rst#L26
They need to be referenced in the evm weights files as AUGUSTUS_RUN#
When providing these via --augustus_runs
they can be named as desired by the user e.g. mine are named
--augustus_runs Inputs/Prediction/run1.augp Inputs/Prediction/run2.augp Inputs/Prediction/run3.augp
the order provided will determine which run corresponds to AUGUSTUS_RUN1 AUGUSTUS_RUN2 etc
Please let us know of any other issues
@swarbred Thanks for your quick reply. At the moment, I am stuck at "call-CombineEVM" step. The problem is related to this line https://github.com/EI-CoreBioinformatics/reat/blob/be8c679aa42948607c153765edfa10fc951ff928/annotation/prediction_module/main.wdl#L560
/home/armin/projects/meso_genome_annotation/prediction_workflow/cromwell-executions/ei_prediction/ef913ab4-c6e5-405a-ad66-cfcc309a8146/call-CombineEVM/execution/script: line 24: /usr/bin/cat: Argument list too long
I guess the reason behind this is that my genome has lots of scaffold and contigs. I am not sure whether I can solve the problem by replacing this line with:
$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}' | xargs cat >> evm.out.gff3
Do you have any suggestions?
@dadrasarmin
You are hitting the ARG_MAX limit, I was aware that we would likely need to change this (call-JoinAugustus is also a cat).
Perhaps the easiest way is to use xargs
$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}') | xargs -r cat > evm.out.gff3
That should work but we will make the change ASAP, we are very likely going to get the same error on the wheat annotation we are running.
@dadrasarmin I see you edited your reply to the same solution :-)
@swarbred Yes, I tried a few different commands, and most did not work. However, the one I wrote (without -r) and you suggested (with -r) worked without any problem. Thanks a lot.
Best regards, Armin
One other thing that I forgot to mention. For transcriptome and homology workflow, "Configurable computational resources available" suggested on the documentation work completely fine. However, the one suggested for the prediction workflow causes an error and the program immediately stops working. Also, there is one computing resource that is necessary to run the prediction workflow, and it was not mentioned in the documentation that it is obligatory.(ei_prediction.augustus_resources). I changed the resource usage by changing the scripts instead of using --computational_resources
for prediction workflow.
https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/prediction_module/main.wdl#L55
As a comparison we can see the difference between homology:
https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/homology_module/main.wdl#L25-L28
The transcriptome workflow:
https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/transcriptome_module/subworkflows/mikado/wf_mikado.wdl#L28-L35
And Prediction workflow:
https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/prediction_module/main.wdl#L568
I guess the problem could be solved by substituting RuntimeAttr? resources
with subprocesses names like RuntimeAttr? ExecuteEVMCommand.resources
@dadrasarmin
For info here is an example compute_inputs.json for prediction
We will check what is given in the documentation
looks correct to me as is
The doc shows
"ei_prediction.Augustus.resources": " {
cpu_cores -> Int
max_retries -> Int?
boot_disk_gb -> Int?
queue -> String?
disk_gb -> Int?
constraints -> String?
mem_gb -> Float?
preemptible_tries -> Int?
}? (optional)",
and this should be ei_prediction.augustus_resources
as shown below
{
"ei_prediction.AlignProteins.resources": {
"cpu_cores": 30,
"mem_gb": 64,
"constraints": "avx"
},
"ei_prediction.IndexProteinsDatabase.resources": {
"cpu_cores": 8,
"mem_gb": 8,
"constraints": "avx"
},
"ei_prediction.SelfBlastFilter.resources": {
"cpu_cores": 8,
"mem_gb": 16,
"constraints": "avx"
},
"ei_prediction.MikadoPick.resources": {
"cpu_cores": 30,
"mem_gb": 120
},
"ei_prediction.Mikado.resources": {
"cpu_cores": 30,
"mem_gb": 120
},
"ei_prediction.LengthChecker.resources": {
"cpu_cores": 1,
"mem_gb": 96
},
"ei_prediction.ExecuteEVMCommand.resources": {
"cpu_cores": 1,
"max_retries": 3,
"mem_gb": 24
},
"ei_prediction.augustus_resources": {
"cpu_cores": 1,
"max_retries": 3,
"mem_gb": 24
},
"ei_prediction.OptimiseAugustus.resources": {
"cpu_cores": 1,
"mem_gb": 24
}
}
@swarbred
Besides the ei_prediction.Augustus.resources
, there is another problem. If I substitute the incorrect one ei_prediction.Augustus.resources
with the correct one ei_prediction.augustus_resources
, the program still crashes. If I use the modified version from the documents:
{
"ei_prediction.AlignProteins.resources": {
"cpu_cores": 30,
"mem_gb": 64,
"constraints": "avx"
},
"ei_prediction.IndexProteinsDatabase.resources": {
"cpu_cores": 8,
"mem_gb": 8,
"constraints": "avx"
},
"ei_prediction.SelfBlastFilter.resources": {
"cpu_cores": 8,
"mem_gb": 16,
"constraints": "avx"
},
"ei_prediction.MikadoPick.resources": {
"cpu_cores": 30,
"mem_gb": 120
},
"ei_prediction.Mikado.resources": {
"cpu_cores": 30,
"mem_gb": 120
},
"ei_prediction.LengthChecker.resources": {
"cpu_cores": 1,
"mem_gb": 96
},
"ei_prediction.ExecuteEVMCommand.resources": {
"cpu_cores": 1,
"max_retries": 3,
"mem_gb": 24
},
"ei_prediction.augustus_resources": {
"cpu_cores": 1,
"max_retries": 3,
"mem_gb": 24
},
"ei_prediction.AugustusAbinitio.resources": {
"cpu_cores": 1,
"max_retries": 3,
"mem_gb": 24
},
"ei_prediction.OptimiseAugustus.resources": {
"cpu_cores": 1,
"mem_gb": 24
}
}
If I take a look at run_details.json
, I can see:
"run_details.json" [noeol] 47L, 81375B 1,1 Top
"status": "Failed",
"failures": [{
"message": "Workflow input processing failed",
"causedBy": [{
"causedBy": [],
"message": "WARNING: Unexpected input provided: ei_prediction.AugustusAbinitio.resources (expected inputs: [ei_prediction.evm_extra_params, ei_prediction.base_training.resources, ei_prediction.homology_models, ei_prediction.extrinsic_config, ei_prediction.HQ_assembly, ei_prediction.extra_training_models, ei_prediction.SelectAugustusTestAndTrain.min_train_models, ei_prediction.AugustusAbinitio.bronze_models, ei_prediction.train_after_optimise.resources, ei_prediction.secondstrand_exon_hints, ei_prediction.reference_genome, ei_prediction.LengthChecker.query_start_scoring_distance, ei_prediction.Mikado.min_cdna_length, ei_prediction.mikado_scoring, ei_prediction.LengthChecker.min_pct_cds_fraction, ei_prediction.AlignProteins.resources, ei_prediction.do_augustus, ei_prediction.LengthChecker.min_target_coverage_scoring_percentage, ei_prediction.SelfBlastFilter.resources, ei_prediction.LengthChecker.max_tp_utr, ei_prediction.LengthChecker.target_end_hard_filter_distance, ei_prediction.EVM_weights, ei_prediction.LengthChecker.evalue_filter, ei_prediction.augustus_runs, ei_prediction.SelfBlastFilter.identity, ei_prediction.do_glimmer, ei_prediction.SNAP.resources, ei_prediction.Mikado.resources, ei_prediction.LengthChecker.max_tp_utr_complete, ei_prediction.LengthChecker.max_single_gap_score, ei_prediction.LQ_protein_alignments, ei_prediction.SelfBlastFilter.coverage, ei_prediction.GlimmerHMM.resources, ei_prediction.LengthChecker.target_end_scoring_distance, ei_prediction.Mikado.max_intron_length, ei_prediction.snap_training, ei_prediction.AugustusAbinitio.intron_hints, ei_prediction.codon_table, ei_prediction.chunk_size, ei_prediction.augustus_resources, ei_prediction.LengthChecker.min_target_coverage_score, ei_prediction.CodingQuarry.fresh_prediction, ei_prediction.AugustusAbinitio.all_models, ei_prediction.CodingQuarryFresh.resources, ei_prediction.firststrand_exon_hints, ei_prediction.AugustusAbinitio.expressed_exon_hints, ei_prediction.flank, ei_prediction.MikadoPick.resources, ei_prediction.LengthChecker.min_query_coverage_score, ei_prediction.intron_hints, ei_prediction.SelectAugustusTestAndTrain.force, ei_prediction.AugustusAbinitio.lq_protein_alignment_models, ei_prediction.LengthChecker.target_start_hard_filter_distance, ei_prediction.LengthChecker.target_end_score, ei_prediction.AugustusAbinitio.lq_assembly_models, ei_prediction.transcriptome_models, ei_prediction.SelfBlastFilter.top_n, ei_prediction.LengthChecker.query_end_hard_filter_distance, ei_prediction.do_codingquarry, ei_prediction.glimmer_training, ei_prediction.AugustusAbinitio.hints_source_and_priority, ei_prediction.IndexProteinsDatabase.resources, ei_prediction.AugustusAbinitio.silver_models, ei_prediction.species, ei_prediction.MikadoPick.extra_config, ei_prediction.unstranded_exon_hints, ei_prediction.AugustusAbinitio.homology_models, ei_prediction.protein_validation_database, ei_prediction.snap_extra_params, ei_prediction.LQ_assembly, ei_prediction.LengthChecker.query_start_hard_filter_distance, ei_prediction.LengthChecker.max_single_gap_hard_filter, ei_prediction.SelectAugustusTestAndTrain.target_mono_exonic_percentage, ei_prediction.LengthChecker.resources, ei_prediction.LengthChecker.min_fp_utr, ei_prediction.augustus_config_path, ei_prediction.LengthChecker.max_single_gap_scoring_length, ei_prediction.LengthChecker.target_start_score, ei_prediction.kfold, ei_prediction.ExecuteEVMCommand.resources, ei_prediction.LengthChecker.max_fp_utr, ei_prediction.optimise_augustus, ei_prediction.LengthChecker.query_start_score, ei_prediction.CodingQuarry.resources, ei_prediction.AugustusAbinitio.hq_assembly_models, ei_prediction.AugustusAbinitio.gold_models, ei_prediction.mikado_utr_files, ei_prediction.LengthChecker.query_end_scoring_distance, ei_prediction.overlap_size, ei_prediction.LengthChecker.max_fp_utr_complete, ei_prediction.codingquarry_extra_params, ei_prediction.codingquarry_training, ei_prediction.repeats_gff, ei_prediction.AugustusAbinitio.hq_protein_alignment_models, ei_prediction.force_train, ei_prediction.OptimiseAugustus.resources, ei_prediction.SelectAugustusTestAndTrain.max_test_models, ei_prediction.LengthChecker.query_end_score, ei_prediction.HQ_protein_alignments, ei_prediction.mikado_config, ei_prediction.do_snap, ei_prediction.LengthChecker.min_tp_utr, ei_prediction.LengthChecker.min_target_coverage_hard_filter, ei_prediction.LengthChecker.min_query_coverage_scoring_percentage, ei_prediction.glimmer_extra_params, ei_prediction.LengthChecker.min_query_coverage_hard_filter, ei_prediction.SelectAugustusTestAndTrain.max_train_models, ei_prediction.LengthChecker.target_start_scoring_distance, ei_prediction.augustus_extra_params])"
}]
Therefore, "ei_prediction.AugustusAbinitio.resources"
should be removed from the documentation.
Hi @dadrasarmin
I appreciate your detailed checks.
We are doing to necessary changes as we speak and carrying out the tests before merging the changes.
The ei_prediction.AugustusAbinitio.resources
is failing because it should have been ei_prediction.AugustusAbinitio.augustus_resources
Hi @gemygk,
As I said in my first comment, I really appreciate what you have built and the results I got recently are excellent. That is why I investigated the codes in more detail and found a few suggestions.
Thanks for sharing the correct way of giving the AugustusAbinitio computing resources. I gave up last week and changed the default values in the code instead of using --computational_resources
, which is not a good practice in the long term.
Thanks @dadrasarmin . It is the result of hard work from @ljyanesm and the team.
Please do let us know if you have any issues. In the meantime, we will try to get these changes into the repo.
Hi reat developers,
First, I want to thank you for developing this great tool. I participated in your workshop at EI, and I am very motivated to use reat. I had some struggles during the last few days to run the prediction workflow and I have a four suggestions for the documentation file.
I thought I could pass one of the portcullis output to
--introns
, however, my run failed multiple times. After some investigation, I noticed I had to provide the gff3 file instead of the bed file. I think it would be nice to mention this in the document. https://github.com/EI-CoreBioinformatics/reat/blob/be8c679aa42948607c153765edfa10fc951ff928/annotation/prediction_module/main.wdl#L1617On the documentation link, in section the "Evidence Modeler default weights file", we can find an example like:
When I looked at
call-EVM/execute/Scafold_name/evm.out.log
, I noticed thatThere are a few typos here, probably some variable names changed during the development and I am not sure what to do with homology and transcriptome models, therefore, I put them under OTHER_PREDICTIONS tag for the moment:
hq_protein_alignment -> hq_protein lq_protein_alignment -> lq_protein lq_asssembly -> lq_assembly TRANSCRIPT homology_models 10 -> OTHER_PREDICTIONS homology_models 10 TRANSCRIPT transcriptome_models 10 -> OTHER_PREDICTIONS transcriptome_models 10
Strange warning/error for,
$EVM_HOME/EvmUtils/write_EVM_commands.pl
: For this tool, I think the EVM code is somehow broken. Because if I run just$EVM_HOME/EvmUtils/write_EVM_commands.pl
, I can see that there is an option-S
. However, if I run$EVM_HOME/EvmUtils/write_EVM_commands.pl -S
, I get:Unknown option: S
I just change the -S option from this line of the code and everything worked smoothly and without any warning/error. https://github.com/EI-CoreBioinformatics/reat/blob/be8c679aa42948607c153765edfa10fc951ff928/annotation/prediction_module/main.wdl#L1504In the "Configuring Augustus runs" section: I found this line confusing. https://github.com/EI-CoreBioinformatics/reat/blob/3852eee72e8aa2dd51c77e50b907745d849a54ad/docs/modules/prediction/index.rst#L48 In the last section of the same document, it is mentioned that we have to name our AUGUSTUS+hints predictions like AUGUSTUS_RUN1. However, here it is written with lower case letters (augustus_run#). In my case, the job only successfully proceeds if I use the capital letter in the EVM weights, as the file names, and the same way for calling reat with
--augustus_runs AUGUSTUS_RUN1
.Best regards, Armin