Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/

dadrasarmin commented 2 years ago

Hi reat developers,

First, I want to thank you for developing this great tool. I participated in your workshop at EI, and I am very motivated to use reat. I had some struggles during the last few days to run the prediction workflow and I have a four suggestions for the documentation file.

I thought I could pass one of the portcullis output to --introns, however, my run failed multiple times. After some investigation, I noticed I had to provide the gff3 file instead of the bed file. I think it would be nice to mention this in the document. https://github.com/EI-CoreBioinformatics/reat/blob/be8c679aa42948607c153765edfa10fc951ff928/annotation/prediction_module/main.wdl#L1617

On the documentation link, in section the "Evidence Modeler default weights file", we can find an example like:

ABINITIO_PREDICTION     GlimmerHMM      1
ABINITIO_PREDICTION     SNAP    1
ABINITIO_PREDICTION     CodingQuarry_v2.0       1
ABINITIO_PREDICTION AUGUSTUS_RUN_ABINITIO 1
PROTEIN hq_protein_alignment    4
PROTEIN lq_protein_alignment    1
TRANSCRIPT      hq_assembly     4
TRANSCRIPT      lq_asssembly    1
TRANSCRIPT      homology_models 10
TRANSCRIPT      transcriptome_models    10
OTHER_PREDICTION        AUGUSTUS_RUN1   10
OTHER_PREDICTION        AUGUSTUS_RUN2   10
OTHER_PREDICTION        AUGUSTUS_RUN3   10

When I looked at call-EVM/execute/Scafold_name/evm.out.log, I noticed that

WARNING: not considering ev_type: lq_assembly since not included in weights file
WARNING: not considering ev_type: hq_protein since not included in weights file
WARNING, skipping predType: homology_models, since not specified in weights file.
WARNING, skipping predType: transcriptome_models, since not specified in weights file

There are a few typos here, probably some variable names changed during the development and I am not sure what to do with homology and transcriptome models, therefore, I put them under OTHER_PREDICTIONS tag for the moment:

hq_protein_alignment -> hq_protein lq_protein_alignment -> lq_protein lq_asssembly -> lq_assembly TRANSCRIPT homology_models 10 -> OTHER_PREDICTIONS homology_models 10 TRANSCRIPT transcriptome_models 10 -> OTHER_PREDICTIONS transcriptome_models 10

Strange warning/error for, $EVM_HOME/EvmUtils/write_EVM_commands.pl: For this tool, I think the EVM code is somehow broken. Because if I run just $EVM_HOME/EvmUtils/write_EVM_commands.pl, I can see that there is an option -S. However, if I run $EVM_HOME/EvmUtils/write_EVM_commands.pl -S, I get: Unknown option: S I just change the -S option from this line of the code and everything worked smoothly and without any warning/error. https://github.com/EI-CoreBioinformatics/reat/blob/be8c679aa42948607c153765edfa10fc951ff928/annotation/prediction_module/main.wdl#L1504
In the "Configuring Augustus runs" section: I found this line confusing. https://github.com/EI-CoreBioinformatics/reat/blob/3852eee72e8aa2dd51c77e50b907745d849a54ad/docs/modules/prediction/index.rst#L48 In the last section of the same document, it is mentioned that we have to name our AUGUSTUS+hints predictions like AUGUSTUS_RUN1. However, here it is written with lower case letters (augustus_run#). In my case, the job only successfully proceeds if I use the capital letter in the EVM weights, as the file names, and the same way for calling reat with --augustus_runs AUGUSTUS_RUN1.

Best regards, Armin

swarbred commented 2 years ago

@dadrasarmin

Thanks for the feedback the documentation has lagged behind the development and there are a few holes that need filling

Yes this is the intron style GFF file output along with the bed file in reat transcriptome, we will update the documentation
Yes the example weights file is not correct and we will correct, an example file would be

ABINITIO_PREDICTION GlimmerHMM  1
ABINITIO_PREDICTION SNAP    1
ABINITIO_PREDICTION AUGUSTUS_RUN_ABINITIO   1
ABINITIO_PREDICTION CodingQuarry_v2.0   1
PROTEIN hq_protein  4
PROTEIN lq_protein  1
TRANSCRIPT hq_assembly  4
TRANSCRIPT lq_assembly  1
OTHER_PREDICTION    AUGUSTUS_RUN1   10
OTHER_PREDICTION    AUGUSTUS_RUN2   10
OTHER_PREDICTION    AUGUSTUS_RUN3   10
OTHER_PREDICTION    homology_models 8
OTHER_PREDICTION    transcriptome_models    10

Yes this looks to be an EVM issue, it shouldn't cause an issue in reat but we will remove the -S option
This
"The output directory will contain a file of predictions corresponding to each :code:--augustus_runs input files, these files are named augustus_run# where # corresponds to the position of the file in the command-line argument list of run files." is describing how the output files are named under

outputs/GenePredictors/Augustus/
augustus_abinitio.gff  augustus_run1.gff  augustus_run2.gff  augustus_run3.gff

As described in https://github.com/EI-CoreBioinformatics/reat/blob/3852eee72e8aa2dd51c77e50b907745d849a54ad/docs/modules/prediction/index.rst#L26

They need to be referenced in the evm weights files as AUGUSTUS_RUN#

When providing these via --augustus_runs they can be named as desired by the user e.g. mine are named

--augustus_runs Inputs/Prediction/run1.augp Inputs/Prediction/run2.augp Inputs/Prediction/run3.augp

the order provided will determine which run corresponds to AUGUSTUS_RUN1 AUGUSTUS_RUN2 etc

Please let us know of any other issues

dadrasarmin commented 2 years ago

@swarbred Thanks for your quick reply. At the moment, I am stuck at "call-CombineEVM" step. The problem is related to this line https://github.com/EI-CoreBioinformatics/reat/blob/be8c679aa42948607c153765edfa10fc951ff928/annotation/prediction_module/main.wdl#L560

/home/armin/projects/meso_genome_annotation/prediction_workflow/cromwell-executions/ei_prediction/ef913ab4-c6e5-405a-ad66-cfcc309a8146/call-CombineEVM/execution/script: line 24: /usr/bin/cat: Argument list too long

I guess the reason behind this is that my genome has lots of scaffold and contigs. I am not sure whether I can solve the problem by replacing this line with:

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}' | xargs cat >> evm.out.gff3

Do you have any suggestions?

swarbred commented 2 years ago

@dadrasarmin

You are hitting the ARG_MAX limit, I was aware that we would likely need to change this (call-JoinAugustus is also a cat).

Perhaps the easiest way is to use xargs

$EVM_HOME/EvmUtils/convert_EVM_outputs_to_GFF3.pl  --partitions ~{partitions} --output evm.out --genome ~{genome} | awk -F',' '{printf $2"/evm.out.gff3"} END{print ""}') | xargs -r cat > evm.out.gff3

That should work but we will make the change ASAP, we are very likely going to get the same error on the wheat annotation we are running.

swarbred commented 2 years ago

@dadrasarmin I see you edited your reply to the same solution :-)

swarbred commented 2 years ago

2nd location to update https://github.com/EI-CoreBioinformatics/reat/blob/8ffc62b0e3caa52c751f6fdeba9f78d01ff20127/annotation/prediction_module/augustus.wdl#L411

dadrasarmin commented 2 years ago

@swarbred Yes, I tried a few different commands, and most did not work. However, the one I wrote (without -r) and you suggested (with -r) worked without any problem. Thanks a lot.

Best regards, Armin

dadrasarmin commented 2 years ago

One other thing that I forgot to mention. For transcriptome and homology workflow, "Configurable computational resources available" suggested on the documentation work completely fine. However, the one suggested for the prediction workflow causes an error and the program immediately stops working. Also, there is one computing resource that is necessary to run the prediction workflow, and it was not mentioned in the documentation that it is obligatory.(ei_prediction.augustus_resources). I changed the resource usage by changing the scripts instead of using --computational_resources for prediction workflow. https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/prediction_module/main.wdl#L55

As a comparison we can see the difference between homology: https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/homology_module/main.wdl#L25-L28 The transcriptome workflow: https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/transcriptome_module/subworkflows/mikado/wf_mikado.wdl#L28-L35 And Prediction workflow: https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/prediction_module/main.wdl#L568 I guess the problem could be solved by substituting RuntimeAttr? resources with subprocesses names like RuntimeAttr? ExecuteEVMCommand.resources

swarbred commented 2 years ago

@dadrasarmin

For info here is an example compute_inputs.json for prediction

We will check what is given in the documentation

The line https://github.com/EI-CoreBioinformatics/reat/blob/c8c680f75fa09e781abdac1b8bbd894fef36edc2/annotation/prediction_module/main.wdl#L568

looks correct to me as is

The doc shows

"ei_prediction.Augustus.resources": " {
               cpu_cores -> Int
              max_retries -> Int?
              boot_disk_gb -> Int?
              queue -> String?
              disk_gb -> Int?
              constraints -> String?
              mem_gb -> Float?
              preemptible_tries -> Int?
              }? (optional)",

and this should be ei_prediction.augustus_resources as shown below

{
    "ei_prediction.AlignProteins.resources": {
        "cpu_cores": 30,
        "mem_gb": 64,
        "constraints": "avx"
    },
    "ei_prediction.IndexProteinsDatabase.resources": {
        "cpu_cores": 8,
        "mem_gb": 8,
        "constraints": "avx"
    },
    "ei_prediction.SelfBlastFilter.resources": {
        "cpu_cores": 8,
        "mem_gb": 16,
        "constraints": "avx"
    },
    "ei_prediction.MikadoPick.resources": {
        "cpu_cores": 30,
        "mem_gb": 120
    },
    "ei_prediction.Mikado.resources": {
        "cpu_cores": 30,
        "mem_gb": 120
    },
    "ei_prediction.LengthChecker.resources": {
        "cpu_cores": 1,
        "mem_gb": 96
    },
    "ei_prediction.ExecuteEVMCommand.resources": {
        "cpu_cores": 1,
                "max_retries": 3,
        "mem_gb": 24
    },
    "ei_prediction.augustus_resources": {
        "cpu_cores": 1,
        "max_retries": 3,
        "mem_gb": 24
    },
    "ei_prediction.OptimiseAugustus.resources": {
        "cpu_cores": 1,
        "mem_gb": 24
    }
}

dadrasarmin commented 2 years ago

@swarbred

Besides the ei_prediction.Augustus.resources, there is another problem. If I substitute the incorrect one ei_prediction.Augustus.resources with the correct one ei_prediction.augustus_resources, the program still crashes. If I use the modified version from the documents:

{
    "ei_prediction.AlignProteins.resources": {
        "cpu_cores": 30,
        "mem_gb": 64,
        "constraints": "avx"
    },
    "ei_prediction.IndexProteinsDatabase.resources": {
        "cpu_cores": 8,
        "mem_gb": 8,
        "constraints": "avx"
    },
    "ei_prediction.SelfBlastFilter.resources": {
        "cpu_cores": 8,
        "mem_gb": 16,
        "constraints": "avx"
    },
    "ei_prediction.MikadoPick.resources": {
        "cpu_cores": 30,
        "mem_gb": 120
    },
    "ei_prediction.Mikado.resources": {
        "cpu_cores": 30,
        "mem_gb": 120
    },
    "ei_prediction.LengthChecker.resources": {
        "cpu_cores": 1,
        "mem_gb": 96
    },
    "ei_prediction.ExecuteEVMCommand.resources": {
        "cpu_cores": 1,
                "max_retries": 3,
        "mem_gb": 24
    },
    "ei_prediction.augustus_resources": {
        "cpu_cores": 1,
        "max_retries": 3,
        "mem_gb": 24
    },
    "ei_prediction.AugustusAbinitio.resources": {
        "cpu_cores": 1,
        "max_retries": 3,
        "mem_gb": 24
    },
    "ei_prediction.OptimiseAugustus.resources": {
        "cpu_cores": 1,
        "mem_gb": 24
    }
}

If I take a look at run_details.json, I can see:

"run_details.json" [noeol] 47L, 81375B                                                                                       1,1           Top
  "status": "Failed",
  "failures": [{
    "message": "Workflow input processing failed",
    "causedBy": [{
      "causedBy": [],
      "message": "WARNING: Unexpected input provided: ei_prediction.AugustusAbinitio.resources (expected inputs: [ei_prediction.evm_extra_params, ei_prediction.base_training.resources, ei_prediction.homology_models, ei_prediction.extrinsic_config, ei_prediction.HQ_assembly, ei_prediction.extra_training_models, ei_prediction.SelectAugustusTestAndTrain.min_train_models, ei_prediction.AugustusAbinitio.bronze_models, ei_prediction.train_after_optimise.resources, ei_prediction.secondstrand_exon_hints, ei_prediction.reference_genome, ei_prediction.LengthChecker.query_start_scoring_distance, ei_prediction.Mikado.min_cdna_length, ei_prediction.mikado_scoring, ei_prediction.LengthChecker.min_pct_cds_fraction, ei_prediction.AlignProteins.resources, ei_prediction.do_augustus, ei_prediction.LengthChecker.min_target_coverage_scoring_percentage, ei_prediction.SelfBlastFilter.resources, ei_prediction.LengthChecker.max_tp_utr, ei_prediction.LengthChecker.target_end_hard_filter_distance, ei_prediction.EVM_weights, ei_prediction.LengthChecker.evalue_filter, ei_prediction.augustus_runs, ei_prediction.SelfBlastFilter.identity, ei_prediction.do_glimmer, ei_prediction.SNAP.resources, ei_prediction.Mikado.resources, ei_prediction.LengthChecker.max_tp_utr_complete, ei_prediction.LengthChecker.max_single_gap_score, ei_prediction.LQ_protein_alignments, ei_prediction.SelfBlastFilter.coverage, ei_prediction.GlimmerHMM.resources, ei_prediction.LengthChecker.target_end_scoring_distance, ei_prediction.Mikado.max_intron_length, ei_prediction.snap_training, ei_prediction.AugustusAbinitio.intron_hints, ei_prediction.codon_table, ei_prediction.chunk_size, ei_prediction.augustus_resources, ei_prediction.LengthChecker.min_target_coverage_score, ei_prediction.CodingQuarry.fresh_prediction, ei_prediction.AugustusAbinitio.all_models, ei_prediction.CodingQuarryFresh.resources, ei_prediction.firststrand_exon_hints, ei_prediction.AugustusAbinitio.expressed_exon_hints, ei_prediction.flank, ei_prediction.MikadoPick.resources, ei_prediction.LengthChecker.min_query_coverage_score, ei_prediction.intron_hints, ei_prediction.SelectAugustusTestAndTrain.force, ei_prediction.AugustusAbinitio.lq_protein_alignment_models, ei_prediction.LengthChecker.target_start_hard_filter_distance, ei_prediction.LengthChecker.target_end_score, ei_prediction.AugustusAbinitio.lq_assembly_models, ei_prediction.transcriptome_models, ei_prediction.SelfBlastFilter.top_n, ei_prediction.LengthChecker.query_end_hard_filter_distance, ei_prediction.do_codingquarry, ei_prediction.glimmer_training, ei_prediction.AugustusAbinitio.hints_source_and_priority, ei_prediction.IndexProteinsDatabase.resources, ei_prediction.AugustusAbinitio.silver_models, ei_prediction.species, ei_prediction.MikadoPick.extra_config, ei_prediction.unstranded_exon_hints, ei_prediction.AugustusAbinitio.homology_models, ei_prediction.protein_validation_database, ei_prediction.snap_extra_params, ei_prediction.LQ_assembly, ei_prediction.LengthChecker.query_start_hard_filter_distance, ei_prediction.LengthChecker.max_single_gap_hard_filter, ei_prediction.SelectAugustusTestAndTrain.target_mono_exonic_percentage, ei_prediction.LengthChecker.resources, ei_prediction.LengthChecker.min_fp_utr, ei_prediction.augustus_config_path, ei_prediction.LengthChecker.max_single_gap_scoring_length, ei_prediction.LengthChecker.target_start_score, ei_prediction.kfold, ei_prediction.ExecuteEVMCommand.resources, ei_prediction.LengthChecker.max_fp_utr, ei_prediction.optimise_augustus, ei_prediction.LengthChecker.query_start_score, ei_prediction.CodingQuarry.resources, ei_prediction.AugustusAbinitio.hq_assembly_models, ei_prediction.AugustusAbinitio.gold_models, ei_prediction.mikado_utr_files, ei_prediction.LengthChecker.query_end_scoring_distance, ei_prediction.overlap_size, ei_prediction.LengthChecker.max_fp_utr_complete, ei_prediction.codingquarry_extra_params, ei_prediction.codingquarry_training, ei_prediction.repeats_gff, ei_prediction.AugustusAbinitio.hq_protein_alignment_models, ei_prediction.force_train, ei_prediction.OptimiseAugustus.resources, ei_prediction.SelectAugustusTestAndTrain.max_test_models, ei_prediction.LengthChecker.query_end_score, ei_prediction.HQ_protein_alignments, ei_prediction.mikado_config, ei_prediction.do_snap, ei_prediction.LengthChecker.min_tp_utr, ei_prediction.LengthChecker.min_target_coverage_hard_filter, ei_prediction.LengthChecker.min_query_coverage_scoring_percentage, ei_prediction.glimmer_extra_params, ei_prediction.LengthChecker.min_query_coverage_hard_filter, ei_prediction.SelectAugustusTestAndTrain.max_train_models, ei_prediction.LengthChecker.target_start_scoring_distance, ei_prediction.augustus_extra_params])"
    }]

Therefore, "ei_prediction.AugustusAbinitio.resources" should be removed from the documentation.

gemygk commented 2 years ago

Hi @dadrasarmin

I appreciate your detailed checks.

We are doing to necessary changes as we speak and carrying out the tests before merging the changes.

The ei_prediction.AugustusAbinitio.resources is failing because it should have been ei_prediction.AugustusAbinitio.augustus_resources

dadrasarmin commented 2 years ago

Hi @gemygk,

As I said in my first comment, I really appreciate what you have built and the results I got recently are excellent. That is why I investigated the codes in more detail and found a few suggestions. Thanks for sharing the correct way of giving the AugustusAbinitio computing resources. I gave up last week and changed the default values in the code instead of using --computational_resources, which is not a good practice in the long term.

gemygk commented 2 years ago

Thanks @dadrasarmin . It is the result of hard work from @ljyanesm and the team.

Please do let us know if you have any issues. In the meantime, we will try to get these changes into the repo.

EI-CoreBioinformatics / reat

Suggestions for improving prediction workflow documentation on https://reat.readthedocs.io/ #35