“Time Limit” error when running FunctionalAnnotation.nf on vertebrate scale genome

fmsangdehi commented 3 years ago

Hi, I would like to run Functional Annotation pipeline (FunctionalAnnotation.nf) on a genome of about 726 Mb. I tested this pipeline on one chromosome and it worked correctly, but when trying to run it on the whole genome, I run into the following error message:

Error executing process > 'functional_annotation:merge_functional_annotation (1)'

Caused by:
  Process `functional_annotation:merge_functional_annotation (1)` terminated for an unknown reason -- Likely it has been terminated by the external system

Command executed:

  agat_sp_manage_functional_annotation.pl -f Clupea_harengus.Ch_v2.0.2.104.gff3 \
      -b blast_merged.tsv -i interproscan_merged.tsv \
      -db uniprot_reviewed_yes.fasta -id NBIS \
      -pe 5 \
      -o Clupea_harengus.Ch_v2.0.2.104_plus-functional-annotation.gff

Command exit status:
  -

Command output:
  (empty)

Command wrapper:
  nxf-scratch-dir r33:/scratch/20329500/nxf.1ILctwRc1G
  WARNING: Skipping mount /var/singularity/mnt/session/etc/resolv.conf [files]: /etc/resolv.conf doesn't exist in container
  Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
  [08:55:39] 06/05/2021
  usage: /usr/local/bin/agat_sp_manage_functional_annotation.pl -f Clupea_harengus.Ch_v2.0.2.104.gff3 -b blast_merged.tsv -i interproscan_merged.tsv -db uniprot_reviewed_yes.fasta -id NBIS -pe 5 -o Clupea_harengus.Ch_v2.0.2.104_plus-fun
  ->IDs are changed using <NBIS> as prefix.
  In the case of discontinuous features (i.e. a single feature that exists over multiple genomic locations) the same ID may appear on multiple lines. All lines that share an ID collectively represent a signle feature.

  ********************************************************************************
  *                              - Start parsing -                               *
  ********************************************************************************
  -------------------------- parse options and metadata --------------------------
  => Accessing the feature level json files
  slurmstepd: error: *** JOB 20329500 ON r33 CANCELLED AT 2021-06-05T14:55:50 DUE TO TIME LIMIT ***

Could you please help me know what config needs to be changed to increase the run time required for the process? I appreciate your attention to this issue.

mahesh-panchal commented 3 years ago

Hi,

If tasks are cancelled due to time limit, then you need to overwrite the workflow default parameters using a custom configuration file.

nextflow run -c custom.config -profile uppmax <script>

where custom.config:

process {
    withName: 'merge_functional_annotation' {
        time = '1h'
    }
}

The workflow defaults are here: https://github.com/NBISweden/pipelines-nextflow/blob/master/FunctionalAnnotation/config/compute_resources.config

I see here there is no default time for the process, so you've found a bug for us. Thank you.

I'll update the code. When the code fix is merged, please pull the new version of the workflow. It should work then.

mahesh-panchal commented 3 years ago

@fmsangdehi can you pull the new version of the workflow and see if it works for you please?

fmsangdehi commented 3 years ago

Thank you very much for your reply. I will try the updated version of the workflow and will let you know if it works.

fmsangdehi commented 3 years ago

Hi,

I tried the new version of FunctionalAnnotation.nf pipeline with a configuration file according to your recommendation, but it failed.

The error message was the same as the one above, except that the last line (Time Limit error) didn’t appear this time!

It still works properly when I run it on one chromosome.

I appreciate your time and help.

mahesh-panchal commented 3 years ago

Can you copy the .nextflow.log here please?

fmsangdehi commented 3 years ago

I can't comment it here because it it too long! Can I send it by email?

mahesh-panchal commented 3 years ago

I looked at the log, but I'm not sure where the issue is.

Did you add this to your config?

process {
    withName: 'merge_functional_annotation' {
        time = '1h'
    }
}

If so, can you replace it with this please (or add it if you did not)? The aim is to increase memory, by increasing cpu allocation.

process {
    withName: 'merge_functional_annotation' {
        cpus = 4
    }
}

Maybe that will make it run through.

fmsangdehi commented 3 years ago

Thanks! I added the first piece of code to the config file in previous run. I will try the second one now.

fmsangdehi commented 3 years ago

I changed the config file according to your recommendation (increasing cpu) and obtained the same “Time limit” error again. However, I managed to run the pipeline chromosome by chromosome and that worked fine.

The pipeline was particularly helpful in identifying the genes that were left uncharacterized on Ensembl. Thank you for developing this pipeline!

I am now facing some questions regarding the gene names that it provides:

In the output gff file, some genes are assigned more than one name (which are different versions of the same gene). I was wondering if I can get a confidence value or other metrics out of the pipeline to assess how highly the predictions are supported, and to choose the best name per gene. I also noticed that the rank of mRNA isoforms in the gff output file is different from their rank on Ensembl gff, and that the order of gene names under Name tag on column nine corresponds to the order of transcript isoforms. What are the mRNA isoforms ranked based on?

Here is a simplified example of a gene with four mRNA isoforms and two gene names. This gene has already been given a name on Ensembl:

Ensembl .gff file

1   ensembl gene    18512182    18559642    .   +   .   ID=gene:ENSCHAG00000018396;Name=kcnc3a;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl
1   ensembl mRNA    18512182    18542868    .   +   .   ID=transcript:ENSCHAT00000042524;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-204;biotype=protein_coding;transcript_id=ENSCHAT00000042524
1   ensembl mRNA    18512182    18552795    .   +   .   ID=transcript:ENSCHAT00000042503;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-203;biotype=protein_coding;transcript_id=ENSCHAT00000042503
1   ensembl mRNA    18512182    18559642    .   +   .   ID=transcript:ENSCHAT00000042492;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-201;biotype=protein_coding;transcript_id=ENSCHAT00000042492
1   ensembl mRNA    18512182    18559642    .   +   .   ID=transcript:ENSCHAT00000042501;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-202;biotype=protein_coding;transcript_id=ENSCHAT00000042501

FunctionalAnnotation pipeline .gff file

1   ensembl gene    18512182    18559642    .   +   .   ID=NBISG00000000651;Name=kcnc1_kcnc3;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl;makerName=gene:ENSCHAG00000018396;version=1
1   ensembl mRNA    18512182    18542868    .   +   .   ID=NBISM00000001625;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso3;biotype=protein_coding;makerName=transcript:ENSCHAT00000042524;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042524
1   ensembl mRNA    18512182    18552795    .   +   .   ID=NBISM00000001626;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso2;biotype=protein_coding;makerName=transcript:ENSCHAT00000042503;product=Potassium voltage-gated channel subfamily C member 3;transcript_id=ENSCHAT00000042503
1   ensembl mRNA    18512182    18559642    .   +   .   ID=NBISM00000001627;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso1;biotype=protein_coding;makerName=transcript:ENSCHAT00000042492;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042492
1   ensembl mRNA    18512182    18559642    .   +   .   ID=NBISM00000001628;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso4;biotype=protein_coding;makerName=transcript:ENSCHAT00000042501;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042501

If I understand it correctly, the pipeline BLAST the sequences in the input gff file against uniprot protein database to find gene name. Is there any possibility to find out what species the gene name comes from?

I appreciate any help and suggestion.

LucileSol commented 3 years ago

Hi!

Please see my answers below :

I changed the config file according to your recommendation (increasing cpu) and obtained the same “Time limit” error again. However, I managed to run the pipeline chromosome by chromosome and that worked fine.

The pipeline was particularly helpful in identifying the genes that were left uncharacterized on Ensembl. Thank you for developing this pipeline!

I am now facing some questions regarding the gene names that it provides:
1. In the output gff file, some genes are assigned more than one name (which are different versions of the same gene). >I was wondering if I can get a confidence value or other metrics out of the pipeline to assess how highly the predictions >are supported, and to choose the best name per gene. 

We do not really have a confidence value, one way you can have a look at which gene name to choose it to look at the output of blast, in the results output you will have a folder called blast_tsv, in it the blast results file called blast_merged.tsv. You will have all the blast results for your mRNA, for instance :

maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15   sp|F4JH46|DTX34_ARATH   73.210  377     82      1       1       358     146     522     0.0     561
maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15   sp|F4JTB3|DTX35_ARATH   61.316  380     128     1       1       361     92      471     2.85e-151       438
maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15   sp|Q9SX83|DTX33_ARATH   51.706  381     164     2       1       361     93      473     5.31e-127       376

here you can select the hit that you think fit best your gene depending on percentage identity, lenght,evalue... (here is the column names : qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) You could also select the name of the longest isoform but I think looking at the blast will be more acurate.

I also noticed that the rank of mRNA isoforms in the gff output file is different from their rank on Ensembl gff, and that the order of gene names under Name tag on column nine corresponds to the order of transcript isoforms. What are the mRNA isoforms ranked based on?

I am not sure I understand your question, can you give a precise example or elaborate.

Here is a simplified example of a gene with four mRNA isoforms and two gene names. This gene has already been given a name on Ensembl:

Ensembl .gff file

1 ensembl gene    18512182    18559642    .   +   .   ID=gene:ENSCHAG00000018396;Name=kcnc3a;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl
1 ensembl mRNA    18512182    18542868    .   +   .   ID=transcript:ENSCHAT00000042524;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-204;biotype=protein_coding;transcript_id=ENSCHAT00000042524
1 ensembl mRNA    18512182    18552795    .   +   .   ID=transcript:ENSCHAT00000042503;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-203;biotype=protein_coding;transcript_id=ENSCHAT00000042503
1 ensembl mRNA    18512182    18559642    .   +   .   ID=transcript:ENSCHAT00000042492;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-201;biotype=protein_coding;transcript_id=ENSCHAT00000042492
1 ensembl mRNA    18512182    18559642    .   +   .   ID=transcript:ENSCHAT00000042501;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-202;biotype=protein_coding;transcript_id=ENSCHAT00000042501

FunctionalAnnotation pipeline .gff file

1 ensembl gene    18512182    18559642    .   +   .   ID=NBISG00000000651;Name=kcnc1_kcnc3;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl;makerName=gene:ENSCHAG00000018396;version=1
1 ensembl mRNA    18512182    18542868    .   +   .   ID=NBISM00000001625;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso3;biotype=protein_coding;makerName=transcript:ENSCHAT00000042524;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042524
1 ensembl mRNA    18512182    18552795    .   +   .   ID=NBISM00000001626;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso2;biotype=protein_coding;makerName=transcript:ENSCHAT00000042503;product=Potassium voltage-gated channel subfamily C member 3;transcript_id=ENSCHAT00000042503
1 ensembl mRNA    18512182    18559642    .   +   .   ID=NBISM00000001627;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso1;biotype=protein_coding;makerName=transcript:ENSCHAT00000042492;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042492
1 ensembl mRNA    18512182    18559642    .   +   .   ID=NBISM00000001628;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso4;biotype=protein_coding;makerName=transcript:ENSCHAT00000042501;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042501

1. If I understand it correctly, the pipeline BLAST the sequences in the input gff file against uniprot protein database to find gene name. Is there any possibility to find out what species the gene name comes from?

Yes as before you can have a look at the blast output :

maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15   sp|F4JH46|DTX34_**ARATH**   73.210  377     82      1       1       358     146     522     0.0     561

I appreciate any help and suggestion.

fmsangdehi commented 3 years ago

Thank you so much! This is exactly what I was looking for!

NBISweden / pipelines-nextflow

“Time Limit” error when running FunctionalAnnotation.nf on vertebrate scale genome #58