Closed fmsangdehi closed 2 years ago
Hi,
If tasks are cancelled due to time limit, then you need to overwrite the workflow default parameters using a custom configuration file.
nextflow run -c custom.config -profile uppmax <script>
where custom.config
:
process {
withName: 'merge_functional_annotation' {
time = '1h'
}
}
The workflow defaults are here: https://github.com/NBISweden/pipelines-nextflow/blob/master/FunctionalAnnotation/config/compute_resources.config
I see here there is no default time for the process, so you've found a bug for us. Thank you.
I'll update the code. When the code fix is merged, please pull the new version of the workflow. It should work then.
@fmsangdehi can you pull the new version of the workflow and see if it works for you please?
Thank you very much for your reply. I will try the updated version of the workflow and will let you know if it works.
Hi,
I tried the new version of FunctionalAnnotation.nf
pipeline with a configuration file according to your recommendation, but it failed.
The error message was the same as the one above, except that the last line (Time Limit error) didn’t appear this time!
It still works properly when I run it on one chromosome.
I appreciate your time and help.
Can you copy the .nextflow.log
here please?
I can't comment it here because it it too long! Can I send it by email?
I looked at the log, but I'm not sure where the issue is.
Did you add this to your config?
process {
withName: 'merge_functional_annotation' {
time = '1h'
}
}
If so, can you replace it with this please (or add it if you did not)? The aim is to increase memory, by increasing cpu allocation.
process {
withName: 'merge_functional_annotation' {
cpus = 4
}
}
Maybe that will make it run through.
Thanks! I added the first piece of code to the config file in previous run. I will try the second one now.
I changed the config file according to your recommendation (increasing cpu) and obtained the same “Time limit” error again. However, I managed to run the pipeline chromosome by chromosome and that worked fine.
The pipeline was particularly helpful in identifying the genes that were left uncharacterized on Ensembl. Thank you for developing this pipeline!
I am now facing some questions regarding the gene names that it provides:
Here is a simplified example of a gene with four mRNA isoforms and two gene names. This gene has already been given a name on Ensembl:
Ensembl .gff file
1 ensembl gene 18512182 18559642 . + . ID=gene:ENSCHAG00000018396;Name=kcnc3a;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl
1 ensembl mRNA 18512182 18542868 . + . ID=transcript:ENSCHAT00000042524;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-204;biotype=protein_coding;transcript_id=ENSCHAT00000042524
1 ensembl mRNA 18512182 18552795 . + . ID=transcript:ENSCHAT00000042503;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-203;biotype=protein_coding;transcript_id=ENSCHAT00000042503
1 ensembl mRNA 18512182 18559642 . + . ID=transcript:ENSCHAT00000042492;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-201;biotype=protein_coding;transcript_id=ENSCHAT00000042492
1 ensembl mRNA 18512182 18559642 . + . ID=transcript:ENSCHAT00000042501;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-202;biotype=protein_coding;transcript_id=ENSCHAT00000042501
FunctionalAnnotation pipeline .gff file
1 ensembl gene 18512182 18559642 . + . ID=NBISG00000000651;Name=kcnc1_kcnc3;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl;makerName=gene:ENSCHAG00000018396;version=1
1 ensembl mRNA 18512182 18542868 . + . ID=NBISM00000001625;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso3;biotype=protein_coding;makerName=transcript:ENSCHAT00000042524;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042524
1 ensembl mRNA 18512182 18552795 . + . ID=NBISM00000001626;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso2;biotype=protein_coding;makerName=transcript:ENSCHAT00000042503;product=Potassium voltage-gated channel subfamily C member 3;transcript_id=ENSCHAT00000042503
1 ensembl mRNA 18512182 18559642 . + . ID=NBISM00000001627;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso1;biotype=protein_coding;makerName=transcript:ENSCHAT00000042492;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042492
1 ensembl mRNA 18512182 18559642 . + . ID=NBISM00000001628;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso4;biotype=protein_coding;makerName=transcript:ENSCHAT00000042501;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042501
I appreciate any help and suggestion.
Hi!
Please see my answers below :
I changed the config file according to your recommendation (increasing cpu) and obtained the same “Time limit” error again. However, I managed to run the pipeline chromosome by chromosome and that worked fine.
The pipeline was particularly helpful in identifying the genes that were left uncharacterized on Ensembl. Thank you for developing this pipeline!
I am now facing some questions regarding the gene names that it provides:
1. In the output gff file, some genes are assigned more than one name (which are different versions of the same gene). >I was wondering if I can get a confidence value or other metrics out of the pipeline to assess how highly the predictions >are supported, and to choose the best name per gene.
We do not really have a confidence value, one way you can have a look at which gene name to choose it to look at the output of blast, in the results output you will have a folder called blast_tsv, in it the blast results file called blast_merged.tsv. You will have all the blast results for your mRNA, for instance :
maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15 sp|F4JH46|DTX34_ARATH 73.210 377 82 1 1 358 146 522 0.0 561
maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15 sp|F4JTB3|DTX35_ARATH 61.316 380 128 1 1 361 92 471 2.85e-151 438
maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15 sp|Q9SX83|DTX33_ARATH 51.706 381 164 2 1 361 93 473 5.31e-127 376
here you can select the hit that you think fit best your gene depending on percentage identity, lenght,evalue... (here is the column names : qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore) You could also select the name of the longest isoform but I think looking at the blast will be more acurate.
I also noticed that the rank of mRNA isoforms in the gff output file is different from their rank on Ensembl gff, and that the order of gene names under Name tag on column nine corresponds to the order of transcript isoforms. What are the mRNA isoforms ranked based on?
I am not sure I understand your question, can you give a precise example or elaborate.
Here is a simplified example of a gene with four mRNA isoforms and two gene names. This gene has already been given a name on Ensembl:
Ensembl .gff file
1 ensembl gene 18512182 18559642 . + . ID=gene:ENSCHAG00000018396;Name=kcnc3a;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl 1 ensembl mRNA 18512182 18542868 . + . ID=transcript:ENSCHAT00000042524;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-204;biotype=protein_coding;transcript_id=ENSCHAT00000042524 1 ensembl mRNA 18512182 18552795 . + . ID=transcript:ENSCHAT00000042503;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-203;biotype=protein_coding;transcript_id=ENSCHAT00000042503 1 ensembl mRNA 18512182 18559642 . + . ID=transcript:ENSCHAT00000042492;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-201;biotype=protein_coding;transcript_id=ENSCHAT00000042492 1 ensembl mRNA 18512182 18559642 . + . ID=transcript:ENSCHAT00000042501;Parent=gene:ENSCHAG00000018396;Name=kcnc3a-202;biotype=protein_coding;transcript_id=ENSCHAT00000042501
FunctionalAnnotation pipeline .gff file
1 ensembl gene 18512182 18559642 . + . ID=NBISG00000000651;Name=kcnc1_kcnc3;biotype=protein_coding;description=potassium voltage-gated channel%2C Shaw-related subfamily%2C member 3a [Source:ZFIN%3BAcc:ZDB-GENE-100901-1];gene_id=ENSCHAG00000018396;logic_name=ensembl;makerName=gene:ENSCHAG00000018396;version=1 1 ensembl mRNA 18512182 18542868 . + . ID=NBISM00000001625;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso3;biotype=protein_coding;makerName=transcript:ENSCHAT00000042524;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042524 1 ensembl mRNA 18512182 18552795 . + . ID=NBISM00000001626;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso2;biotype=protein_coding;makerName=transcript:ENSCHAT00000042503;product=Potassium voltage-gated channel subfamily C member 3;transcript_id=ENSCHAT00000042503 1 ensembl mRNA 18512182 18559642 . + . ID=NBISM00000001627;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso1;biotype=protein_coding;makerName=transcript:ENSCHAT00000042492;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042492 1 ensembl mRNA 18512182 18559642 . + . ID=NBISM00000001628;Parent=NBISG00000000651;Name=kcnc1_kcnc3_iso4;biotype=protein_coding;makerName=transcript:ENSCHAT00000042501;product=Potassium voltage-gated channel subfamily C member 1;transcript_id=ENSCHAT00000042501
1. If I understand it correctly, the pipeline BLAST the sequences in the input gff file against uniprot protein database to find gene name. Is there any possibility to find out what species the gene name comes from?
Yes as before you can have a look at the blast output :
maker-LG7-exonerate_protein2genome-gene-24.13-mRNA-15 sp|F4JH46|DTX34_**ARATH** 73.210 377 82 1 1 358 146 522 0.0 561
I appreciate any help and suggestion.
Thank you so much! This is exactly what I was looking for!
Hi, I would like to run Functional Annotation pipeline
(FunctionalAnnotation.nf)
on a genome of about 726 Mb. I tested this pipeline on one chromosome and it worked correctly, but when trying to run it on the whole genome, I run into the following error message:Could you please help me know what config needs to be changed to increase the run time required for the process? I appreciate your attention to this issue.