NBISweden / AGAT

Another Gtf/Gff Analysis Toolkit https://nbisweden.github.io/AGAT/
GNU General Public License v3.0

AGAT extracted only a very small set of CDS from the *.gff3 and assembly file. Is there any problem with the pipeline? #501

Closed. Vijithkumar2020 closed this issue 1 week ago

Vijithkumar2020 commented 2 months ago

AGAT was run in Docker, using the following biocontainer from quay.io: 1.4.1--pl5321hdfd78af_0. AUGUSTUS predicted ~49,500 coding genes, but AGAT extracted only ~15,000.

The program was run as follows:

```
sudo docker run -v /media/e3349969-3452-4c3a-9b3f-d3931278e4a5:/data \
  quay.io/biocontainers/agat:0.8.0--pl5262hdfd78af_0 \
  agat_sp_extract_sequences.pl \
  -g /data/AUGUSTUS_cds_zea_mays_ref/combined_output.gff3 \
  -f /data/contig_out_file.fasta.masked \
  -o /data/AUGUSTUS_cds_zea_mays_ref/cds_maize-ref/cds.fasta \
  -t cds
```

The output failed to extract all sequences:

```
WARNING: Problem ! The size of the extracted sequence 305 is different than the specified span: 341.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=SVA1_S1_L008_001_contig_737055 ; seq_id_correct=SVA1_S1_L008_001_contig_737055 ; start=1199 ; end=1539 ; SVA1_S1_L008_001_contig_737055 sequence length: 1503 )
WARNING: Problem ! The size of the extracted sequence 487 is different than the specified span: 537.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=SVA1_S1_L008_001_contig_234706 ; seq_id_correct=SVA1_S1_L008_001_contig_234706 ; start=226 ; end=762 ; SVA1_S1_L008_001_contig_234706 sequence length: 712 )
WARNING: Problem ! The size of the extracted sequence 165 is different than the specified span: 240.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=SVA1_S1_L008_001_contig_741242 ; seq_id_correct=SVA1_S1_L008_001_contig_741242 ; start=114 ; end=353 ; SVA1_S1_L008_001_contig_741242 sequence length: 278 )
WARNING: Problem ! The size of the extracted sequence 430 is different than the specified span: 457.
That often occurs when the fasta file does not correspond to the annotation file. Or the index file comes from another fasta file which had the same name and haven't been removed.
As last possibility your gff contains location errors (Already encountered for a Maker annotation)
Supplement information: seq_id=SVA1_S1_L008_001_contig_235638 ; seq_id_correct=SVA1_S1_L008_001_contig_235638 ; start=78 ; end=534 ; SVA1_S1_L008_001_contig_235638 sequence length: 507 )
```

Juke34 commented 2 months ago

Manually check the reported cases to see whether the location in the GFF really falls outside the length of the corresponding sequence in the FASTA file.
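
The numbers printed in the four warnings are already enough for a quick sanity check. The following is a minimal sketch, assuming GFF coordinates are 1-based and inclusive and that extraction simply stops at the end of the contig (an assumption about AGAT's behavior, not something confirmed in this thread). The helper names are hypothetical.

```python
# Values copied from the "Supplement information" lines of the warnings:
# (seq_id, start, end, sequence length, extracted size reported by AGAT)
REPORTED = [
    ("SVA1_S1_L008_001_contig_737055", 1199, 1539, 1503, 305),
    ("SVA1_S1_L008_001_contig_234706", 226, 762, 712, 487),
    ("SVA1_S1_L008_001_contig_741242", 114, 353, 278, 165),
    ("SVA1_S1_L008_001_contig_235638", 78, 534, 507, 430),
]


def feature_span(start, end):
    """Length of a feature with 1-based inclusive coordinates."""
    return end - start + 1


def clipped_length(start, end, seq_len):
    """Bases available if extraction stops at the contig end (assumption)."""
    return max(0, min(end, seq_len) - start + 1)


for seq_id, start, end, seq_len, extracted in REPORTED:
    overhang = end - seq_len  # positive when the feature ends past the contig
    matches = clipped_length(start, end, seq_len) == extracted
    print(seq_id, feature_span(start, end), overhang, matches)
```

In all four cases the feature end is larger than the sequence length, and the clipped length equals the size AGAT reports. That is consistent with annotation coordinates that overrun the contigs in the supplied FASTA file.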

Vijithkumar2020 commented 2 months ago

Thank you for the response. The *.gff file is too large to find the erroneous cases manually. Is there a recommended GFF tool that can isolate a single case based on the seq_id?

Juke34 commented 2 months ago

Use grep or awk to extract the features located on the SVA1_S1_L008_001_contig_737055 sequence, then check the highest position value.
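
The same filtering can also be done in a few lines of Python. This is only a sketch, assuming a standard tab-separated GFF with nine columns. The function name is hypothetical and the example lines are made up for illustration, apart from the contig name and coordinates taken from the first warning.

```python
def max_end_for_seqid(gff_lines, seq_id):
    """Return (number of features, highest end coordinate) for one sequence ID."""
    count, max_end = 0, None
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[0] != seq_id:
            continue  # malformed line or a different sequence
        count += 1
        end = int(cols[4])  # column 5 of a GFF line is the end coordinate
        if max_end is None or end > max_end:
            max_end = end
    return count, max_end


# Illustrative input only; a real run would iterate over the GFF file itself.
example = [
    "##gff-version 3\n",
    "SVA1_S1_L008_001_contig_737055\tAUGUSTUS\tgene\t1199\t1539\t.\t+\t.\tID=g1\n",
    "SVA1_S1_L008_001_contig_737055\tAUGUSTUS\tCDS\t1199\t1539\t.\t+\t0\tID=g1.t1.cds;Parent=g1.t1\n",
    "other_contig\tAUGUSTUS\tgene\t10\t90\t.\t+\t.\tID=g2\n",
]
print(max_end_for_seqid(example, "SVA1_S1_L008_001_contig_737055"))
```

For the real data, an open file handle on combined_output.gff3 can be passed instead of the list, and the returned end compared with the sequence length from the warning (1503 for this contig).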

Vijithkumar2020 commented 2 months ago

Okay, I want to add one more thing: AUGUSTUS was run on multiple split FASTA files (the parent FASTA file was split into smaller files) as parallel jobs. So, 8 individual GFF files were later concatenated to generate the combined GFF. While running AGAT, I specified the original FASTA file. Could this have raised any issues?

Juke34 commented 2 months ago

It might, depending on how you merged the different annotations, because the same gene name may have been used in different files. In that case AGAT may have mixed up the annotation by linking genes with the same name into a single record. The best approach is to manually check the cases reported by AGAT.
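
Whether the concatenation reused identifiers can be checked directly. Below is a rough sketch, assuming GFF3-style `ID=` attributes in column 9. The function name and the example lines are hypothetical. An ID that shows up on two different sequences cannot belong to one real gene, so every hit is a candidate for the mix-up described above.

```python
from collections import defaultdict


def ids_on_multiple_sequences(gff_lines):
    """Map each ID value to its sequence IDs, keeping only IDs seen on more than one sequence."""
    seen = defaultdict(set)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "ID" and value:
                seen[value].add(cols[0])
    return {fid: sorted(seqs) for fid, seqs in seen.items() if len(seqs) > 1}


# Made-up lines mimicking two concatenated files that both start at "g1".
example = [
    "contig_A\tAUGUSTUS\tgene\t1\t300\t.\t+\t.\tID=g1\n",
    "contig_B\tAUGUSTUS\tgene\t50\t400\t.\t-\t.\tID=g1\n",
    "contig_B\tAUGUSTUS\tgene\t500\t900\t.\t+\t.\tID=g2\n",
]
print(ids_on_multiple_sequences(example))
```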

Vijithkumar2020 commented 2 months ago

Thank you for pointing that out. Yes, you're correct that the same gene name was used (e.g., 'g1' appeared in multiple instances), but still, all the seq_ids are unique. Anyway, I will manually check the reported case to narrow it down.

Juke34 commented 2 months ago

To avoid issues related to names shared between files, you can use the AGAT script designed for this purpose. It handles the names and updates them on the fly so that they become unique in the final merged file.

Vijithkumar2020 commented 2 months ago

Are you referring to agat_sp_merge_annotations.pl? I mean, I can take all the individual *.gff files and merge them using this tool.

Juke34 commented 2 months ago

Yes, exactly.
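
As a sketch of what that merge call could look like for the 8 per-chunk files, the snippet below only builds the command line and prints it. The input and output file names are hypothetical. The `--gff` (repeated once per input) and `--out` options follow the AGAT documentation as far as known here, so they are worth confirming with `agat_sp_merge_annotations.pl --help` for the installed version.

```python
import shlex


def build_merge_command(gff_files, out_path):
    """Build the argument list for agat_sp_merge_annotations.pl, one --gff per input."""
    if len(gff_files) < 2:
        raise ValueError("need at least two GFF files to merge")
    cmd = ["agat_sp_merge_annotations.pl"]
    for path in gff_files:
        cmd += ["--gff", path]
    cmd += ["--out", out_path]
    return cmd


# Hypothetical names for the 8 individual AUGUSTUS outputs.
parts = [f"augustus_part{i}.gff3" for i in range(1, 9)]
cmd = build_merge_command(parts, "merged.gff3")
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

The merged file would then replace the concatenated GFF as the `-g` input of agat_sp_extract_sequences.pl.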