CompSynBioLab-KoreaUniv / FunGAP

FunGAP: fungal Genome Annotation Pipeline
109 stars 33 forks source link

Error on wrapper_catch_bad_genes step #58

Open SolayMane opened 3 years ago

SolayMane commented 3 years ago

I run fungap on Docker and I get this error :

[Wrapper] /workspace/FunGAP/catch_bad_genes.py --gff3_files /fungap_workspace/fungap_out/augustus_out/augustus.gff3 /fungap_workspace/fungap_out/maker_out/SRR13198488/maker_SRR13198488.gff3 /fungap_workspace/fungap_out/braker_out/SRR13198488/braker_SRR13198488.gff3 --genome_assembly /fungap_workspace/genome_scaffolds.fasta --output_dir /fungap_workspace/fungap_out/gene_filtering Traceback (most recent call last): File "/workspace/FunGAP/catch_bad_genes.py", line 179, in main() File "/workspace/FunGAP/catch_bad_genes.py", line 58, in main catch_middle_stop(gff3_files, genome_assembly_file, output_dir) File "/workspace/FunGAP/catch_bad_genes.py", line 118, in catch_middle_stop gene_seq = reduce(operator.add, seq_cds) TypeError: reduce() of empty sequence with no initial value Traceback (most recent call last): File "/workspace/FunGAP/fungap.py", line 826, in main() File "/workspace/FunGAP/fungap.py", line 258, in main gff3_files, genome_assembly, output_dir, d_path, logger File "/workspace/FunGAP/fungap.py", line 732, in catch_bad_genes check_call(command_args) File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['/workspace/FunGAP/catch_bad_genes.py', '--gff3_files', '/fungap_workspace/fungap_out/augustus_out/augustus.gff3', '/fungap_workspace/fungap_out/maker_out/SRR13198488/maker_SRR13198488.gff3', '/fungap_workspace/fungap_out/braker_out/SRR13198488/braker_SRR13198488.gff3', '--genome_assembly', '/fungap_workspace/genome_scaffolds.fasta', '--output_dir', '/fungap_workspace/fungap_out/gene_filtering']' returned non-zero exit status 1.

mbnmbn00 commented 3 years ago

Hello, Thank you for the report. That error was caused because recent Braker introduced a bug. The quick fix would be:

  1. Open braker_PREFIX.gff3 (in the braker_out/PREFIX directory)
  2. Remove lines with GeneMark (You may want to make a backup)
  3. Remove gene_filtering folder
  4. Re-run FunGAP with the same command (so it can resume)

I will figure out how FunGAP handles the bug while the Braker developers are working on it. Let me know if you have further questions!

SolayMane commented 3 years ago

thank you for your reply, here is the output of what I have in culumn 2 of the file : cut -f 2 fungap_out/braker_out/SRR13198488/braker_SRR13198488.gff3 | sort | uniq AUGUSTUS no line of GeneMark.

mbnmbn00 commented 3 years ago

Hmm. Could you send me Braker GFF3 and genome assembly you used? My e-mail address is here: mbnmbn00@gmail.com

If you cannot share the files, you may send the part of (e.g., first 10 genes) the Braker GFF3 file.

SolayMane commented 3 years ago

here are the 1st 10 lines of braker_SRR13198488.gff3

Chr_4 AUGUSTUS gene 2037937 2038673 0.96 + . ID=g9539; Chr_4 AUGUSTUS mRNA 2037937 2038673 0.96 + . ID=g9539.t1;Parent=g9539; Chr_4 AUGUSTUS start_codon 2037937 2037939 . + 0 ID=g9539.t1.start1;Parent=g9539.t1; Chr_4 AUGUSTUS CDS 2037937 2037961 0.97 + 0 ID=g9539.t1.CDS1;Parent=g9539.t1; Chr_4 AUGUSTUS exon 2037937 2037961 . + . ID=g9539.t1.exon1;Parent=g9539.t1; Chr_4 AUGUSTUS intron 2037962 2038086 1 + . ID=g9539.t1.intron1;Parent=g9539.t1; Chr_4 AUGUSTUS CDS 2038087 2038673 0.99 + 2 ID=g9539.t1.CDS2;Parent=g9539.t1; Chr_4 AUGUSTUS exon 2038087 2038673 . + . ID=g9539.t1.exon2;Parent=g9539.t1; Chr_4 AUGUSTUS stop_codon 2038671 2038673 . + 0 ID=g9539.t1.stop1;Parent=g9539.t1; Chr_11 AUGUSTUS gene 2926489 2927424 0.99 + . ID=g3985;

mbnmbn00 commented 3 years ago

Those lines look good. It might be one poor gene coordinates.

seq_cds = []
for feature in mrna_sub_features_s:
    if feature.type != 'CDS':
        continue
    seq_cds.append(rec.seq[feature.location.start:feature.location.end])
gene_seq = reduce(operator.add, seq_cds)

So the program complains that seq_cds is empty, which is not possible for the normal entries. I can update the code so that it ignores the empty genes, but I would like to see which entries have such cases.

If you don't mind, could you put the below lines in the catch_bad_genes.py and run the code for me? Right above line 118,

if not seq_cds:
    print(mrna_feature)
    continue
gene_seq = reduce(operator.add, seq_cds)  # This is line 118

Your catch_bad_genes.py command would be:

/workspace/FunGAP/catch_bad_genes.py \
  --gff3_files /fungap_workspace/fungap_out/augustus_out/augustus.gff3 \
    /fungap_workspace/fungap_out/maker_out/SRR13198488/maker_SRR13198488.gff3 \
    /fungap_workspace/fungap_out/braker_out/SRR13198488/braker_SRR13198488.gff3 \
  --genome_assembly /fungap_workspace/genome_scaffolds.fasta \
  --output_dir /fungap_workspace/fungap_out/gene_filtering
SolayMane commented 3 years ago

Thank you it works, below the output of the modified scripts : type: start_codon location: 2391384:2391387 id: g15364.t1.start1 qualifiers: Key: ID, Value: ['g15364.t1.start1'] Key: Parent, Value: ['g15364.t1'] Key: phase, Value: ['0'] Key: source, Value: ['AUGUSTUS']

type: CDS location: 2391384:2391551 id: g15364.t1.CDS1 qualifiers: Key: ID, Value: ['g15364.t1.CDS1'] Key: Parent, Value: ['g15364.t1'] Key: phase, Value: ['0'] Key: score, Value: ['0.97'] Key: source, Value: ['AUGUSTUS']

type: exon location: 2391384:2391551 id: g15364.t1.exon1 qualifiers: Key: ID, Value: ['g15364.t1.exon1'] Key: Parent, Value: ['g15364.t1'] Key: source, Value: ['AUGUSTUS']

type: intron location: 2391551:2391604 id: g15364.t1.intron1 qualifiers: Key: ID, Value: ['g15364.t1.intron1'] Key: Parent, Value: ['g15364.t1'] Key: score, Value: ['0.97'] Key: source, Value: ['AUGUSTUS']

type: CDS location: 2391604:2393408 id: g15364.t1.CDS2 qualifiers: Key: ID, Value: ['g15364.t1.CDS2'] Key: Parent, Value: ['g15364.t1'] Key: phase, Value: ['1'] Key: score, Value: ['0.96'] Key: source, Value: ['AUGUSTUS']

type: exon location: 2391604:2393408 id: g15364.t1.exon2 qualifiers: Key: ID, Value: ['g15364.t1.exon2'] Key: Parent, Value: ['g15364.t1'] Key: source, Value: ['AUGUSTUS']

type: stop_codon location: 2393405:2393408 id: g15364.t1.stop1 qualifiers: Key: ID, Value: ['g15364.t1.stop1'] Key: Parent, Value: ['g15364.t1'] Key: phase, Value: ['0'] Key: source, Value: ['AUGUSTUS']

mbnmbn00 commented 3 years ago

Glad it helped! Though, I still don't get what caused the error. I will keep an eye on it.

If you don't mind, could you click the "star" for this repository?

lexming commented 1 year ago

Even though this issue is closed, I'm facing this very same problem trying to run the test case in the documentation at https://github.com/CompSynBioLab-KoreaUniv/FunGAP/blob/master/INSTALL.md#test-run

Everything works as expected for all those entries with mRNA as feature type and some list of subfeatures. For instance, the last working one is

FEATURE type: mRNA
location: [20671:20966](-)
id: snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1
qualifiers:
    Key: ID, Value: ['snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1']
    Key: Name, Value: ['snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1']
    Key: Parent, Value: ['snap_masked-NC_001224.1-processed-gene-0.29']
    Key: _AED, Value: ['1.00']
    Key: _QI, Value: ['0|0|0|0|1|1|2|0|95']
    Key: _eAED, Value: ['1.00']
    Key: source, Value: ['maker']

SUBFEATURES: [SeqFeature(FeatureLocation(ExactPosition(20960), ExactPosition(20966), strand=-1), type='exon', id='snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1:2'), SeqFeature(FeatureLocation(ExactPosition(20671), ExactPosition(20953), strand=-1), type='exon', id='snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1:1'), SeqFeature(FeatureLocation(ExactPosition(20960), ExactPosition(20966), strand=-1), type='CDS', id='snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1:cds'), SeqFeature(FeatureLocation(ExactPosition(20671), ExactPosition(20953), strand=-1), type='CDS', id='snap_masked-NC_001224.1-processed-gene-0.29-mRNA-1:cds')]

However, at some point, all entries look like the following, without any subfeatures:

FEATURE type: start_codon
location: [186919:186922](+)
id: file_1_file_1_g4549.t1.start1
qualifiers:
    Key: ID, Value: ['file_1_file_1_g4549.t1.start1']
    Key: Parent, Value: ['file_1_file_1_g4549.t1']
    Key: phase, Value: ['0']
    Key: source, Value: ['AUGUSTUS']

SUBFEATURE []

The lack of subfeatures is what ultimately causes the empty sequence in seq_cds.

Are these safe to ignore?

mbnmbn00 commented 1 year ago

Thanks for reporting this. That seems a problem. While I'm running the pipeline myself, could you more information?

Where do you think the prefix "file_1_file1" come from?

lexming commented 1 year ago

The file "file_1_file_1_g4549" is generated by BRAKER. I'm using version 2.1.6.

It seems that we are hitting the same problem as in https://github.com/Gaius-Augustus/TSEBRA/issues/2

mbnmbn00 commented 1 year ago

It seems that they fixed it, but conda version was not updated. (And it seems conda installation is no longer updated) I installed 2.1.6 via conda and am testing it now.

If it doesn't work, the options I have are: 1) Add a step to fix the bug in the installation 2) Give up conda and install the Braker manually 3) Force users to use 2.1.5, which works fine

I think I'm going to go with option 3) because it's the simplest. So, could you switch into 2.1.5 and try again?

If you really want to use 2.1.6, then you may update the code as shown here: https://github.com/Gaius-Augustus/BRAKER/commit/f38630c1cad3e11b525f84d517c7949cb4c2d7eb

Let me know what you think!

lexming commented 1 year ago

Thanks for the feedback. I confirm that BRAKER v2.1.6 with patch https://github.com/Gaius-Augustus/BRAKER/commit/f38630c1cad3e11b525f84d517c7949cb4c2d7eb does work. No issues. :+1:

mbnmbn00 commented 1 year ago

Was there any easy way to apply the patch? Did you do it manually?

lexming commented 1 year ago

I applied the patch manually to the sources of v2.1.6. We use EasyBuild to install software in our HPC cluster, which makes it easy to apply such patches. I also opened a PR to add that patch to the official recipe for BRAKER v2.1.6 in EasyBuild (see https://github.com/easybuilders/easybuild-easyconfigs/pull/17631)

On your side, I suggest to increase the minimum version of BRAKER for FunGAP to v3.0.2, which is the lowest version with this fix.