Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Compleasm_to_hints error #752

Closed SamCT closed 4 months ago

SamCT commented 5 months ago

Hello Dr. Hoff,

Great meeting you at PAG.

I am seeing a compleasm error when compleasm is enabled (--busco_lineage=embryophyta). This is within a standard BRAKER3 run, providing an RNA-seq .bam file, a protein file, and a softmasked genome.

Trying to execute the following command:

/opt/compleasm_kit/compleasm.py run -l embryophyta -a /data/genome.fa -t 36 -o compleasm_genome_out
Suceeded in executing command.
Traceback (most recent call last):
  File "/usr/share/augustus/scripts/compleasm_to_hints.py", line 162, in <module>
    main()
  File "/usr/share/augustus/scripts/compleasm_to_hints.py", line 132, in main
    busco_ids = extract_tx_ids_from_tsv(args.scratch_dir + '/' + args.database + '/full_table.tsv')
  File "/usr/share/augustus/scripts/compleasm_to_hints.py", line 65, in extract_tx_ids_from_tsv
    with open(tsv_file, newline='') as csvfile:
FileNotFoundError: [Errno 2] No such file or directory: 'compleasm_genome_out/embryophyta/full_table.tsv'

I built a new braker3.sif but am not sure whether it contains the bug fixes. I have a new genome, and the annotation's BUSCO score is stubbornly low no matter what combination of BRAKER3, BRAKER3-Isoseq, GALBA, or GenomeThreader hints files I try to combine with TSEBRA. This is my last resort :) Please let me know your thoughts.

Thanks, Sam

jbh-cas commented 4 months ago

I believe the problem is that although compleasm.py will accept embryophyta as the lineage name, the compleasm_to_hints.py script requires an _odb10 suffix, so embryophyta_odb10 is needed for your --busco_lineage arg.

compleasm.py adds the _odb10 part onto the lineage if it is not there and creates the dir with that name. Unfortunately compleasm_to_hints.py does not make that addition and so looks for the directory without the _odb10 suffix when the dir name has it.
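The mismatch described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from compleasm.py or compleasm_to_hints.py; the helper names normalize_lineage and full_table_path are hypothetical, and the directory layout (scratch dir / lineage / full_table.tsv) is taken from the traceback in the first comment.

```python
from pathlib import Path

ODB_SUFFIX = "_odb10"  # the suffix discussed in this thread


def normalize_lineage(lineage: str) -> str:
    """Return the lineage name with the _odb10 suffix appended if it is missing."""
    return lineage if lineage.endswith(ODB_SUFFIX) else lineage + ODB_SUFFIX


def full_table_path(scratch_dir: str, lineage: str) -> Path:
    """Build the full_table.tsv path using the normalized lineage directory name.

    Hypothetical helper: it mirrors the path pattern seen in the traceback,
    but is not part of any BRAKER, Augustus, or compleasm script.
    """
    return Path(scratch_dir) / normalize_lineage(lineage) / "full_table.tsv"


if __name__ == "__main__":
    # Both spellings of the lineage resolve to the same directory.
    print(full_table_path("compleasm_genome_out", "embryophyta"))
    print(full_table_path("compleasm_genome_out", "embryophyta_odb10"))
```

With a normalization step like this on the reading side, both embryophyta and embryophyta_odb10 would point at the directory that compleasm.py actually creates.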

If you just change that, using embryophyta_odb10, and run again using the same output directory it might work. It did in my case at least, just redoing some of the compleasm work but retaining the other work it had completed. Though that being said my run is not yet completed so other surprises may await.

Good luck.

SamCT commented 4 months ago

So I did manage to fix this, and thanks to @jbh-cas I did add the full suffix and it worked.

However, I don't know if the compleasm merge is doing its job correctly, since it doesn't appear to be actually merging the BUSCOs. Here is the best_by_compleasm.log:

BRAKER is missing 13.26 BUSCOs.
GeneMark is missing 2.85 BUSCOs.
Augustus is missing 2.54 BUSCOs.
All BUSCOs present in augustus.hints.gtf and genemark.gtf will be added to the braker.gtf gene set.
Attempted to merge additional BUSCOs onto braker.gtf but there are no BUSCOs to be added.
The BRAKER gene set /data/Grapevine/BRAKER/B3/B3_Comp/braker/braker.gtf is the best one. It lacks 13.26% BUSCOs.

jbh-cas commented 4 months ago

SamCT,

I had nearly the same result for a skink as you did for an embryophyta, as shown below (... replaces server specific info), 13.41% missing BUSCOs chosen when 2.55% and 6.72% sets were available. And I had previously run BRAKER3 in etp mode and it found 40 genes and 93 mRNAs more than this etpc mode run, which only differed in using the --busco_lineage command argument.

I plan to revisit next week when I have some more time to devote, but I'm hoping that there is some problem in the TSEBRA/bin/best_by_compleasm.py script.

I know there must have been a push to get the work done for PAG and breaks are well-deserved, tho it would be nice to have some acknowledgment that the BRAKER folks have seen some of this (Neng Huang, compleasm's author, recently was not only gracious enough to acknowledge a request but made a small addition in about a day).

$ cat best_by_compleasm.log
.../Augustus/scripts/getAnnoFastaFromJoingenes.py -g .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/genome.fa -f .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/GeneMark-ETP/genemark.gtf -o .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/bbc/genemark
/ccg/bin/compleasm.git/compleasm.py protein -p .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/bbc/genemark.aa -l sauropsida_odb10 -t 48 -o .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/bbc/genemark
/ccg/bin/compleasm.git/compleasm.py protein -p .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/braker.aa -l sauropsida_odb10 -t 48 -o .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/bbc/braker
/ccg/bin/compleasm.git/compleasm.py protein -p .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/augustus.hints.aa -l sauropsida_odb10 -t 48 -o .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/bbc/augustus
BRAKER is missing 13.41 BUSCOs.
GeneMark is missing 6.72 BUSCOs.
Augustus is missing 2.55 BUSCOs.
All BUSCOs present in augustus.hints.gtf and genemark.gtf will be added to the braker.gtf gene set.
Attempted to merge additional BUSCOs onto braker.gtf but there are no BUSCOs to be added.
The BRAKER gene set .../Spondylurus_culebrae/anno/braker4_etpc/output_etpc/braker.gtf is the best one. It lacks 13.41% BUSCOs
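A quick way to spot this inconsistency in a best_by_compleasm.log is to pull out the three "is missing" percentages and compare them. The sketch below assumes only the log line format shown in this thread; it does not reproduce the logic of best_by_compleasm.py, and the function names are made up for illustration.

```python
import re

# Matches lines such as "Augustus is missing 2.55 BUSCOs." as they appear in the log.
MISSING_RE = re.compile(r"(BRAKER|GeneMark|Augustus) is missing ([0-9.]+) BUSCOs\.")


def missing_by_gene_set(log_text: str) -> dict:
    """Map each gene set name to the missing-BUSCO percentage reported in the log."""
    return {name: float(pct) for name, pct in MISSING_RE.findall(log_text)}


def lowest_missing(missing: dict) -> str:
    """Return the gene set with the smallest reported missing percentage."""
    return min(missing, key=missing.get)


# Values copied from the log above.
EXAMPLE_LOG = """BRAKER is missing 13.41 BUSCOs.
GeneMark is missing 6.72 BUSCOs.
Augustus is missing 2.55 BUSCOs.
"""

if __name__ == "__main__":
    missing = missing_by_gene_set(EXAMPLE_LOG)
    print(missing)
    print("lowest missing:", lowest_missing(missing))
```

If the gene set with the lowest missing percentage is not braker.gtf and the log still reports that there are no BUSCOs to be added, that is the symptom described in this comment.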

best, Jim Henderson

KatharinaHoff commented 4 months ago

I am currently on vacation. I did push a docker container with the latest compleasm after PAG. I will look at this when I am back.

(Sent as an email reply to Jim Henderson's comment of Sat, 10 Feb 2024, 00:50.)

jbh-cas commented 4 months ago

Thank you very much for the ACK and, most of all, have a wonderful vacation! GitHub needs an out-of-office message setting like email clients have.

We appreciate all that you and the team do for the genome annotation community.

I have been doing git pulls on 4 repos: Augustus, TSEBRA, BRAKER and GeneMark-ETP. I hope that keeps me current.

-jbh

(Sent as an email reply to Katharina Hoff's comment of 02/10/2024, 3:04 AM PST.)

KatharinaHoff commented 4 months ago

To be up to date, you also need the latest compleasm (a new version was released around the middle of January). I do not use the official ETP in the current container; I use my own fork (because @rchikhi made a bugfix that was not incorporated into the official ETP, and I made the isoseq extension in my own fork).

I made a commit to the Augustus repository to append the _odb10 suffix to the lineage in compleasm_to_hints.py https://github.com/Gaius-Augustus/Augustus/commit/e42b3354b8a7fd02af82364d2ab44dc08bb2a050 but this will not solve your actual problem.

Something else seems to be wrong with the best_by_compleasm.py script from TSEBRA. It would be easiest to look at this if e.g. @SamCT shared the data with me in some way. I basically need the input files that are used for best_by_compleasm.py . If the genome file is way too large to share, please let me know. I could also rewrite the best_by_compleasm.py script to run with the protein sequence input - but if possible, I'd like to focus on fixing the problem, only. (Too many open ends at the moment.) I can provide a storage link via e-mail if needed. Please share data to katharina.hoff@uni-greifswald.de .

jbh-cas commented 4 months ago

I thought I would show my outputs and see if that helps. I am just starting to look at the best_by python script, but shouldn't bbc have a few more AA and codingseq files?

$ tree -t -L 2 $(pwd)
/home/drivera/Spondylurus_culebrae/anno/braker4_etpc/output_etpc
├── genome_header.map
├── hintsfile.gff
├── species
│   └── SponCul
├── what-to-cite.txt
├── braker.gtf
├── braker.codingseq
├── braker.aa
├── best_by_compleasm.log
├── braker.gff3
├── bbc
│   ├── genemark.codingseq
│   ├── genemark.aa
│   ├── augustus
│   ├── braker
│   └── genemark
├── errors
│   ├── compleasm_to_hints.stderr
│   ├── samtools.sort.SponCul_liver.stderr
│   ├── samtools.sort.SponCul_lung.stderr
│   ├── GeneMark-ETP.stderr
│   ├── GeneMark-ETP.stdout
│   ├── gbFilterEtraining.stderr
│   └── aa2nonred.stderr
├── GeneMark-ETP
│   ├── etp_config.yaml
│   ├── prothint_gmst.log
│   ├── filter_gmst.log
│   ├── genemark.gtf
│   ├── genemark_supported.gtf
│   ├── training.gtf
│   ├── rnaseq
│   └── proteins.fa
├── Augustus
│   ├── augustus.hints.gtf
│   ├── augustus.hints.codingseq
│   ├── augustus.hints.aa
│   └── augustus.hints.gff3
├── braker.log

Here we can see that compleasm protein was called for all three gene sets:

$ tree -t -L 2 $(pwd)
/home/drivera/Spondylurus_culebrae/anno/braker4_etpc/output_etpc/bbc
├── genemark.codingseq
├── genemark.aa
├── augustus
│   ├── sauropsida_odb10_hmmsearch_output
│   ├── full_table.tsv
│   └── summary.txt
├── braker
│   ├── sauropsida_odb10_hmmsearch_output
│   ├── full_table.tsv
│   └── summary.txt
└── genemark
    ├── sauropsida_odb10_hmmsearch_output
    ├── full_table.tsv
    └── summary.txt
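The same check can be done programmatically. The following is a minimal sketch that assumes the bbc layout shown in the tree listing above (one subdirectory per gene set, each with full_table.tsv and summary.txt); the function name missing_outputs is hypothetical and not part of TSEBRA or BRAKER.

```python
from pathlib import Path

# Subdirectories and files taken from the tree listing in this comment.
GENE_SETS = ("augustus", "braker", "genemark")
EXPECTED_FILES = ("full_table.tsv", "summary.txt")


def missing_outputs(bbc_dir: str) -> list:
    """List the expected compleasm output files that are absent under bbc_dir."""
    bbc = Path(bbc_dir)
    return [
        (bbc / gene_set / name).as_posix()
        for gene_set in GENE_SETS
        for name in EXPECTED_FILES
        if not (bbc / gene_set / name).is_file()
    ]


if __name__ == "__main__":
    import sys

    # Usage: python check_bbc.py /path/to/braker_output/bbc
    absent = missing_outputs(sys.argv[1] if len(sys.argv) > 1 else "bbc")
    print("all expected files present" if not absent else "\n".join(absent))
```

An empty result only shows that compleasm wrote its tables for all three gene sets; it says nothing about whether the later merge step reads them correctly.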

KatharinaHoff commented 4 months ago

For the data of @SamCT, I can confirm that the problem is now fixed. The fixed script is currently only in the TSEBRA repository. It will take me a little while to build and test the new BRAKER docker container, and I will not do it right now. The script can be executed stand-alone with an existing previous BRAKER output directory. I want to fix another issue before pushing the docker container.

KatharinaHoff commented 4 months ago

@jbh-cas your output files look alright. It was a trivial bug. Somewhere down the line, the hmmer search output directory name got truncated.

jbh-cas commented 4 months ago

Thank you very much!

I am rerunning from the beginning with BRAKER and TSEBRA and Augustus mods (I have compleasm.py 0.2.5). I'm calling this BRAKER4 internally but don't know how you are terming ETPC mode.

I tried to run just the best_by_compleasm TSEBRA script but the genome.fa file is deleted in cleanup at the end of the BRAKER run. I know I could reconstitute it but I'll just let version 3.0.8 run overnight to check.

Though I have BAMs now, I am rerunning with RNA-seq FASTQ files to invoke hisat2 and check that execution pathway. The two cd-to-workdir calls added in braker.pl are a good belt-and-suspenders mod.