EBI-Metagenomics / EukCC

Tool to estimate genome quality of microbial eukaryotes
GNU General Public License v3.0

Missing bed entry error in protein mode #6

Open halexand opened 4 years ago

halexand commented 4 years ago

Hello,

I am curious to try out eukcc with some eukaryotic MAGs. I have already predicted proteins and am trying to run eukcc with the predicted proteins and the associated bed files. However, I get an error when I include the bed file, which I think must be due to the format of my bed file.

The command I am using is: eukcc --db /vortexfs1/omics/alexander/data/databases/eukccdb -o NAO-all-SRF-20-180-00_bin-42 --protein NAO-all-SRF-20-180-00_bin-42.all.maker.proteins.fasta --bed NAO-all-SRF-20-180-00_bin-42.bed --ncores 8

07/10/2020 19:27:18:  Starting EukCC
07/10/2020 19:27:18:  Searching for proteins to place in the tree
07/10/2020 19:27:24:  Processing Hmmer results
Could not find entry in bed file for genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1
Could not find entry in bed file for genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1
Could not find entry in bed file for genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1
Traceback (most recent call last):
  File "/vortexfs1/home/halexander/.conda/envs/eukcc/bin/eukcc", line 10, in <module>
    sys.exit(main())
  File "/vortexfs1/home/halexander/.conda/envs/eukcc/lib/python3.6/site-packages/eukcc/__main__.py", line 215, in main
    m.place(proteinfaa, bedfile)
  File "/vortexfs1/home/halexander/.conda/envs/eukcc/lib/python3.6/site-packages/eukcc/workflow.py", line 411, in place
    hitOut = h.clean(hmmOut, bedfile, hitOut, self.cfg["mindist"])
  File "/vortexfs1/home/halexander/.conda/envs/eukcc/lib/python3.6/site-packages/eukcc/exec.py", line 274, in clean
    n[k] = int(n[k])
KeyError: 'start'

Notably, when I grep for the missing bed entry in the bed file, it comes up:

NAO-all-SRF-20-180-00_k119_9849810      1382    1650    genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1:cds       .       -       maker$
DS      1       ID=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1:cds;Parent=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gen$
-0.1-mRNA-1
NAO-all-SRF-20-180-00_k119_9849810      1382    1650    genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1:exon:36678        .       -    maker    exon    .       ID=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1:exon:36678;Parent=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1
NAO-all-SRF-20-180-00_k119_9849810      1382    4175    genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1   .       -       maker   mRNA    .       ID=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1;Parent=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1;Name=genemark-NAO-all-SRF-20-180-00_k119_9849810-processed-gene-0.1-mRNA-1;_AED=1.00;_eAED=1.00;_QI=0|0|0|0|1|1|5|0|454

I created the bed file by passing a gff3 file created by maker2 with gff2bed < in.gff > out.bed. Is there a preferred method for creating a bed file? Or am I missing something else? Thank you!
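One caveat about the grep check: grep reports a line whenever the ID occurs anywhere in it, so a hit does not show how many rows carry exactly that ID in the name column. A minimal Python sketch of a stricter check follows. It assumes the protein IDs are the first whitespace-delimited token of each FASTA header and that the bed name sits in column 4, as in gff2bed output. The function name `bed_name_counts` is hypothetical, and this does not reproduce EukCC's own lookup, which is not shown in this thread.

```python
from collections import Counter


def bed_name_counts(fasta_lines, bed_lines):
    """Map each FASTA record ID to the number of BED rows whose column 4 equals it exactly."""
    # Count exact occurrences of every name found in BED column 4.
    names = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4:
            names[fields[3]] += 1

    # Look up each protein ID (first token after '>') in those counts.
    counts = {}
    for line in fasta_lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                counts[tokens[0]] = names[tokens[0]]
    return counts
```

Passing the open protein FASTA and bed files to this function gives a count per protein: 0 means no row is named exactly after the protein, and values above 1 mean several rows share the name.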

openpaul commented 4 years ago

Hi, sorry for the delay, I was on vacation (does GitHub have a notification system for that?). Glad you want to test EukCC. Which version are you running?

eukcc --version

To address the problem: we noticed that the bed file can be omitted in most cases, so for now I would recommend simply removing the --bed flag; EukCC should then run fine. To explain: EukCC uses the bed file, which gives the start of the first exon and the stop of the last exon of each protein, to remove close hits of the same PANTHER profile. Each protein should therefore have only one entry, which is different from what you supplied. It's a format we agreed on using, but it's not perfect and will likely be replaced in the update of the database.
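As a rough illustration of "one entry per protein", the Python sketch below reduces gff2bed output to the mRNA rows only, since an mRNA feature spans from the first to the last exon of its transcript. It assumes the gff2bed column layout seen in the bed excerpt earlier in this thread (feature type in column 8, name in column 4). The function name `one_entry_per_transcript` is hypothetical, and keeping the first six columns is an assumption: the exact columns EukCC expects are not stated in this thread.

```python
def one_entry_per_transcript(bed_lines):
    """Return one [chrom, start, end, name, score, strand] row per mRNA feature."""
    rows = []
    seen = set()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        # In gff2bed output, column 8 holds the GFF feature type (CDS, exon, mRNA, ...).
        if len(fields) < 8 or fields[7] != "mRNA":
            continue
        name = fields[3]
        # Skip repeated names so each transcript appears only once.
        if name in seen:
            continue
        seen.add(name)
        # Assumption: keep only the six standard BED columns.
        rows.append(fields[:6])
    return rows
```

The returned rows can be written back out tab-separated, one per line. Whether the resulting file satisfies EukCC would still need to be checked against the version in use.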

If you are willing to share your protein files and the bed file, I would be very interested in fixing this bug, as it would be best for EukCC to work as expected. Feel free to email me (saary@ebi.ac.uk).

halexand commented 4 years ago

Hi @openpaul,

No worries, and thanks for the reply! I ended up dropping the --bed flag and running it with just the protein file. It runs so quickly that it isn't a big deal to rerun with --bed. I am glad to hear that the problem the bed file addresses isn't a huge issue!

I am running EukCC version 0.1.5.1. I can email you some files as I think they are too large to attach.

openpaul commented 4 years ago

Glad to hear it worked out for you. Yes, feel free to email me so that I can possibly address the core issue.