ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0

biotype = transcript_biotype_map[tx_id] #78

Closed fbemm closed 6 years ago

fbemm commented 6 years ago

Hey,

I am getting a biotype-related error:

Runtime error:
Traceback (most recent call last):
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 191, in run
    new_deps = self._run_get_new_deps()
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 129, in _run_get_new_deps
    task_gen = self.task.run()
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/__init__.py", line 1177, in run
    tm_args.local_near_best, json_target)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/filter_transmap.py", line 92, in filter_transmap
    biotype = transcript_biotype_map[tx_id]
KeyError: 'rna57427'

The gp_attrs file looks fine:

rna26495 ID rna26495
rna26495 Parent gene16661
rna26495 Dbxref GeneID:3768844
rna26495 Note pre-tRNA-tRNA-Ile (anticodon: TAT)
rna26495 gbkey tRNA
rna26495 gene 60263.TRNA-ILE-1
rna26495 product tRNA-Ile

Not sure about the gp file though:

rna26495 Chr3 + 1738790 1738864 1738864 1738864 1 1738790, 1738864, 0 gene16661 incmpl incmpl -1,

The GFF file looks OK but lacks a transcript-level feature, as expected:

Chr3 RefSeq tRNA 1738791 1738864 . + . ID=rna26495;Parent=gene16661;Dbxref=GeneID:3768844,Araport:AT3G05835,TAIR:AT3G05835;Note=pre-tRNA-tRNA-Ile (anticodon: TAT);gbkey=tRNA;gene=60263.TRNA-ILE-1;product=tRNA-Ile
Chr3 RefSeq exon 1738791 1738864 . + . ID=id144254;Parent=rna26495;Dbxref=GeneID:3768844,Araport:AT3G05835,TAIR:AT3G05835;Note=pre-tRNA-tRNA-Ile (anticodon: TAT);gbkey=tRNA;gene=60263.TRNA-ILE-1;product=tRNA-Ile

Any idea? F

ifiddes commented 6 years ago

Hmmm.... I am sure this is somehow related to my parsing of the NCBI GFF3 still. Can you share your database file again?

fbemm commented 6 years ago

Here is the db file:

https://drive.google.com/open?id=0BwX5gnQzGAU8MUVhUnJvY011c28

It's NCBI's A. thaliana GFF.

ifiddes commented 6 years ago

Yes, this is related to my fix for handling the updated version of gff3ToGenePred, and to how NCBI GFF3 files have broken tRNA records (no transcript-level feature). Thanks for pointing me to the actual GFF3; it was very helpful in figuring this out.
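
The KeyError in the original report arises when a transcript ID that appears in the genePred has no entry in the biotype map. A minimal sketch of a pre-check that reports all such IDs at once, rather than failing on the first one; `find_missing_biotypes` and the toy data are hypothetical and not part of CAT:

```python
def find_missing_biotypes(tx_ids, transcript_biotype_map):
    """Return the sorted transcript IDs that have no biotype entry."""
    return sorted(set(tx_ids) - set(transcript_biotype_map))


# Toy data for illustration only; the biotype value is made up for this sketch.
tx_ids = ["rna26495", "rna57427"]
transcript_biotype_map = {"rna26495": "tRNA"}

missing = find_missing_biotypes(tx_ids, transcript_biotype_map)
if missing:
    print("transcripts without a biotype: " + ", ".join(missing))
```

Running a check like this before filtering would point directly at the records whose attributes were not parsed.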

I have pushed a commit that seems to fix this for the A. thaliana GFF3. I am ready for the next genome you use to try to break it again :)

fbemm commented 6 years ago

Looks like I did it again. Not sure if that is related.

Runtime error:
Traceback (most recent call last):
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 191, in run
    new_deps = self._run_get_new_deps()
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/luigi/worker.py", line 129, in _run_get_new_deps
    task_gen = self.task.run()
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/__init__.py", line 806, in run
    gene_name = d['Name']
KeyError: 'Name'

ifiddes commented 6 years ago

Oops, made a change in test that I didn't put onto master. Should be fixed now, sorry.

fbemm commented 6 years ago

The biotype issue seems to be fixed. Now I see a failure similar to #58. Going to dig out what is passed to grep.

burrito 2017-10-26 15:13:12,555 MainThread WARNING toil.leader: The job seems to have left a log file, indicating failure: 'join_genes' e/O/jobTtAJ4H
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    ---TOIL WORKER OUTPUT LOG---
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    INFO:toil:Running Toil version 3.8.0-4c83830e4f42594d995e01ccc07b47396b88c9e7.
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    WARNING:toil.resource:Can't find resource for leader path '/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat'
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit', name='cat.augustus_cgp', fromVirtualEnv=F
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    INFO:toil.fileStore:Starting job ('join_genes' e/O/jobTtAJ4H) with ID (0f475183f3af6f968a0b4285631fdeb2111f67a5).
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    WARNING:toil.resource:Can't find resource for leader path '/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat'
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit', name='cat.augustus_cgp', fromVirtualEnv=F
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H    Traceback (most recent call last):
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1289, in _runner
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H        returnValues = self._run(jobGraph, fileStore)
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1234, in _run
burrito 2017-10-26 15:13:12,556 MainThread WARNING toil.leader: e/O/jobTtAJ4H        return self.run(fileStore)
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1406, in run
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/augustus_cgp.py", line 303, in join_genes
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H        tools.procOps.run_proc(cmd, stdout=join_genes_file)
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/procOps.py", line 36, in run_proc
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H        pl.wait()
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1127, in wait
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H        self.raiseIfExcept()
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1085, in raiseIfExcept
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H        p.raiseIfExcept()
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H      File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/tools/pipeline.py", line 749, in raiseIfExcept
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H        raise self.exceptInfo[0], self.exceptInfo[1], self.exceptInfo[2]
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H    ProcException: process exited 1: grep -P "     AUGUSTUS        (exon|CDS|start_codon|stop_codon|tts|tss)       "
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H    ERROR:toil.worker:Exiting the worker because of a failed job on host burrito
burrito 2017-10-26 15:13:12,557 MainThread WARNING toil.leader: e/O/jobTtAJ4H    WARNING:toil.jobGraph:Due to failure we are reducing the remaining retry count of job 'join_genes' e/O/jobTtAJ4H with ID e/O/jobTtAJ4H to 0
burrito 2017-10-26 15:13:12,558 MainThread WARNING toil.leader: Job 'join_genes' e/O/jobTtAJ4H with ID e/O/jobTtAJ4H is completely failed
burrito 2017-10-26 15:13:12,562 MainThread INFO toil.leader: Job ended successfully: 'join_genes' K/Y/jobC1CfJP

ifiddes commented 6 years ago

This means that grep matched nothing, so none of those feature types (exon/CDS/start_codon/stop_codon/tts/tss) are present in the input file. That in turn means something went very wrong with augustusCGP.
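
For context, grep exits with status 1 when no line matches, which is why an empty result surfaces as a failed process. The sketch below shows an equivalent filter in pure Python that fails with a message naming the cause. It assumes a tab-separated GTF with the source in column 2 and the feature type in column 3; `filter_augustus_features` is a hypothetical helper, not CAT's implementation, and the feature list is taken from the grep pattern in the log.

```python
FEATURES = {"exon", "CDS", "start_codon", "stop_codon", "tts", "tss"}


def filter_augustus_features(lines):
    """Keep GTF lines whose source is AUGUSTUS and whose feature is in FEATURES."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[1] == "AUGUSTUS" and fields[2] in FEATURES:
            kept.append(line)
    if not kept:
        # Plays the role of grep's exit status 1, but states what was missing.
        raise ValueError("no AUGUSTUS exon/CDS/codon/tts/tss features in input")
    return kept


# Toy input: one comment line and one matching record.
example = [
    "# comment line\n",
    'Chr3\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "t1"; gene_id "g1";\n',
]
print(len(filter_augustus_features(example)))
```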

Did you set the --augustus-species flag to something relevant to A. thaliana? Does that model have a UTR model? I assume it must, or you must have set --augustus-utr-off; otherwise the augustusCGP module would not have made it to this point.

Unfortunately, this is going to be tricky to debug because it is wrapped in the CGP toil process. Somewhere buried in your toil folder is what will become the raw augustus GTF file. Inspecting that might help us figure out what is going on. One ugly way to do this would be to grep for "BEGIN CHUNK" in the toil fileStore folder. (Are you using --disableCaching? If so, it will be the location of --workDir, not to be confused with --work-dir, which is super confusing, sorry about that.)
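
The search suggested above can also be done without grep. A minimal sketch, assuming only that the marker string appears literally in the file; `find_files_containing` is a hypothetical helper and the directory to search is whatever your toil working directory is:

```python
import os


def find_files_containing(root, marker="BEGIN CHUNK"):
    """Yield paths of files under root whose contents include marker."""
    needle = marker.encode()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as handle:
                    if any(needle in line for line in handle):
                        yield path
            except OSError:
                # Unreadable files (permissions, vanished temp files) are skipped.
                continue
```

Calling `list(find_files_containing(toil_dir))` with your own toil directory gives the candidate raw GTF files to inspect.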

One last possibility is your augustus version. Did you run the test data with the --augustus-cgp flag set, and it didn't fail?

fbemm commented 6 years ago

I turned off the UTR option and I am using the correct model for Augustus. CAT and Augustus without CGP run fine and the results look good. Going to do the test run again but I am using the latest Augustus as far as I know.

ifiddes commented 6 years ago

Let me know whether the test worked. I can try adding a try/except clause around the command to at least get an idea of what is happening. How long do the CGP jobs run for?

fbemm commented 6 years ago

Runtime is pretty short. Can CGP actually handle star-like trees? Like (KBS-Mac-74:1,Cvi-0:1,TAIR10:1,Ler-0:1,Col-0:1,Ty-1:1,Can-0:1)Anc0; for example?

ifiddes commented 6 years ago

That's actually a good question. I know Stefanie modified homGeneMapping to handle them by artificially binarizing internally, but I have no idea if that works for CGP. I can try to test it. Let's try to get a handle on what the output is for now. I just made a new branch that will wrap that call and write to the log both the contents of the FOFN and the path of the raw output. It will be buried in the toil folder, but at least we can start to take a look at what it is actually saying.
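
As an illustration of what binarizing a star tree can mean, the sketch below resolves a multifurcating root into a chain of zero-length internal branches, so leaf-to-leaf distances stay the same. This is one generic approach, not taken from homGeneMapping, whose internal method is not described in this thread; `binarize_star` is a hypothetical helper.

```python
def binarize_star(leaves, root_name=""):
    """Resolve a star tree into a binary 'caterpillar' Newick string.

    leaves: list of 'name:length' strings, e.g. ['A:1', 'B:1', 'C:1'].
    New internal branches get length 0, so path lengths between leaves
    are unchanged.
    """
    if len(leaves) < 2:
        raise ValueError("need at least two leaves")
    subtree = "({},{})".format(leaves[0], leaves[1])
    for leaf in leaves[2:]:
        # The subtree built so far becomes one child, on a zero-length branch.
        subtree = "({}:0,{})".format(subtree, leaf)
    return subtree + root_name + ";"


print(binarize_star(["TAIR10:1", "Ler-0:1", "Col-0:1", "Cvi-0:1"], "Anc0"))
```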

fbemm commented 6 years ago

Restarted everything. In parallel I tested things with a simple tree (A,B),Anc0 and CGP is still crashing, so most likely it is not the star-like tree. I am mostly offline next week, but I will let you know when I have found the FOFN and the input for CGP. Thanks!

fbemm commented 6 years ago

It's going slowly. Log writing works but the workDir is empty.

RuntimeError: joingenes failed. gtf_fofn_contents: ['/ebio/abt6_projects9/sixref_genomes/data/genome/6_annotation/cat/Ler-0/toil/toil-21f9b15e-6635-431b-b14d-3f8787a72a78/tmpir7SPZ/ad5ff89e-cf86-4377-a623-a8988d96e206/tmpNKf1oL.tmp\n',
'/ebio/abt6_projects9/sixref_genomes/data/genome/6_annotation/cat/Ler-0/toil/toil-21f9b15e-6635-431b-b14d-3f8787a72a78/tmpir7SPZ/ad5ff89e-cf86-4377-a623-a8988d96e206/tmpUCHwqH.tmp\n',
'/ebio/abt6_projects9/sixref_genomes/data/genome/6_annotation/cat/Ler-0/toil/toil-21f9b15e-6635-431b-b14d-3f8787a72a78/tmpir7SPZ/ad5ff89e-cf86-4377-a623-a8988d96e206/tmp7euCgq.tmp\n',

fbemm commented 6 years ago

INFO:toil:Running Toil version 3.8.0-4c83830e4f42594d995e01ccc07b47396b88c9e7.
WARNING:toil.resource:Can't find resource for leader path '/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit', name='cat.augustus_cgp', fromVirtualEnv=False)
INFO:toil.fileStore:Starting job ('join_genes' o/t/job_r89aI) with ID (7b40d90f32943401e2006b8e9b8a9c271fa238d1).
WARNING:toil.resource:Can't find resource for leader path '/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat'
WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit', name='cat.augustus_cgp', fromVirtualEnv=False)
Traceback (most recent call last):
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/worker.py", line 340, in main
    job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1289, in _runner
    returnValues = self._run(jobGraph, fileStore)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1234, in _run
    return self.run(fileStore)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/env/local/lib/python2.7/site-packages/toil/job.py", line 1406, in run
    rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
  File "/ebio/abt6_projects9/abt6_software/bin/Comparative-Annotation-Toolkit/cat/augustus_cgp.py", line 307, in join_genes
    raw_gtf_file))

fbemm commented 6 years ago

Test data works btw.

fbemm commented 6 years ago

I am manually running CGP now (augustus --species=arabidopsis --treefile=Ler-0.nwk --alnfile=Ler-0.maf --speciesfilenames=Ler-0.table). It seems that it doesn't like a dash in the species ID (e.g., Ler-0). Not sure that is the problem, but at least it runs with Ler0!

ifiddes commented 6 years ago

Good to know. I will ask Stefanie and Mario about that. I can add validation checks. Unfortunately, there is no current way to change species names in a HAL file, although it is a feature I have requested before.
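
A validation check of the kind mentioned above might look like the following sketch. It assumes, based only on the observation in this thread, that names made of letters, digits, and underscores are safe; the exact rules Augustus CGP enforces are not confirmed here. Both helper names are hypothetical, and quoted Newick labels are not handled.

```python
import re

# Assumption: only letters, digits, and underscores are safe in species names.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def leaf_names(newick):
    """Extract leaf labels from a Newick string (labels follow '(' or ',')."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def invalid_species_names(newick):
    """Return leaf names containing characters outside the assumed safe set."""
    return [name for name in leaf_names(newick) if not SAFE_NAME.match(name)]


tree = "(KBS-Mac-74:1,Cvi-0:1,TAIR10:1,Ler-0:1,Col-0:1,Ty-1:1,Can-0:1)Anc0;"
print(invalid_species_names(tree))
```

On the tree from this thread, every leaf except TAIR10 is flagged, which matches the names that contain a dash.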

fbemm commented 6 years ago

Yeah, I just restarted pCactus ... AugCGP ran through with a manually recoded MAF. I will let you know when CAT is done. If I remember correctly, species IDs that are just numbers do not work well either.
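
For reference, recoding species names in a MAF can be sketched as below. This is one possible way to do it, not necessarily what was done here. It assumes the source field is the second whitespace-separated column and is written as 'species.sequence' (or just 'species'); `rename_maf_species` is a hypothetical helper.

```python
def rename_maf_species(lines, old, new):
    """Rename the species prefix of the source field in MAF lines.

    Only line types that carry a source field in column 2 are touched
    ('s', 'i', 'e', 'q'); everything else is passed through unchanged.
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] in ("s", "i", "e", "q"):
            src = fields[1]
            if src == old or src.startswith(old + "."):
                # Replace only the first occurrence, which is the source field.
                line = line.replace(src, new + src[len(old):], 1)
        yield line


maf = [
    "a score=0\n",
    "s Ler-0.Chr3 10 5 + 100 ACGTA\n",
    "s TAIR10.Chr3 10 5 + 100 ACGTA\n",
]
print("".join(rename_maf_species(maf, "Ler-0", "Ler0")), end="")
```

The tree file and the species table would need the same renaming so that all three inputs agree.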

ifiddes commented 6 years ago

Great! One thing to be careful about when running augCGP by hand as you did above: the performance can be quite poor with the default model. If you want to train the model by hand, you can look at the augustus_cgp.py file to see how it is done. It involves passing the --trainFeatureFile flag a file that contains a subset of existing annotations on the reference genome, ~5k-10k exons. You set --param_outfile to where you want the parameters to go. Once that is done, you can use that file in a normal CGP prediction.

ifiddes commented 6 years ago

Also note that the CGP model is separate from the augustus species model.

fbemm commented 6 years ago

Correct me if I am wrong. The current Augustus version already comes with a model for A. thaliana.

ifiddes commented 6 years ago

Ah, you are right, it does! log_reg_parameters_arabidopsis.cfg. In the new format, too (they redid the parameter set a few months ago). Never mind then, carry on :)

fbemm commented 6 years ago

Fixed!

jwli-code commented 5 months ago

I am manually running CGP now (augustus --species=arabidopsis --treefile=Ler-0.nwk --alnfile=Ler-0.maf --speciesfilenames=Ler-0.table). It seems as it doesn't like dash in the species id (e.g., Ler-0). Not sure that is the problem but at least it runs with Ler0!

Hello, I was wondering how to set species=arabidopsis when running CAT, because I also want to use the Arabidopsis model.