ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0

mis-formatted hints file #98

Closed nbedelman closed 6 years ago

nbedelman commented 6 years ago

I'm running into a seemingly minor issue with the first augustus step in the pipeline. I get the error below:

2018-06-10 11:15:24,717 - toil.leader - WARNING - 5/v/job9c1JLI    ---TOIL WORKER OUTPUT LOG---
2018-06-10 11:15:24,717 - toil.leader - WARNING - 5/v/job9c1JLI    INFO:toil:Running Toil version 3.13.0a1.
2018-06-10 11:15:24,718 - toil.leader - WARNING - 5/v/job9c1JLI    WARNING:toil.resource:Can't find resource for leader path '/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat'
2018-06-10 11:15:24,718 - toil.leader - WARNING - 5/v/job9c1JLI    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit', name='cat.augustus', fromVirtualEnv=False)
2018-06-10 11:15:24,718 - toil.leader - WARNING - 5/v/job9c1JLI    WARNING:toil.resource:Can't find resource for leader path '/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat'
2018-06-10 11:15:24,718 - toil.leader - WARNING - 5/v/job9c1JLI    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit', name='cat.augustus', fromVirtualEnv=False)
2018-06-10 11:15:24,719 - toil.leader - WARNING - 5/v/job9c1JLI    Traceback (most recent call last):
2018-06-10 11:15:24,719 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/worker.py", line 316, in main
2018-06-10 11:15:24,719 - toil.leader - WARNING - 5/v/job9c1JLI        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
2018-06-10 11:15:24,719 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1318, in _runner
2018-06-10 11:15:24,720 - toil.leader - WARNING - 5/v/job9c1JLI        returnValues = self._run(jobGraph, fileStore)
2018-06-10 11:15:24,720 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1263, in _run
2018-06-10 11:15:24,736 - toil.leader - WARNING - 5/v/job9c1JLI        return self.run(fileStore)
2018-06-10 11:15:24,737 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1447, in run
2018-06-10 11:15:24,737 - toil.leader - WARNING - 5/v/job9c1JLI        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
2018-06-10 11:15:24,738 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 141, in run_augustus_chunk
2018-06-10 11:15:24,738 - toil.leader - WARNING - 5/v/job9c1JLI        args.utr)
2018-06-10 11:15:24,738 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 170, in run_augustus
2018-06-10 11:15:24,738 - toil.leader - WARNING - 5/v/job9c1JLI        aug_output = tools.procOps.call_proc_lines(cmd)
2018-06-10 11:15:24,739 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/tools/procOps.py", line 26, in call_proc_lines
2018-06-10 11:15:24,739 - toil.leader - WARNING - 5/v/job9c1JLI        out = call_proc(cmd)
2018-06-10 11:15:24,739 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/tools/procOps.py", line 16, in call_proc
2018-06-10 11:15:24,739 - toil.leader - WARNING - 5/v/job9c1JLI        pl.wait()
2018-06-10 11:15:24,739 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1127, in wait
2018-06-10 11:15:24,740 - toil.leader - WARNING - 5/v/job9c1JLI        self.raiseIfExcept()
2018-06-10 11:15:24,740 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/tools/pipeline.py", line 1085, in raiseIfExcept
2018-06-10 11:15:24,740 - toil.leader - WARNING - 5/v/job9c1JLI        p.raiseIfExcept()
2018-06-10 11:15:24,740 - toil.leader - WARNING - 5/v/job9c1JLI      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/tools/pipeline.py", line 749, in raiseIfExcept
2018-06-10 11:15:24,741 - toil.leader - WARNING - 5/v/job9c1JLI        raise self.exceptInfo[0], self.exceptInfo[1], self.exceptInfo[2]
2018-06-10 11:15:24,741 - toil.leader - WARNING - 5/v/job9c1JLI    ProcException: process exited 1: augustus /n/regal/mallet_lab/edelman/18Genomes/results/annotation/heliconiiniAlignment/CAT_annotationsOnly/toil-58bc7781-2b91-42cb-a716-25e18785266b-6b90140c-c47d-4883-96bb-c63ecfe4fd86/tmpA5eHvY/12d86d33-cdc8-4524-932c-39cad14b7734/holy2a08306.rc.fas.harvard.edu.38371.5261701051.tmp --predictionStart=-0 --predictionEnd=-0 --extrinsicCfgFile=/n/regal/mallet_lab/edelman/18Genomes/results/annotation/heliconiiniAlignment/CAT_annotationsOnly/toil-58bc7781-2b91-42cb-a716-25e18785266b-6b90140c-c47d-4883-96bb-c63ecfe4fd86/tmpA5eHvY/12d86d33-cdc8-4524-932c-39cad14b7734/tmpO9nNov.tmp --hintsfile=/n/regal/mallet_lab/edelman/18Genomes/results/annotation/heliconiiniAlignment/CAT_annotationsOnly/toil-58bc7781-2b91-42cb-a716-25e18785266b-6b90140c-c47d-4883-96bb-c63ecfe4fd86/tmpA5eHvY/12d86d33-cdc8-4524-932c-39cad14b7734/holy2a08306.rc.fas.harvard.edu.38371.4639986974.tmp --UTR=0 --alternatives-from-evidence=0 --species=heliconius_melpomene1 --allow_hinted_splicesites=atac --protein=0 --softmasking=1

And when I run the augustus command listed, I get the following output:

# This output was generated with AUGUSTUS (version 3.3).
# AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
# O. Keller, S. König, L. Gerischer and L. Romoth.
# Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
# Using native and syntenically mapped cDNA alignments to improve de novo gene finding
# Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
# Sources of extrinsic information: M RM E W T PB 
# Setting individual_liability for T.
# Setting acceptor splice site local malus: 0.1
# Setting donor splice site local malus: 0.1
# Setting exon local malus: 0.98
# Setting CDSpart local malus: 0.98
# Setting UTRpart local malus: 0.98
# reading in the file /n/regal/mallet_lab/edelman/18Genomes/results/annotation/heliconiiniAlignment/CAT_annotationsOnly/toil-58bc7781-2b91-42cb-a716-25e18785266b-6b90140c-c47d-4883-96bb-c63ecfe4fd86/tmpA5eHvY/12d86d33-cdc8-4524-932c-39cad14b7734/holy2a08306.rc.fas.harvard.edu.38371.4639986974.tmp ...
Error in hint line: line 1
Line not tab separated.
Maybe you used spaces instead of tabulators?
FeatureCollection::esource: invalid source key: ?

augustus: ERROR
    FeatureCollection::esource: invalid source key: ?

I took a look at the hintsFile, and the first line just says 'line 1'. If I remove that line and run the command again it works fine. Do you know why it's adding that line, or if there's an easy way to get rid of it? Thanks!

Nate

PS congrats on the Science paper!
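
A quick way to spot a stray line like the 'line 1' described above is to scan the hints file for records that are not tab-separated. The sketch below is not part of CAT or AUGUSTUS; it assumes the hints file is GFF-like with 9 tab-separated columns, and the function name and sample record are made up for illustration.

```python
def find_malformed_hint_lines(lines, expected_columns=9):
    """Return (line_number, text) pairs for lines that do not split
    into the expected number of tab-separated columns.

    Assumption: hints are GFF-like records with 9 columns.
    """
    bad = []
    for number, raw in enumerate(lines, start=1):
        text = raw.rstrip("\n")
        if not text:
            continue  # ignore blank lines
        if len(text.split("\t")) != expected_columns:
            bad.append((number, text))
    return bad


# Hypothetical content: one stray line followed by one well-formed record.
sample = [
    "line 1\n",
    "chr1\ta2h\tintron\t100\t200\t0\t+\t.\tgrp=tx1;src=M\n",
]
print(find_malformed_hint_lines(sample))  # prints [(1, 'line 1')]
```

To check a real file, pass an open file handle instead of the sample list, for example `find_malformed_hint_lines(open(path))`.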

ifiddes commented 6 years ago

Thanks!

This is really weird. My assumption at this point is that some update to one of the tools in the hints step changed its behavior somehow. You have run this successfully before, correct? Did you update anything, especially in the AUGUSTUS repository or in Kent tools?

Can you re-run the hints stage with --cleanWorkDir=never turned on? I am confused right now because I am not sure what tool is actually providing hints of the type t2h -- that is not one of the hint types (a2h comes from annotation on the reference, b2h comes from intron hints via blat2hints.pl, and w2h comes from wig2hints.pl converting BAM coverage to hints). Which hint types are you providing (BAM, INTRON_BAM, PROTEIN_FASTA, ISO_SEQ_BAM)?

ifiddes commented 6 years ago

Looking at the grp key, I am inferring you are passing previously assembled AUGUSTUS transcripts to the protein hints step. Is that correct?

This may be a result of behavior change of pslCheck... I will look into it. I really should swap genewise or something of that sort in for BLAT in that mode.

If you trust your transcripts, I would suggest submitting them to CAT as ANNOTATION instead of PROTEIN_FASTA. This will require you to have them in a GFF3 format that gff3ToGenePred can parse, but you don't actually need all of the extra biotype tags if it isn't the reference genome.

nbedelman commented 6 years ago

I have run the test set successfully, but not this data. However, it is possible that I'm using a different version of some of the Kent binaries, since our cluster just went through an upgrade and I had to update a number of paths. I'm actually only using the reference GFF3 file as an ANNOTATION - not using any BAMs at this point!

ifiddes commented 6 years ago

So HMEL031810g1.t1-0 is a transMap transcript then? Still confused on how it has the tag t2h and src=T in the GTF... those t's should be a's. But all of this may be a red herring. If you can run it with --cleanWorkDir=never, you can grep for line 1 in the toil folder for this module.

nbedelman commented 6 years ago

OK, so I updated my version of pslCheck, deleted the hints_database, transMap, hgm, toil/hints_db, and toil/augustus directories, and restarted using --cleanWorkDir=never. The only file in my config is a gff3 file under ANNOTATION for the reference genome. Now, I get the error:

2018-06-13 10:27:54,449 - toil.leader - WARNING - 7/D/jobis_Sf3    ---TOIL WORKER OUTPUT LOG---
2018-06-13 10:27:54,450 - toil.leader - WARNING - 7/D/jobis_Sf3    INFO:toil:Running Toil version 3.13.0a1.
2018-06-13 10:27:54,450 - toil.leader - WARNING - 7/D/jobis_Sf3    WARNING:toil.resource:Can't find resource for leader path '/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat'
2018-06-13 10:27:54,450 - toil.leader - WARNING - 7/D/jobis_Sf3    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit', name='cat.augustus', fromVirtualEnv=False)
2018-06-13 10:27:54,451 - toil.leader - WARNING - 7/D/jobis_Sf3    WARNING:toil.resource:Can't find resource for leader path '/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat'
2018-06-13 10:27:54,451 - toil.leader - WARNING - 7/D/jobis_Sf3    WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit', name='cat.augustus', fromVirtualEnv=False)
2018-06-13 10:27:54,451 - toil.leader - WARNING - 7/D/jobis_Sf3    Traceback (most recent call last):
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/worker.py", line 316, in main
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3        job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1318, in _runner
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3        returnValues = self._run(jobGraph, fileStore)
2018-06-13 10:27:54,453 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1263, in _run
2018-06-13 10:27:54,453 - toil.leader - WARNING - 7/D/jobis_Sf3        return self.run(fileStore)
2018-06-13 10:27:54,453 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1447, in run
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3        rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 141, in run_augustus_chunk
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3        args.utr)
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 171, in run_augustus
2018-06-13 10:27:54,455 - toil.leader - WARNING - 7/D/jobis_Sf3        transcript = munge_augustus_output(aug_output, mode, tm_tx)
2018-06-13 10:27:54,455 - toil.leader - WARNING - 7/D/jobis_Sf3      File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 213, in munge_augustus_output
2018-06-13 10:27:54,455 - toil.leader - WARNING - 7/D/jobis_Sf3        for chrom, source, feature, start, stop, score, strand, frame, attributes in tx_lines:
2018-06-13 10:27:54,456 - toil.leader - WARNING - 7/D/jobis_Sf3    ValueError: need more than 1 value to unpack 

I looked in the hints_database directory, and there are two non-empty files: one hints.gff for the reference, where every feature is a2h, and a binary file, hints.db. There are empty hints.gff files for all other species, but I think that makes sense because I didn't provide any extrinsic files for them.
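
For anyone repeating this check, the hint types present in a hints.gff can be tallied with a few lines of Python. This is only a sketch: it assumes the hint type (a2h, b2h, w2h) sits in the second tab-separated column of a 9-column record, and the function name and sample records are hypothetical.

```python
from collections import Counter


def count_hint_sources(lines):
    """Count the values of column 2 across well-formed 9-column records.

    Assumption: the hint type (e.g. a2h) is stored in the second column.
    """
    counts = Counter()
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[1]] += 1
    return counts


# Hypothetical records, not taken from the real hints.gff.
sample = [
    "chr1\ta2h\tintron\t100\t200\t0\t+\t.\tgrp=tx1;src=M\n",
    "chr1\ta2h\tCDS\t201\t300\t0\t+\t0\tgrp=tx1;src=M\n",
    "chr2\tw2h\tep\t10\t50\t3\t.\t.\tsrc=W\n",
]
print(count_hint_sources(sample))
```

A file where every feature is a2h would produce a single key in the result.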

ifiddes commented 6 years ago

It made it past the hints stage to the augustus stage, which is good. However, now it appears that something is going wrong parsing the augustus output. My guess is that this is yet another version mismatch issue, this time with augustus. Did you run it with --cleanWorkDir=never? If so, there should be a merged GFF file somewhere in that toil directory. Unfortunately, it will not be called something nice like merged.gff; it will have a random identifier (I should fix this). I think it should be in the subfolder with a path like 7/D/jobis_Sf3. There will be two such files, assuming you provided RNA-seq data (it will have run TM + TMR). Can you post what that file looks like? My guess is that there are comment lines in it that shouldn't be. What is your augustus version?
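
To illustrate the guess above: unpacking a line into nine names fails when the line has no tabs, which is what a stray comment or message line would cause (Python 2 reports "need more than 1 value to unpack"; Python 3 words it as "not enough values to unpack"). The sketch below is not CAT's munge_augustus_output; it is a hypothetical parser that skips '#' lines and reports the offending text instead of failing inside the loop.

```python
def parse_gtf_records(lines):
    """Split GTF-like lines into 9-field tuples.

    Lines starting with '#' and blank lines are skipped. Any other line
    that does not have 9 tab-separated fields raises a ValueError that
    includes the line itself, which makes the failure easier to diagnose.
    """
    records = []
    for raw in lines:
        text = raw.rstrip("\n")
        if not text or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) != 9:
            raise ValueError(
                "expected 9 tab-separated fields, got %d: %r" % (len(fields), text)
            )
        records.append(tuple(fields))
    return records


# Hypothetical output: a comment line followed by one exon record.
sample = [
    "# This output was generated with AUGUSTUS (version 3.3).\n",
    'chr1\tAUGUSTUS\texon\t1\t10\t.\t+\t.\ttranscript_id "t1";\n',
]
for chrom, source, feature, start, stop, score, strand, frame, attributes in parse_gtf_records(sample):
    print(chrom, feature, start, stop)
```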

nbedelman commented 6 years ago

In this case, I didn't provide RNA-seq data, so I think it should only run TM. There was a directory named like you said, but the only files it has are a job file and a tmp file that just contains the error log. I'm using Augustus 3.3. Attached: job.txt and tmpbgTKX5.tmp.txt

ifiddes commented 6 years ago

Weird, I am using 3.2.2. Can you re-run the augustus step with --cleanWorkDir=never? This will let us see what actually gets put into that failing job.

nbedelman commented 6 years ago

Sorry it took so long to reply here, but I went back to make sure the test dataset completed successfully with my current environmental setup, which it did. I kept the chaining, reference, and genome_files directories in my work directory, deleted everything else, and re-started CAT with --cleanWorkDir=never. Still, it fails with the same errors and there's no additional file in the job directory! Could it be that I actually need to provide RNA-seq data? I thought that without the RNA-seq, it would just run augustus TM, and should be OK. Is this true? Thanks!

ifiddes commented 6 years ago

Sorry, I should have been more clear (really, I forgot how this all works). Toil has two directories that it works from: the jobStore, which ends up in the toil/ subdir, and the fileStore, which is placed in $TMPDIR unless --workDir is set, in which case the work happens there. Toil uses the jobStore only to store the job graph, so nothing should change in that folder. So, what I should have told you is to re-run with --cleanWorkDir=never AND to set --workDir to some folder that exists. Sorry about that!

It should be just fine to run without RNA-seq. I am still not sure what is going on here, but I suspect it is a version mismatch. Joel told me he is going to try to get an intern over the summer to package all the dependencies in a Docker container and generally create a fixed set so that these problems happen less.
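
Once --workDir points at a known folder, the retained files can be located by walking the most recently modified toil-* directory. The following is a rough sketch using only the Python standard library; it assumes the working directories are named with a toil- prefix, as in the paths earlier in this thread, and the helper names are made up.

```python
import os


def newest_toil_dirs(work_dir, prefix="toil-"):
    """Return toil-* subdirectories of work_dir, newest first by mtime.

    Assumption: Toil names its per-run working directories 'toil-<ids>'.
    """
    candidates = [
        os.path.join(work_dir, name)
        for name in os.listdir(work_dir)
        if name.startswith(prefix) and os.path.isdir(os.path.join(work_dir, name))
    ]
    return sorted(candidates, key=os.path.getmtime, reverse=True)


def list_files(root):
    """Return every file path below root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            found.append(os.path.join(dirpath, filename))
    return sorted(found)


if __name__ == "__main__":
    import sys

    # Usage: python find_toil_files.py /path/to/workDir
    if len(sys.argv) > 1:
        dirs = newest_toil_dirs(sys.argv[1])
        if dirs:
            for path in list_files(dirs[0]):
                print(path)
```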

nbedelman commented 6 years ago

I see - I did set the work directory to an existing directory, but it is full of directories with names like toil-3c189b36-ecfa-4613-a63a-f307f3c534f1-cc454c2d-59d5-4a61-9d3e-1dbcaca24f7b, so I didn't think to look more deeply in there! I pulled out the most recent directory in that workDir, and dug down until I found a directory with the following files: tmpbTiFyv.tmp.txt, genome.fasta.gdx.txt, holy2a05108.rc.fas.harvard.edu.49826.4473637497.tmp.txt, holy2a05108.rc.fas.harvard.edu.49826.1049369697.tmp.txt, genome.fasta.flat.txt, and one called genome.fasta (too large to attach here). Are these the files we need?

ifiddes commented 6 years ago

Yes, these look like the relevant files. genome.fasta is not necessary, it is just the full reference genome. I am taking a look at this now and will let you know if I figure anything out or need anything else.

ifiddes-10x-zz commented 6 years ago

OK, I found the bug. It is related to the amount of output augustus produces. Sorry about that. Please try the new fix!

nbedelman commented 6 years ago

OK, it seems to be working now, thanks!