Closed nbedelman closed 6 years ago
Thanks!
This is really weird. My assumption at this point is that some update to one of the tools in the hints step changed its behavior somehow. You have run this successfully before, correct? Did you update anything, especially in the AUGUSTUS repository or in Kent tools?
Can you re-run the hints stage with --cleanWorkDir=never turned on? I am confused right now because I am not sure what tool is actually providing hints of the type t2h -- that is not one of the hint types (a2h comes from annotation on the reference, b2h comes from intron hints via blat2hints.pl, and w2h comes from wig2hints.pl converting BAM coverage to hints). Which hint types are you providing (BAM, INTRON_BAM, PROTEIN_FASTA, ISO_SEQ_BAM)?
Looking at the grp key, I am inferring you are passing previously assembled AUGUSTUS transcripts to the protein hints step. Is that correct?
This may be a result of a behavior change in pslCheck... I will look into it. I really should swap genewise or something of that sort in for BLAT in that mode.
If you trust your transcripts, I would suggest submitting them to CAT as ANNOTATION instead of PROTEIN_FASTA. This will require you to have them in a GFF3 format that gff3ToGenePred can parse, but you don't actually need all of the extra biotype tags if it isn't the reference genome.
I have run the test set successfully, but not this data. However, it is possible that I'm using a different version of some of the Kent binaries, since our cluster just went through an upgrade and I had to update a number of paths. I'm actually only using the reference GFF3 file as an ANNOTATION - not using any BAMs at this point!
So HMEL031810g1.t1-0 is a transMap transcript then? Still confused on how it has the tag t2h and src=T in the GTF... those t's should be a's. But all of this may be a red herring. If you can run it with --cleanWorkDir=never you can grep for line 1 in the toil folder for this module.
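For anyone who would rather not hand-craft the grep, the search suggested above can be sketched in Python. This is only an illustration: the function name `find_files_with_line` and the idea of matching a whole line equal to `line 1` are assumptions, not part of CAT or Toil.

```python
import os


def find_files_with_line(root, needle="line 1"):
    """Yield (path, line_number) for every line under root that equals needle.

    Hypothetical helper: walks the directory tree and compares each stripped
    line against the stray text reported in the hints file.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" lets binary files be scanned without crashing.
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if line.strip() == needle:
                            yield path, number
            except OSError:
                continue  # unreadable file; skip it
```

Pointing it at the toil folder for the hints module (for example `list(find_files_with_line("toil/hints_db"))`, where the path is whatever your run actually uses) should show which intermediate file first picks up the stray line.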
OK, so I updated my version of pslCheck, deleted the hints_database, transMap, hgm, toil/hints_db, and toil/augustus directories, and re-started using --cleanWorkDir=never. The only file in my config is a gff3 file under ANNOTATION for the reference genome. Now, I get the error:
2018-06-13 10:27:54,449 - toil.leader - WARNING - 7/D/jobis_Sf3 ---TOIL WORKER OUTPUT LOG---
2018-06-13 10:27:54,450 - toil.leader - WARNING - 7/D/jobis_Sf3 INFO:toil:Running Toil version 3.13.0a1.
2018-06-13 10:27:54,450 - toil.leader - WARNING - 7/D/jobis_Sf3 WARNING:toil.resource:Can't find resource for leader path '/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat'
2018-06-13 10:27:54,450 - toil.leader - WARNING - 7/D/jobis_Sf3 WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit', name='cat.augustus', fromVirtualEnv=False)
2018-06-13 10:27:54,451 - toil.leader - WARNING - 7/D/jobis_Sf3 WARNING:toil.resource:Can't find resource for leader path '/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat'
2018-06-13 10:27:54,451 - toil.leader - WARNING - 7/D/jobis_Sf3 WARNING:toil.resource:Can't localize module ModuleDescriptor(dirPath='/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit', name='cat.augustus', fromVirtualEnv=False)
2018-06-13 10:27:54,451 - toil.leader - WARNING - 7/D/jobis_Sf3 Traceback (most recent call last):
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/worker.py", line 316, in main
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3 job._runner(jobGraph=jobGraph, jobStore=jobStore, fileStore=fileStore)
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1318, in _runner
2018-06-13 10:27:54,452 - toil.leader - WARNING - 7/D/jobis_Sf3 returnValues = self._run(jobGraph, fileStore)
2018-06-13 10:27:54,453 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1263, in _run
2018-06-13 10:27:54,453 - toil.leader - WARNING - 7/D/jobis_Sf3 return self.run(fileStore)
2018-06-13 10:27:54,453 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/home09/nedelman/.conda/envs/CAT/lib/python2.7/site-packages/toil/job.py", line 1447, in run
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3 rValue = userFunction(*((self,) + tuple(self._args)), **self._kwargs)
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 141, in run_augustus_chunk
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3 args.utr)
2018-06-13 10:27:54,454 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 171, in run_augustus
2018-06-13 10:27:54,455 - toil.leader - WARNING - 7/D/jobis_Sf3 transcript = munge_augustus_output(aug_output, mode, tm_tx)
2018-06-13 10:27:54,455 - toil.leader - WARNING - 7/D/jobis_Sf3 File "/n/mallet_lab/edelman/software/Comparative-Annotation-Toolkit/cat/augustus.py", line 213, in munge_augustus_output
2018-06-13 10:27:54,455 - toil.leader - WARNING - 7/D/jobis_Sf3 for chrom, source, feature, start, stop, score, strand, frame, attributes in tx_lines:
2018-06-13 10:27:54,456 - toil.leader - WARNING - 7/D/jobis_Sf3 ValueError: need more than 1 value to unpack
I looked in the hints_database directory, and there are two non-empty files: one hints.gff for the reference, where every feature is a2h, and a binary file hints.db. There are empty hints.gff files for all other species, but I think that makes sense because I didn't provide any extrinsic files for them.
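A check like "every feature is a2h" can be automated with a short tally. The sketch below assumes the hints file is nine-column, tab-separated GFF with the hint tag (a2h, b2h, w2h) in the second column and a `src=` entry in the semicolon-separated attributes; that layout is inferred from this thread, and `tally_hint_sources` is a hypothetical name.

```python
from collections import Counter


def tally_hint_sources(lines):
    """Count column-2 tags and src= values; collect lines that are not 9-column GFF."""
    tags = Counter()
    srcs = Counter()
    malformed = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no hint
        fields = line.split("\t")
        if len(fields) != 9:
            malformed.append((number, line))  # e.g. a stray "line 1"
            continue
        tags[fields[1]] += 1
        # Attributes are assumed to look like "grp=tx1;src=A;pri=4".
        for item in fields[8].split(";"):
            key, _, value = item.strip().partition("=")
            if key == "src":
                srcs[value] += 1
    return tags, srcs, malformed
```

Calling it as `tally_hint_sources(open("hints.gff"))` returns the two counters plus any malformed lines, so an unexpected tag such as t2h or a non-GFF line stands out immediately.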
It made it past the hints stage to the augustus stage, which is good. However, now it appears that something is going wrong parsing the augustus output. My guess is that this is yet another version mismatch issue, this time with augustus. Did you run it with --cleanWorkDir=never? If so, there should be a merged GFF file somewhere in that toil directory (it will not be called something nice like merged.gff; it will be a random identifier, unfortunately. I should fix this). I think it should be in the subfolder with a path like 7/D/jobis_Sf3. There will be two such files, assuming you provided RNA-seq data (it will have run TM + TMR). Can you post what that file looks like? My guess is that there are comment lines in it that shouldn't be there. What is your augustus version?
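The ValueError in the traceback is what Python raises when a loop unpacks nine names from an item that holds only one value, which is consistent with a comment or other non-GTF line reaching the parser. The sketch below is not CAT's munge_augustus_output; it is a hypothetical, defensive way to split Augustus GTF text that skips comments and sets aside anything without nine tab-separated fields.

```python
def parse_gtf_lines(text):
    """Split GTF text into 9-field records, setting aside lines that do not fit.

    Returns (records, skipped): records is a list of 9-item lists, skipped holds
    the non-comment lines that did not have exactly nine tab-separated fields.
    """
    records = []
    skipped = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank or comment line
        fields = line.split("\t")
        if len(fields) != 9:
            skipped.append(line)  # would break a 9-name tuple unpack
            continue
        records.append(fields)
    return records, skipped
```

With records filtered this way, a loop such as `for chrom, source, feature, start, stop, score, strand, frame, attributes in records:` cannot hit the unpacking error, and `skipped` shows which lines were the problem.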
In this case, I didn't provide RNA-seq data, so I think it should only run TM. There was a directory named like you said, but the only files it has are a job file and a tmp file that just contains the error log. I'm using Augustus 3.3. job.txt tmpbgTKX5.tmp.txt
Weird, I am using 3.2.2. Can you re-run the augustus step with --cleanWorkDir=never? This will let us see what actually gets put into that failing job.
Sorry it took so long to reply here, but I went back to make sure the test dataset completed successfully with my current environmental setup, which it did. I kept the chaining, reference, and genome_files directories in my work directory, deleted everything else, and re-started CAT with --cleanWorkDir=never. Still, it fails with the same errors and there's no additional file in the job directory! Could it be that I actually need to provide RNA-seq data? I thought that without the RNA-seq, it would just run augustus TM, and should be OK. Is this true? Thanks!
Sorry, I should have been more clear (really, I forgot how this all works).
Toil has two directories that it works from: the jobStore, which ends up in the toil/ subdir, and the fileStore, which gets placed in $TMPDIR unless --workDir is set to something, in which case the work will happen there. Toil uses the jobStore only to store the job graph, so nothing should change in that folder. So, what I should have told you is to re-run with --cleanWorkDir=never AND to set --workDir to some folder that exists. Sorry about that!
It should be just fine to run without RNA-seq. I am still not sure what is going on here, but I suspect it is a version mismatch. Joel told me he is going to try to get an intern over the summer to package all the dependencies in a docker container and generally create a fixed set so that these problems happen less.
-- Ian Fiddes, PhD
I see - I did set the work directory to an existing directory, but it is full of directories with names like toil-3c189b36-ecfa-4613-a63a-f307f3c534f1-cc454c2d-59d5-4a61-9d3e-1dbcaca24f7b, so I didn't think to look more deeply in there! I pulled out the most recent directory in that workDir, and dug down until I found a directory with the following files:
tmpbTiFyv.tmp.txt
genome.fasta.gdx.txt
holy2a05108.rc.fas.harvard.edu.49826.4473637497.tmp.txt
holy2a05108.rc.fas.harvard.edu.49826.1049369697.tmp.txt
genome.fasta.flat.txt
and one called genome.fasta (too large to attach here). Are these the files we need?
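Digging through the workDir by hand, as described above, can be sketched as a small script. It assumes only what this thread shows, namely that the per-run directories start with `toil-`; the helper names `newest_toil_dir` and `list_files` are hypothetical.

```python
import os


def newest_toil_dir(work_dir, prefix="toil-"):
    """Return the most recently modified directory in work_dir whose name starts with prefix."""
    candidates = [
        os.path.join(work_dir, name)
        for name in os.listdir(work_dir)
        if name.startswith(prefix) and os.path.isdir(os.path.join(work_dir, name))
    ]
    # None signals that no matching directory was found.
    return max(candidates, key=os.path.getmtime) if candidates else None


def list_files(root):
    """Return every file path under root, sorted, so leftover job files are easy to spot."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(root)
        for name in filenames
    )
```

Running `list_files(newest_toil_dir(work_dir))` on the folder passed to --workDir lists whatever the last job left behind, which is only useful when the run used --cleanWorkDir=never.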
Yes, these look like the relevant files. genome.fasta is not necessary, it is just the full reference genome. I am taking a look at this now and will let you know if I figure anything out or need anything else.
OK, I found the bug. It is related to the amount of output augustus produces. Sorry about that. Please try the new fix!
OK, it seems to be working now, thanks!
I'm running into a seemingly minor issue with the first augustus step in the pipeline. I get the error below:
And when I run the augustus command listed, I get the following output:
I took a look at the hintsFile, and the first line just says 'line 1'. If I remove that line and run the command again it works fine. Do you know why it's adding that line, or if there's an easy way to get rid of it? Thanks!
Nate
PS congrats on the Science paper!