Closed dengyuan7 closed 6 years ago
Hmmm, this is probably a bug not an actual problem with the requirements. Did you set any requirements? They include things like intron-rnaseq-support
and other support-level flags. In the default mode, these are all maximally permissive.
How many transcripts survived transMap filtering? If you do wc -l transMap/Nestor_notabilis*
what do you see? Are you running with RNA-seq, or just with transMap?
Hi Ifiddes,
I used the default options without setting any requirements, so I guess it's not the source of the error. I entered the protein information for each genome. There are also outputs of augustus:
3 transMap/Nestor_notabilis.filtered.gp 3 transMap/Nestor_notabilis.filtered.psl 3 transMap/Nestor_notabilis.gp 75 transMap/Nestor_notabilis.gtf 3 transMap/Nestor_notabilis.psl 4 transMap/Nipponia_nippon.filtered.gp 4 transMap/Nipponia_nippon.filtered.psl 4 transMap/Nipponia_nippon.gp 92 transMap/Nipponia_nippon.gtf 4 transMap/Nipponia_nippon.psl
3 augustus/Nestor_notabilis.augTM.gp 80 augustus/Nestor_notabilis.augTM.gtf 4 augustus/Nipponia_nippon.augTM.gp 102 augustus/Nipponia_nippon.augTM.gtf
I listed the lines of Nipponia_nippon as a comparison since it successfully got the consensus transcripts. The transcripts are so small because I only entered four genes.
Can you send me the contents of the transMap folder for these two genomes? I will try to see what happened.
I uploaded a package of transMap folder on Google driver: https://drive.google.com/open?id=1QPlF9mH3XdhNj_bufcrZk-OEvDh5VANE
I don't see any problem with these files. The coverage and identity is a bit lower in Nipponia_nippon
, but well within the range that CAT should be able to handle. Something more subtle is going on. I just created a new branch, consensus_bug
. Can you check out this branch and re-run the code? This is going to dump all of the data going into the consensus finding step to a pickle file in the consensus output directory that you can send to me so I can try to figure this out. Thanks!
Dear @ifiddes-10x ,
I finished the running of consensus_bug
, it failed at HgmDriverTask.
08-09 05:29:18 luigi-interface ERROR [pid 67036] Worker Worker(salt=375807751,workers=1,host=cngb-supermem-a14-1.cngb.sz.hpc, username=dengyuan, pid=67036) failed Task: HgmDriverTask for augTM
Traceback (most recent call last):
File "/hwfssz1/ST_DIVERSITY/PUB/USER/dengyuan/software/Python-2.7.13/lib/python2.7/site-packages/luigi/worker.py", line 203, in run
new_deps = self._run_get_new_deps()
File "/hwfssz1/ST_DIVERSITY/PUB/USER/dengyuan/software/Python-2.7.13/lib/python2.7/site-packages/luigi/worker.py", line 140, in _run_get_new_deps
task_gen = self.task.run()
File "/hwfssz5/ST_DIVERSITY/B10K/PUB/USER/dengyuan/consensus_bug/Comparative-Annotation-Toolkit/cat/__init__.py", line 1725, in run
df = parse_hgm_gtf(hgm_args.gtf_out_files[genome], genome)
File "/hwfssz5/ST_DIVERSITY/B10K/PUB/USER/dengyuan/consensus_bug/Comparative-Annotation-Toolkit/cat/hgm.py", line 227, in parse_hgm_gtf
with open(hgm_out) as infile:
IOError: [Errno 2] No such file or directory: '/hwfssz5/ST_DIVERSITY/B10K/USER/dengyuan/CAT/4_genes/cat_work-consensus_bug/hgm/augTM/Struthio_camelus.gtf'
It successfully generated the hgm tables for transMap files, and Struthio_camelus.augTM.gtf
exist in the directory cat_work-consensus_bug/augustus
. It's weird that there is no any error of running consensus_bug
for the test data.
Uhoh, I think this is the same problem as issue #102 -- something is wrong with homGeneMapping
that doesn't happen on the test dataset for a reason I don't know yet. I asked the reporter in that issue to provide me the input files, but he hasn't gotten back to me yet.
Since I see you are part of B10K, is the input to this a HAL file that UCSC / @joelarmstrong generated? Is there a reasonable way to get the input files to me so I can try to reproduce this? I think all I actually would need is the HAL and your input GFF3, so if Joel has the HAL already I just need your GFF3.
Also, did you upgrade your homGeneMapping
installation between the two runs? Possibly due to the fix for augustusCGP that was added to 3.3.1?
Seing the same issue now on my PRET dataset. I upgraded from augustus 3.3.1 to the latest GitHub commit during the run to fix #109. I could package the consensus input data and post it to Mario if needed. Would be great to get that fixed. F
If you can build a simple test set that triggers this to pass to Mario, that would be awesome. @joelarmstrong was able to construct one last time that made it much easier for them to fix the problem. Let me know if you need any help hacking the pipeline to construct such a set. Sorry for the headaches, hopefully as we fix these bugs we can eventually set up the Docker packaging such that this kind of thing doesn't happen anymore.
You can get this same Consider lowering requirements. Please see the manual.
issue when using NCBI GFF3s, I think--we recently ran into a GFF parsing issue with the rhino GFF. (Working on a fix, but as a kludge, we were able to change around the inner joins in the consensus code to get it to run properly.)
oh, I misread which original issue this is on. Sorry, I was thinking this was an augustus issue. If you are experiencing the lowering requirements message, then yes it is probably a problem with the GFF3 parser. Did you try on the latest master?
Joel, do you know what in the parser broke for rhino? Send it along to me and I can take a look. Is this newer than our discussions last week?
I was testing the fbemm branch that was supposed to fix the prediction of non-coding genes. Attempt to use master again?
Sorry, I forgot to mention that I made another attempt in master to re-work the GFF3 parser to handle the issue for a different annotation set. Before you go ahead, let me test master against your GFF3. It was the guppy one, right?
Basically, the Consider lowering requirements. Please see the manual.
bug shows up when the results of the GFF3 parser has different gene names than what gff3ToGenePred
produces. I need to add a check for this that kills the program there rather than let the whole thing go to the end. When there is a mismatch, the inner join that Joel was referring to will break things.
So, let me make sure that your reference works against the current master, and add that test, then you can try again.
Yeah, I am was referring to the guppy one ;)
re: what broke in the parser, I can track it down to the attrsOut parser getting one set of transcript ID->gene ID mappings, while gff3ToGenePred gets a different set. The reference database generated by that parser generates records that look something like this:
TranscriptId GeneId TranscriptName GeneName GeneBiotype TranscriptBiotype
0 NM_001081757.1 rna28419 2'-5'-oligoadenylate synthetase 3 OAS3 protein_coding protein_coding
which seems to map a transcript back to its RNA, not to its gene. The genePred seems to give a more sensible transcript->gene mapping, though (NB this is for a different transcript):
rna3935 NC_009144.3 + 127306519 127313991 127306589 127308782 4 127306519,127307753,127308668,127313055, 127306679,127307843,127308800,127313991, 0 gene1486 cmpl cmpl 0,0,0,-1,
I think a reasonable solution would be to take the parent->child mappings from the genePred, and only take the biotype and names from the attrsOut file; would that make sense?
That seems sane to me, that way it is guaranteed that the transcriptID -> GeneID mapping is consistent. The other attributes are somewhat superfluous, except for coding vs. non-coding. I can work on this change later today.
On Wed, Aug 22, 2018 at 2:23 PM Joel Armstrong notifications@github.com wrote:
re: what broke in the parser, I can track it down to the attrsOut parser getting one set of transcript ID->gene ID mappings, while gff3ToGenePred gets a different set. The reference database generated by that parser generates records that look something like this:
TranscriptId GeneId TranscriptName GeneName GeneBiotype TranscriptBiotype
0 NM_001081757.1 rna28419 2'-5'-oligoadenylate synthetase 3 OAS3 protein_coding protein_coding
which seems to map a transcript back to its RNA, not to its gene. The genePred seems to give a more sensible transcript->gene mapping, though (NB this is for a different transcript):
rna3935 NC_009144.3 + 127306519 127313991 127306589 127308782 4 127306519,127307753,127308668,127313055, 127306679,127307843,127308800,127313991, 0 gene1486 cmpl cmpl 0,0,0,-1,
I think a reasonable solution would be to take the parent->child mappings from the genePred, and only take the biotype and names from the attrsOut file; would that make sense?
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415185417, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXRYW7bfNRqOAz0gD8pmtl38rNHxpks5uTcvngaJpZM4Vr9mG .
-- Ian Fiddes, PhD
Anything that I can test run over the weekend?
Ok, I have the new branch gff3_parser_rewrite
that follows @joelarmstrong's advice and relies explicitly on the tx_id -> gene_id mapping of the gff3ToGenePred output. In my test it worked for the Guppy reference as well as for Ensembl and GENCODE human. For guppy, I see the following transcript type counts:
Counter({'C_region': 7,
'V_segment': 61,
'misc_RNA': 801,
'ncRNA': 3788,
'protein_coding': 43715,
'rRNA': 2,
'tRNA': 360})
Please try it out!
Getting:
Traceback (most recent call last):
File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run
new_deps = self._run_get_new_deps()
File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
task_gen = self.task.run()
File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/__init__.py", line 801, in run
tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp)
NameError: global name 'args' is not defined
ERROR:luigi-interface:[pid 22225] Worker Worker(salt=517015710, workers=16, host=burrito, username=fbemm, pid=21378) failed Task: Gff3ToAttrs
Traceback (most recent call last):
File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run
new_deps = self._run_get_new_deps()
File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
task_gen = self.task.run()
File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/__init__.py", line 801, in run
tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp)
NameError: global name 'args' is not defined
Looks like I messed up something during the update.
Oops! I just pushed a fix for this. That is what I get for not running the test set because I was on a train...
On Fri, Aug 24, 2018 at 2:36 PM Felix Bemm notifications@github.com wrote:
Getting:
Traceback (most recent call last): File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run new_deps = self._run_get_new_deps() File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 801, in run tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp) NameError: global name 'args' is not defined ERROR:luigi-interface:[pid 22225] Worker Worker(salt=517015710, workers=16, host=burrito, username=fbemm, pid=21378) failed Task: Gff3ToAttrs Traceback (most recent call last): File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run new_deps = self._run_get_new_deps() File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 801, in run tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp) NameError: global name 'args' is not defined
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415888969, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXd6skbBTGOjBqnKmjkg-1sAo6kMQks5uUHH5gaJpZM4Vr9mG .
-- Ian Fiddes, PhD
Runtime error: Traceback (most recent call last): File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 199, in run new_deps = self._run_get_new_deps() File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 855, in run results.append([gene_id, tx_id, tx_name, gene_name, gene_biotype, tx_biotype]) UnboundLocalError: local variable 'tx_name' referenced before assignment
;)
Interesting, I ran the guppy GFF3 through this in a notebook and it worked... let me take a look
On Fri, Aug 24, 2018 at 3:09 PM Felix Bemm notifications@github.com wrote:
Runtime error:
Traceback (most recent call last):
File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 199, in run new_deps = self._run_get_new_deps() File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 855, in run results.append([gene_id, tx_id, tx_name, gene_name, gene_biotype, tx_biotype]) UnboundLocalError: local variable 'tx_name' referenced before assignment
;)
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415895321, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXQ1n7xJ5alEUE764Qx2Sh5SohSP-ks5uUHmXgaJpZM4Vr9mG .
-- Ian Fiddes, PhD
Ok try now (I had the indentation different in the notebook... sigh)
On Fri, Aug 24, 2018 at 3:11 PM Ian Fiddes ian.t.fiddes@gmail.com wrote:
Interesting, I ran the guppy GFF3 through this in a notebook and it worked... let me take a look
On Fri, Aug 24, 2018 at 3:09 PM Felix Bemm notifications@github.com wrote:
Runtime error:
Traceback (most recent call last):
File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 199, in run new_deps = self._run_get_new_deps() File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 855, in run results.append([gene_id, tx_id, tx_name, gene_name, gene_biotype, tx_biotype]) UnboundLocalError: local variable 'tx_name' referenced before assignment
;)
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415895321, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXQ1n7xJ5alEUE764Qx2Sh5SohSP-ks5uUHmXgaJpZM4Vr9mG .
-- Ian Fiddes, PhD
-- Ian Fiddes, PhD
I removed some duplicates back in the early times of CAT. Let me try with the original NCBI GFF3.
Chaining and generating hints. SO we are running. Thx! I'll keep you updated.
There was a bug, I found it. However, you should be able to use the NCBI GFF3 as-is now (as long as your gff3ToGenePred is new enough, Mark fixed the problem sometime in March or so).
On Fri, Aug 24, 2018 at 3:14 PM Felix Bemm notifications@github.com wrote:
I removed some duplicates back in the early times of CAT. Let me try with the original NCBI GFF3.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415896175, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXcQVw87utbu01HhBWg3OjxWswOQdks5uUHqugaJpZM4Vr9mG .
-- Ian Fiddes, PhD
Closing issue here, keeping #97 open for now.
I got following error at almost the final step, ConsensusDriverTask:
08-01 08:33:56 cat INFO Generating consensus gene set for Nestor_notabilis. 08-01 08:33:57 luigi-interface ERROR [pid 25712] Worker Worker(salt=800833810, workers=1, host=cngb-supermem-a15-2.cngb.sz.hpc, username=dengyuan, pid=25712) failed Task: ConsensusDriverTask for Nestor_notabilis Traceback (most recent call last): File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/compilation_tool/Python-2.7.15/lib/python2.7/site-packages/luigi/worker.py", line 205, in run new_deps = self._run_get_new_deps() File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/compilation_tool/Python-2.7.15/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps task_gen = self.task.run() File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/basic_bio-tool/Comparative-Annotation-Toolkit/cat/__init__.py", line 2070, in run metrics_dict = generate_consensus(consensus_args) File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/basic_bio-tool/Comparative-Annotation-Toolkit/cat/consensus.py", line 75, in generate_consensus 'Consider lowering requirements. Please see the manual.'.format(args.genome)) RuntimeError: No transcripts pass filtering for species Nestor_notabilis. Consider lowering requirements. Please see the manual.
It gave an advice to lower requirements. However, I didn't find the options related to the Consensus tasks. Which options can I set? Thanks!