ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0
164 stars 48 forks source link

ConsensusDriverTask - how to lower the requirements of transcripts #105

Closed dengyuan7 closed 6 years ago

dengyuan7 commented 6 years ago

I got following error at almost the final step, ConsensusDriverTask:

08-01 08:33:56 cat INFO Generating consensus gene set for Nestor_notabilis. 08-01 08:33:57 luigi-interface ERROR [pid 25712] Worker Worker(salt=800833810, workers=1, host=cngb-supermem-a15-2.cngb.sz.hpc, username=dengyuan, pid=25712) failed Task: ConsensusDriverTask for Nestor_notabilis Traceback (most recent call last): File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/compilation_tool/Python-2.7.15/lib/python2.7/site-packages/luigi/worker.py", line 205, in run new_deps = self._run_get_new_deps() File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/compilation_tool/Python-2.7.15/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps task_gen = self.task.run() File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/basic_bio-tool/Comparative-Annotation-Toolkit/cat/__init__.py", line 2070, in run metrics_dict = generate_consensus(consensus_args) File "/hwfssz5/ST_DIVERSITY/B10K/PUB/local/basic_bio-tool/Comparative-Annotation-Toolkit/cat/consensus.py", line 75, in generate_consensus 'Consider lowering requirements. Please see the manual.'.format(args.genome)) RuntimeError: No transcripts pass filtering for species Nestor_notabilis. Consider lowering requirements. Please see the manual.

It gave an advice to lower requirements. However, I didn't find the options related to the Consensus tasks. Which options can I set? Thanks!

ifiddes-10x-zz commented 6 years ago

Hmmm, this is probably a bug not an actual problem with the requirements. Did you set any requirements? They include things like intron-rnaseq-support and other support-level flags. In the default mode, these are all maximally permissive.

How many transcripts survived transMap filtering? If you do wc -l transMap/Nestor_notabilis* what do you see? Are you running with RNA-seq, or just with transMap?

dengyuan7 commented 6 years ago

Hi Ifiddes,

I used the default options without setting any requirements, so I guess it's not the source of the error. I entered the protein information for each genome. There are also outputs of augustus: 3 transMap/Nestor_notabilis.filtered.gp 3 transMap/Nestor_notabilis.filtered.psl 3 transMap/Nestor_notabilis.gp 75 transMap/Nestor_notabilis.gtf 3 transMap/Nestor_notabilis.psl 4 transMap/Nipponia_nippon.filtered.gp 4 transMap/Nipponia_nippon.filtered.psl 4 transMap/Nipponia_nippon.gp 92 transMap/Nipponia_nippon.gtf 4 transMap/Nipponia_nippon.psl 3 augustus/Nestor_notabilis.augTM.gp 80 augustus/Nestor_notabilis.augTM.gtf 4 augustus/Nipponia_nippon.augTM.gp 102 augustus/Nipponia_nippon.augTM.gtf I listed the lines of Nipponia_nippon as a comparison since it successfully got the consensus transcripts. The transcripts are so small because I only entered four genes.

ifiddes-10x-zz commented 6 years ago

Can you send me the contents of the transMap folder for these two genomes? I will try to see what happened.

dengyuan7 commented 6 years ago

I uploaded a package of transMap folder on Google driver: https://drive.google.com/open?id=1QPlF9mH3XdhNj_bufcrZk-OEvDh5VANE

ifiddes-10x-zz commented 6 years ago

I don't see any problem with these files. The coverage and identity is a bit lower in Nipponia_nippon, but well within the range that CAT should be able to handle. Something more subtle is going on. I just created a new branch, consensus_bug. Can you check out this branch and re-run the code? This is going to dump all of the data going into the consensus finding step to a pickle file in the consensus output directory that you can send to me so I can try to figure this out. Thanks!

dengyuan7 commented 6 years ago

Dear @ifiddes-10x ,

I finished the running of consensus_bug, it failed at HgmDriverTask.

08-09 05:29:18 luigi-interface ERROR [pid 67036] Worker Worker(salt=375807751,workers=1,host=cngb-supermem-a14-1.cngb.sz.hpc, username=dengyuan, pid=67036) failed    Task: HgmDriverTask for augTM
Traceback (most recent call last):
  File "/hwfssz1/ST_DIVERSITY/PUB/USER/dengyuan/software/Python-2.7.13/lib/python2.7/site-packages/luigi/worker.py", line 203, in run
    new_deps = self._run_get_new_deps()
  File "/hwfssz1/ST_DIVERSITY/PUB/USER/dengyuan/software/Python-2.7.13/lib/python2.7/site-packages/luigi/worker.py", line 140, in _run_get_new_deps
    task_gen = self.task.run()
  File "/hwfssz5/ST_DIVERSITY/B10K/PUB/USER/dengyuan/consensus_bug/Comparative-Annotation-Toolkit/cat/__init__.py", line 1725, in run
    df = parse_hgm_gtf(hgm_args.gtf_out_files[genome], genome)
  File "/hwfssz5/ST_DIVERSITY/B10K/PUB/USER/dengyuan/consensus_bug/Comparative-Annotation-Toolkit/cat/hgm.py", line 227, in parse_hgm_gtf
    with open(hgm_out) as infile:
IOError: [Errno 2] No such file or directory: '/hwfssz5/ST_DIVERSITY/B10K/USER/dengyuan/CAT/4_genes/cat_work-consensus_bug/hgm/augTM/Struthio_camelus.gtf'

It successfully generated the hgm tables for transMap files, and Struthio_camelus.augTM.gtf exist in the directory cat_work-consensus_bug/augustus. It's weird that there is no any error of running consensus_bug for the test data.

ifiddes-10x-zz commented 6 years ago

Uhoh, I think this is the same problem as issue #102 -- something is wrong with homGeneMapping that doesn't happen on the test dataset for a reason I don't know yet. I asked the reporter in that issue to provide me the input files, but he hasn't gotten back to me yet.

Since I see you are part of B10K, is the input to this a HAL file that UCSC / @joelarmstrong generated? Is there a reasonable way to get the input files to me so I can try to reproduce this? I think all I actually would need is the HAL and your input GFF3, so if Joel has the HAL already I just need your GFF3.

ifiddes-10x-zz commented 6 years ago

Also, did you upgrade your homGeneMapping installation between the two runs? Possibly due to the fix for augustusCGP that was added to 3.3.1?

fbemm commented 6 years ago

Seing the same issue now on my PRET dataset. I upgraded from augustus 3.3.1 to the latest GitHub commit during the run to fix #109. I could package the consensus input data and post it to Mario if needed. Would be great to get that fixed. F

ifiddes-10x-zz commented 6 years ago

If you can build a simple test set that triggers this to pass to Mario, that would be awesome. @joelarmstrong was able to construct one last time that made it much easier for them to fix the problem. Let me know if you need any help hacking the pipeline to construct such a set. Sorry for the headaches, hopefully as we fix these bugs we can eventually set up the Docker packaging such that this kind of thing doesn't happen anymore.

joelarmstrong commented 6 years ago

You can get this same Consider lowering requirements. Please see the manual. issue when using NCBI GFF3s, I think--we recently ran into a GFF parsing issue with the rhino GFF. (Working on a fix, but as a kludge, we were able to change around the inner joins in the consensus code to get it to run properly.)

ifiddes-10x-zz commented 6 years ago

oh, I misread which original issue this is on. Sorry, I was thinking this was an augustus issue. If you are experiencing the lowering requirements message, then yes it is probably a problem with the GFF3 parser. Did you try on the latest master?

Joel, do you know what in the parser broke for rhino? Send it along to me and I can take a look. Is this newer than our discussions last week?

fbemm commented 6 years ago

I was testing the fbemm branch that was supposed to fix the prediction of non-coding genes. Attempt to use master again?

ifiddes-10x-zz commented 6 years ago

Sorry, I forgot to mention that I made another attempt in master to re-work the GFF3 parser to handle the issue for a different annotation set. Before you go ahead, let me test master against your GFF3. It was the guppy one, right?

Basically, the Consider lowering requirements. Please see the manual. bug shows up when the results of the GFF3 parser has different gene names than what gff3ToGenePred produces. I need to add a check for this that kills the program there rather than let the whole thing go to the end. When there is a mismatch, the inner join that Joel was referring to will break things.

So, let me make sure that your reference works against the current master, and add that test, then you can try again.

fbemm commented 6 years ago

Yeah, I am was referring to the guppy one ;)

joelarmstrong commented 6 years ago

re: what broke in the parser, I can track it down to the attrsOut parser getting one set of transcript ID->gene ID mappings, while gff3ToGenePred gets a different set. The reference database generated by that parser generates records that look something like this:

          TranscriptId    GeneId                                     TranscriptName   GeneName     GeneBiotype TranscriptBiotype
0      NM_001081757.1  rna28419                  2'-5'-oligoadenylate synthetase 3       OAS3  protein_coding    protein_coding

which seems to map a transcript back to its RNA, not to its gene. The genePred seems to give a more sensible transcript->gene mapping, though (NB this is for a different transcript):

rna3935 NC_009144.3     +       127306519       127313991       127306589       127308782       4       127306519,127307753,127308668,127313055,        127306679,127307843,127308800,127313991,        0       gene1486        cmpl    cmpl    0,0,0,-1,

I think a reasonable solution would be to take the parent->child mappings from the genePred, and only take the biotype and names from the attrsOut file; would that make sense?

ifiddes commented 6 years ago

That seems sane to me, that way it is guaranteed that the transcriptID -> GeneID mapping is consistent. The other attributes are somewhat superfluous, except for coding vs. non-coding. I can work on this change later today.

On Wed, Aug 22, 2018 at 2:23 PM Joel Armstrong notifications@github.com wrote:

re: what broke in the parser, I can track it down to the attrsOut parser getting one set of transcript ID->gene ID mappings, while gff3ToGenePred gets a different set. The reference database generated by that parser generates records that look something like this:

      TranscriptId    GeneId                                     TranscriptName   GeneName     GeneBiotype TranscriptBiotype

0 NM_001081757.1 rna28419 2'-5'-oligoadenylate synthetase 3 OAS3 protein_coding protein_coding

which seems to map a transcript back to its RNA, not to its gene. The genePred seems to give a more sensible transcript->gene mapping, though (NB this is for a different transcript):

rna3935 NC_009144.3 + 127306519 127313991 127306589 127308782 4 127306519,127307753,127308668,127313055, 127306679,127307843,127308800,127313991, 0 gene1486 cmpl cmpl 0,0,0,-1,

I think a reasonable solution would be to take the parent->child mappings from the genePred, and only take the biotype and names from the attrsOut file; would that make sense?

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415185417, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXRYW7bfNRqOAz0gD8pmtl38rNHxpks5uTcvngaJpZM4Vr9mG .

-- Ian Fiddes, PhD

fbemm commented 6 years ago

Anything that I can test run over the weekend?

ifiddes commented 6 years ago

Ok, I have the new branch gff3_parser_rewrite that follows @joelarmstrong's advice and relies explicitly on the tx_id -> gene_id mapping of the gff3ToGenePred output. In my test it worked for the Guppy reference as well as for Ensembl and GENCODE human. For guppy, I see the following transcript type counts:

Counter({'C_region': 7,
         'V_segment': 61,
         'misc_RNA': 801,
         'ncRNA': 3788,
         'protein_coding': 43715,
         'rRNA': 2,
         'tRNA': 360})

Please try it out!

fbemm commented 6 years ago

Getting:

Traceback (most recent call last):
  File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run
    new_deps = self._run_get_new_deps()
  File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
    task_gen = self.task.run()
  File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/__init__.py", line 801, in run
    tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp)
NameError: global name 'args' is not defined
ERROR:luigi-interface:[pid 22225] Worker Worker(salt=517015710, workers=16, host=burrito, username=fbemm, pid=21378) failed    Task: Gff3ToAttrs
Traceback (most recent call last):
  File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run
    new_deps = self._run_get_new_deps()
  File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps
    task_gen = self.task.run()
  File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/__init__.py", line 801, in run
    tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp)
NameError: global name 'args' is not defined

Looks like I messed up something during the update.

ifiddes commented 6 years ago

Oops! I just pushed a fix for this. That is what I get for not running the test set because I was on a train...

On Fri, Aug 24, 2018 at 2:36 PM Felix Bemm notifications@github.com wrote:

Getting:

Traceback (most recent call last): File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run new_deps = self._run_get_new_deps() File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 801, in run tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp) NameError: global name 'args' is not defined ERROR:luigi-interface:[pid 22225] Worker Worker(salt=517015710, workers=16, host=burrito, username=fbemm, pid=21378) failed Task: Gff3ToAttrs Traceback (most recent call last): File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 205, in run new_deps = self._run_get_new_deps() File "/ebio/abt6_projects9/abt6_software/bin/cat/env/lib/python2.7/site-packages/luigi/worker.py", line 142, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 801, in run tx_dict = tools.transcripts.get_gene_pred_dict(args.annotation_gp) NameError: global name 'args' is not defined

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415888969, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXd6skbBTGOjBqnKmjkg-1sAo6kMQks5uUHH5gaJpZM4Vr9mG .

-- Ian Fiddes, PhD

fbemm commented 6 years ago

Runtime error: Traceback (most recent call last): File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 199, in run new_deps = self._run_get_new_deps() File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 855, in run results.append([gene_id, tx_id, tx_name, gene_name, gene_biotype, tx_biotype]) UnboundLocalError: local variable 'tx_name' referenced before assignment

;)

ifiddes commented 6 years ago

Interesting, I ran the guppy GFF3 through this in a notebook and it worked... let me take a look

On Fri, Aug 24, 2018 at 3:09 PM Felix Bemm notifications@github.com wrote:

Runtime error:

Traceback (most recent call last):

File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 199, in run new_deps = self._run_get_new_deps() File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 855, in run results.append([gene_id, tx_id, tx_name, gene_name, gene_biotype, tx_biotype]) UnboundLocalError: local variable 'tx_name' referenced before assignment

;)

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415895321, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXQ1n7xJ5alEUE764Qx2Sh5SohSP-ks5uUHmXgaJpZM4Vr9mG .

-- Ian Fiddes, PhD

ifiddes commented 6 years ago

Ok try now (I had the indentation different in the notebook... sigh)

On Fri, Aug 24, 2018 at 3:11 PM Ian Fiddes ian.t.fiddes@gmail.com wrote:

Interesting, I ran the guppy GFF3 through this in a notebook and it worked... let me take a look

On Fri, Aug 24, 2018 at 3:09 PM Felix Bemm notifications@github.com wrote:

Runtime error:

Traceback (most recent call last):

File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 199, in run new_deps = self._run_get_new_deps() File "/ebio/abt6/fbemm/.miniconda2/lib/python2.7/site-packages/luigi/worker.py", line 139, in _run_get_new_deps task_gen = self.task.run() File "/ebio/abt6_projects9/abt6_software/bin/cat/cat/init.py", line 855, in run results.append([gene_id, tx_id, tx_name, gene_name, gene_biotype, tx_biotype]) UnboundLocalError: local variable 'tx_name' referenced before assignment

;)

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415895321, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXQ1n7xJ5alEUE764Qx2Sh5SohSP-ks5uUHmXgaJpZM4Vr9mG .

-- Ian Fiddes, PhD

-- Ian Fiddes, PhD

fbemm commented 6 years ago

I removed some duplicates back in the early times of CAT. Let me try with the original NCBI GFF3.

fbemm commented 6 years ago

Chaining and generating hints. SO we are running. Thx! I'll keep you updated.

ifiddes commented 6 years ago

There was a bug, I found it. However, you should be able to use the NCBI GFF3 as-is now (as long as your gff3ToGenePred is new enough, Mark fixed the problem sometime in March or so).

On Fri, Aug 24, 2018 at 3:14 PM Felix Bemm notifications@github.com wrote:

I removed some duplicates back in the early times of CAT. Let me try with the original NCBI GFF3.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/105#issuecomment-415896175, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXcQVw87utbu01HhBWg3OjxWswOQdks5uUHqugaJpZM4Vr9mG .

-- Ian Fiddes, PhD

ifiddes commented 6 years ago

Closing issue here, keeping #97 open for now.