ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0

CAT consensus_gene_set output #256

Open francicco opened 3 years ago

francicco commented 3 years ago

Hi,

I'm looking at the end product of CAT, run using only TransMap and with the indel classification step, r.extend(find_indels(tx, psl, aln_mode)), excluded.

I'm now trying to build a table of orthologs and paralogs. The gp_info files should contain all the information I need, but I couldn't find a detailed description of them. The format seems fairly intuitive, although I'm a bit confused. The 4th field, transcript_class, should be the relevant one for what I'm doing, because it generally contains ortholog, poor_alignment or possible_paralog, but sometimes it contains a gene name instead (e.g. Eisa.Eisa1Z00G52.1.cg14688).

I'm confused, maybe something went wrong?
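One way to see whether the odd values are isolated or widespread is to tabulate that column. The following is only a minimal sketch: it assumes the gp_info file is tab-separated with a header row that names the column `transcript_class`, and the helper name `summarize_transcript_class` is made up for illustration.

```python
import csv
from collections import Counter

# Classes mentioned in this thread; anything else is reported as unexpected.
EXPECTED_CLASSES = {"ortholog", "poor_alignment", "possible_paralog"}


def summarize_transcript_class(lines, column="transcript_class"):
    """Count the values of `column` and collect rows with unexpected values.

    `lines` is any iterable of text lines (an open file handle works).
    The column name is an assumption about the gp_info header.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    unexpected = []
    for row in reader:
        value = row[column]
        counts[value] += 1
        if value not in EXPECTED_CLASSES:
            unexpected.append(row)
    return counts, unexpected
```

Calling it on an open gp_info file returns the per-class counts plus the rows whose class looks like a gene name, which can then be inspected by hand.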

The other thing is related to the indel classification. Does removing that step introduce any bias I should be aware of?

Thanks a lot. F

francicco commented 3 years ago

Yeah... maybe... :/ F

francicco commented 3 years ago

Hi @mhaukness-ucsc

I wanted to share this plot with you:

Screenshot 2021-05-11 at 11 49 02

This plot shows the number of genes per species across the different iterations. Iteration 0 shows the number of genes before I started CAT. It seems that some species gained annotated genes after the 1st iteration, but more or less decreased their totals after the 2nd. Some dropped significantly. The other thing is that the total numbers seem to converge.

I'm not sure if this is expected or a good thing. Any thoughts? Cheers F

mhaukness-ucsc commented 3 years ago

Hmm, this is a bit unexpected to me, I don't think the number of genes should drop so low after the second iteration. How were you using the results of iteration 1 in the next? What did your config files for CAT look like for each round?

francicco commented 3 years ago

I converted the gff3 into a gff3 format that CAT can digest, checking the files with validate_gff3, in the same way I generated the first-iteration gff3 files. The conf file is just species = path/to/the/new/gff3. Nothing fancy. F

mhaukness-ucsc commented 3 years ago

I'm not sure what's going on, maybe looking at your data would help. Would it be possible to find an example of a gene in one of your species that was present after the first iteration but lost in the second round? And then share browser screenshots of that region in the genome for both rounds? (With all of the tracks under the "Comparative Annotation Toolkit" section turned on? )

francicco commented 3 years ago

I took a few days off. As soon as I'm back I'll check this. Thanks a lot F

francicco commented 3 years ago

Here are some examples:

Screenshot 2021-05-16 at 18 55 04 Screenshot 2021-05-16 at 18 57 07 Screenshot 2021-05-16 at 18 58 32 Screenshot 2021-05-16 at 19 01 00

And many more. F

francicco commented 3 years ago

The other thing I noticed is that the gff3 output of CAT is sometimes not formatted correctly.

In this case, for example, the stop_codon feature is missing.

Hmel200216o     CAT     transcript      6148    8027    9140    +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcr
Hmel200216o     CAT     exon    6148    6341    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     CDS     6240    6341    .       +       0       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     start_codon     6240    6242    .       +       0       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcr
Hmel200216o     CAT     intron  6342    6417    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     exon    6418    6529    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     CDS     6418    6529    .       +       0       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     intron  6530    6597    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     exon    6598    6791    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     CDS     6598    6791    .       +       2       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     intron  6792    7013    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     exon    7014    7242    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     CDS     7014    7242    .       +       0       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     intron  7243    7518    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     exon    7519    7639    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     CDS     7519    7639    .       +       2       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     intron  7640    7742    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     exon    7743    8027    .       +       .       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o     CAT     CDS     7743    7920    .       +       1       source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode

This created a CDS error in one of my scripts. I'm not sure if this is the source of the problem. I'll fix it and rerun it.

What bothers me a bit is that the gff3 output of CAT can't be used as input in the following iterations, so I have to readjust it.
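To find every transcript affected by this, a small checker can scan the gff3. This is a sketch under assumptions: it groups records by the `source_transcript` attribute seen in the listing above (if several output transcripts share one source transcript, a unique transcript ID attribute should be passed as `key` instead), and the function names are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key] = value
    return attrs


def transcripts_missing_codon(lines, feature="stop_codon", key="source_transcript"):
    """Return sorted transcript IDs that have CDS records but no `feature` record."""
    seen = defaultdict(set)  # transcript ID -> set of feature types observed
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 record
        tx = parse_attributes(cols[8]).get(key)
        if tx is not None:
            seen[tx].add(cols[2])
    return sorted(
        tx for tx, types in seen.items() if "CDS" in types and feature not in types
    )
```

Passing `feature="start_codon"` lists the transcripts that lack a start codon instead, so a downstream script can decide how to treat each group.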

Cheers F

mhaukness-ucsc commented 3 years ago

Thanks for providing these examples! This is helpful for trying to improve the features of CAT. There are a few things going on here...

1) For the examples that have something in the Filtered transMap track but not the CAT Annotation, I think there is a bug. I encountered something similar in a run of my own a couple weeks ago, so I have opened an issue to track this (#259) and I will figure out what's going on.

2) Do you have the --filter-overlapping-genes flag on like before? If so it might be filtering out useful things, like when a tiny portion of a gene is caught by transMap and that is chosen over the full-length annotation from the external reference. One option could be to try rerunning the pipeline (at least the second iteration) with this flag off and see if it helps? It would have the same issues as before with the overlapping genes but it might save a good amount of transcripts. Then you could do a final filtering to get rid of the overlapping genes at the very end of all your rounds.

3) In general there are some things I want to improve about how an external reference is used. For your set specifically I think it would be good to prefer the long external reference annotations over the short transMaps (if you don't allow for overlapping genes). I'd also like it to track gene IDs better. One other thing I'd like to verify is if the short transMaps correspond to the long external reference. If you look at the first example, is the short transcript in the filtered transMap the same gene as the one in the external reference?

4) For the gff3, I'm guessing the transcript doesn't have a valid stop codon? So the start codon makes it into the gff3 but not the stop. I can see that there aren't equal numbers of start and stop codons in some other gff3s I have. I can't think of a good reason to change this in the gff3 -- maybe cases like this can be handled in the downstream script? However, I do agree that the output of CAT should be valid input -- I have another issue for this (#241) (oh and also #246) and I'll prioritize fixing this too. Have you observed any additional issues with the gff3? (Other than some records not having gene_name and transcript_name, like I mentioned in that issue.)

I'll let you know if/when I have fixes for the bugs mentioned!

francicco commented 3 years ago

Thanks a lot @mhaukness-ucsc!

I also have indels:

Screenshot 2021-05-17 at 20 16 53

These are really hard to handle, and I don't know what to do with them. A better option would be to not have them at all if the user doesn't want them. I could try to skip pseudo-introns (indels), assuming they are smaller than some threshold X (?).

Also, some transcripts don't have a start_codon, and the beginning of the CDS coincides with the beginning of the transcript, which sometimes does not correspond to the first codon position of the real reading frame.
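The threshold idea could be sketched as below. This is not something CAT does itself; `merge_small_gaps` and `max_gap` are hypothetical names, and coordinates are assumed to be 1-based and inclusive as in gff3.

```python
def merge_small_gaps(exons, max_gap):
    """Merge adjacent exons separated by a gap of at most `max_gap` bases.

    `exons` is a list of (start, end) tuples, 1-based inclusive.
    Returns a new list sorted by start coordinate.
    """
    merged = []
    for start, end in sorted(exons):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end - 1  # bases strictly between the two exons
            if gap <= max_gap:
                # Treat the gap as a pseudo-intron and fuse the two exons.
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

Note that fusing exons across a gap whose length is not a multiple of 3 shifts the reading frame, so the CDS would have to be re-annotated on the merged exon afterwards.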

Screenshot 2021-05-17 at 20 24 12

Yes, --filter-overlapping-genes is on.

How should I proceed? Thanks a lot for your precious help! F

diekhans commented 3 years ago

I also have indels:

The intention of Augustus is to convert those kinds of gaps, caused by evolutionary changes, into valid gene models.

mhaukness-ucsc commented 3 years ago

Yes, running Augustus would help improve these gene models. I know you've tried AugustusPB and AugustusCGP, did you ever try AugustusTM?

This looks like a case where the genome that was used to make your external reference was better than the current reference being used, but since you have filter-overlapping-genes on, the transMap gets chosen (even though the external reference has a better alignment).

In the meantime while I work on the other bugs mentioned, you could try turning that flag off, as well as running CAT with AugustusTM on (if you're able to get it working, it should be easier than the other Augustus types).

francicco commented 3 years ago

Ok, I'll run --AugustusTM, and see what happens. I can add it to a finished job, right? F

mhaukness-ucsc commented 3 years ago

You may be able to add the --augustus flag to your command and have it use some of the intermediate work files, but it will have to run a lot of things over again.

francicco commented 3 years ago

--augustus is for AugustusTM?

mhaukness-ucsc commented 3 years ago

yes!

francicco commented 3 years ago

The run seems to be over, but for some reason the Augustus track is not present in the hub. Here is a screenshot of the locus Hmel200210o:16,084-30,771; the indels still seem to be there.

Screenshot 2021-05-18 at 10 09 10

Here is a close-up:

Screenshot 2021-05-18 at 10 12 38

I could run it from scratch, but it'll take 4,5 days... F

mhaukness-ucsc commented 3 years ago

It looks like augustusTM did run, you can see the gene models it predicted in the augTM track. The predictions might not have been better than TransMap to make it into the final consensus set. You can check how many transcripts were from augustusTM by checking how many have augTM as a transcript mode, and this data is also visible in the output plot transcript_modes.pdf. Can you share that plot?
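For a quick count outside the plot, something like the following could work. It is a sketch under assumptions: the consensus gp_info file is taken to be tab-separated with a header, the column is assumed to be called `transcript_modes`, and a cell is assumed to hold either one mode or a comma-separated list of modes.

```python
import csv
from collections import Counter


def count_transcript_modes(lines, column="transcript_modes"):
    """Count how many transcripts list each mode (e.g. transMap, augTM).

    `lines` is any iterable of text lines; the column name and the
    comma-separated layout are assumptions, not confirmed in this thread.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        for mode in row[column].split(","):
            mode = mode.strip()
            if mode:
                counts[mode] += 1
    return counts
```

If `counts["augTM"]` is close to zero, AugustusTM ran but its predictions rarely made it into the consensus set.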

The indels are still coming from the transMap (the new reference being used in this round is probably not as great of a match for this species as the last one). I think that your external reference should be incorporated into the final annotation. Is filter-overlapping-genes still on? If so, could you try this again without it and see if those predictions make it in?

francicco commented 3 years ago

I added the augTM manually

francicco commented 3 years ago

Screenshot 2021-05-18 at 20 52 02

transcript_modes.pdf

Is there anywhere I can change the size of the labels?

francicco commented 3 years ago

filter-overlapping-genes is still on. I'll try without

The indels are still coming from the transMap (the new reference being used in this round is probably not as great of a match for this species as the last one). I think that your external reference should be incorporated into the final annotation.

I didn't get this point F

mhaukness-ucsc commented 3 years ago

Oh hmm, I don't know why the AugustusTM track wouldn't be showing up in the hub (the hub should rebuild during your run... try reloading it from scratch into the browser?) But at least you can see the amount of green TM transcripts in the plot. There is no easy way to change the labels but it would be good to auto-adjust the size of the labels based on the number of genomes you have, I'll try to work that into a future update.

As for the indels, I was trying to say that the indel is likely "real" relative to the annotation in whatever species you are using as a reference. So it shows up in the transMap alignment, and because filter-overlapping-genes is on, only one gene can make it into the annotation, which is the one with the indel. So my idea is that if that flag is turned off, things from the external reference will also be added to the annotation. You'd have the gene from transMap with the indel, and the gene from the external reference without. Then at the end of your rounds of annotation, all of the conflicts could be resolved to select only the best gene for each locus.
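The final conflict resolution could be done outside CAT with a simple greedy pass. This is only a sketch of one possible approach, not CAT's own logic: the `Gene` record and its `score` field are hypothetical, and how to score a gene (length, alignment quality, source) is left to the user.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    strand: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    score: float  # higher is better; the scoring scheme is up to the user


def overlaps(a, b):
    """True if two genes share at least one base on the same strand."""
    return (
        a.chrom == b.chrom
        and a.strand == b.strand
        and a.start <= b.end
        and b.start <= a.end
    )


def resolve_overlaps(genes):
    """Keep the best-scoring gene per locus, dropping any that overlap it."""
    kept = []
    for gene in sorted(genes, key=lambda g: g.score, reverse=True):
        if not any(overlaps(gene, other) for other in kept):
            kept.append(gene)
    return sorted(kept, key=lambda g: (g.chrom, g.start))
```

The pairwise check is quadratic in the number of genes, which is acceptable for a sketch; an interval index would be the natural replacement for large annotation sets.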

francicco commented 3 years ago

Here is the run without filter-overlapping-genes: link

and here a screenshot:

Screenshot 2021-05-18 at 21 49 48

The indels are still there:

Screenshot 2021-05-18 at 21 53 48

and the new plot transcript_modes.pdf

Screenshot 2021-05-18 at 21 51 02

francicco commented 3 years ago

One way to manage the indels is to remove them, generating a continuous exon, and annotate the CDS on it

This is the same indel treated in this way:

Screenshot 2021-05-18 at 21 56 31

The external_reference is in black, the reannotated CDS in light blue.

Here is a zoomed-out view:

Screenshot 2021-05-18 at 21 58 37

I just don't know if that would be biologically correct. F

mhaukness-ucsc commented 3 years ago

Thanks for the browser link, I looked through it and it helped me get a better understanding of the types of situations that are popping up. This particular example does look like a tricky case, because the external reference doesn't have the indel, but it does have a different splice in the exon before. Maybe that is causing the externalReference scores to be worse so they don't end up getting incorporated.

I'll trace through this one particular example to see what is happening with the scoring and why CAT chooses to not incorporate the predictions from the external reference. Maybe there will be a fix I can make to solve this. (One idea is to add an option to treat the existing annotations for the references that have them preferentially over TransMap?) I've made a couple small changes to the code on another branch, so I'll keep adding to it and have you test it out once I think it will actually improve your results.

It's hard to say what is correct in regards to the biology. Most likely this frameshift isn't real (it causes an in frame premature stop!), and maybe that's because the assembly has an error here. Maybe reannotating the CDS like you showed is the best option...

francicco commented 3 years ago

Here are other messy loci: GB

Screenshot 2021-05-19 at 16 15 45

GB

Screenshot 2021-05-19 at 16 18 05

GB

Screenshot 2021-05-19 at 16 27 55

francicco commented 3 years ago

This is another example where the transMap is favored over the reference

Screenshot 2021-05-20 at 11 53 11

When actually it should not. F

mhaukness-ucsc commented 3 years ago

One thing that I think might be the source of some of these problems is the format of the external reference gff3 files. In a lot of these examples, it looks like you have different isoforms of the same gene. However, they all have different gene IDs (for example, Hmel.Hmel200210oG1.1 and Hmel.Hmel200210oG1.2). If they are from the same gene, the gene IDs should be the same, and only the transcript IDs should differ.

Is this the case in the original annotations for the subset of genomes? How did you make those? I think if those are changed so that isoforms of the same gene all share the same gene ID, it will help. (Unless they are paralogs in different locations, then the gene IDs should be different!)
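If the IDs follow the pattern in the example above, where the isoform is a trailing numeric suffix, the shared gene ID could be derived as in this sketch. The naming scheme is an assumption drawn from the IDs quoted in this thread, and the function names are hypothetical; paralogs at different locations would still need distinct gene IDs.

```python
from collections import defaultdict


def gene_id_from_transcript(transcript_id):
    """Drop a trailing '.<number>' isoform suffix, if there is one.

    'Hmel.Hmel200210oG1.2' -> 'Hmel.Hmel200210oG1'
    """
    base, sep, suffix = transcript_id.rpartition(".")
    if sep and suffix.isdigit():
        return base
    return transcript_id  # no recognizable isoform suffix; leave unchanged


def group_isoforms(transcript_ids):
    """Map each derived gene ID to the list of its transcript IDs."""
    groups = defaultdict(list)
    for tx in transcript_ids:
        groups[gene_id_from_transcript(tx)].append(tx)
    return dict(groups)
```

Rewriting the gene ID field of the external reference gff3 with the derived value would make both isoforms point at the same gene.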

mhaukness-ucsc commented 3 years ago

Hey @francicco, I just added some of the changes discussed to the "enhancements" branch. Next time you run CAT, could you try switching to that branch and running CAT from there? I don't think this will fix all of the problems (especially if the gene IDs for the isoforms are left as they are now), but I think it will help with some of them. At the very least, the CAT gff3 should now be usable as input to your next runs without further modification.

Let me know if you need any help getting the branch to run, and if you encounter any errors along the way.

francicco commented 3 years ago

Some updates on my side:

This is how I'm currently executing CAT

luigi --module cat RunCat \
        --hal=$OUTPUTDIR/$OUTHAL.hal --ref-genome=$CATREFERENCE --workers=$THREADS \
        --config=$OUTPUTDIR/$CONFIG.$CATREFERENCE.CAT.conf --binary-mode=local \
        --work-dir $CATOUTDIR --workDir $CATOUTDIR  --filter-overlapping-genes \
        --out-dir $CATOUTDIR --disableCaching --local-scheduler --rebuild-consensus \
        --augustus --augustus-species Heliconius_melpomene2.5 $EXTRAOPTIONS \
        --assembly-hub --rebuild-consensus --cleanWorkDir=never --augustus

And these are the number of loci per species after the first iteration:

Screenshot 2021-05-21 at 10 08 50

It looks relatively good apart from one species, an outgroup for which the annotation is not mine. For that species the #loci goes from 23335 to 16188. I'm running the second iteration to check how the trend goes.

Let me know when I can test the new branch.

Cheers F

francicco commented 3 years ago

Considering the BUSCO genes and measuring only the missing ones, comparing the input with the first-iteration CAT output: 44 out of the 63 species have more missing genes, only 14 improve (fewer missing genes), and 5 stay the same. This is not so good. :(
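The comparison itself is simple bookkeeping. As a sketch, assuming the missing-BUSCO counts per species are already collected into two dictionaries (the function name and the dictionary layout are made up for illustration):

```python
def compare_missing(before, after):
    """Classify species by how their missing-BUSCO count changed.

    `before` and `after` map species name -> number of missing BUSCO genes.
    Species absent from either dictionary are skipped.
    """
    summary = {"more_missing": [], "fewer_missing": [], "unchanged": []}
    for species in sorted(before):
        if species not in after:
            continue
        delta = after[species] - before[species]
        if delta > 0:
            summary["more_missing"].append(species)
        elif delta < 0:
            summary["fewer_missing"].append(species)
        else:
            summary["unchanged"].append(species)
    return summary
```

The lengths of the three lists correspond to the 44 / 14 / 5 split reported above, and the lists themselves show which species to look at first.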

F

francicco commented 3 years ago

I'll now try the "enhancements" branch. Thanks a lot F

francicco commented 3 years ago

The results are exactly the same. F

mhaukness-ucsc commented 3 years ago

Hmm, that's a bit surprising. Do the new gff3 files work as valid input for CAT, at least? (Do they pass the validate_gff3 script under the programs directory?) If you have an assembly hub link, I can look at some of the same examples from before, too.

I'd try without including the filter-overlapping-genes flag. I also changed a part of the filter-transmap stage, so to see the effects of that change you would have to rerun from the beginning (not just rebuild-consensus). I'm not sure how much that would affect your results, but it helped increase the number of paralogs found in some of my runs on humans.

I'd try again with a command like this?

luigi --module cat RunCat \
        --hal=$OUTPUTDIR/$OUTHAL.hal --ref-genome=$CATREFERENCE --workers=$THREADS \
        --config=$OUTPUTDIR/$CONFIG.$CATREFERENCE.CAT.conf --binary-mode=local \
        --work-dir $CATOUTDIR --workDir $CATOUTDIR \
        --out-dir $CATOUTDIR --disableCaching --local-scheduler \
        --augustus --augustus-species Heliconius_melpomene2.5 $EXTRAOPTIONS \
        --assembly-hub --cleanWorkDir=never

francicco commented 3 years ago

I executed the command you sent me... but apparently it is not rewriting the consensus files...

WARNING: 2021-05-21 18:03:22,668 - No extrinsic data found in config. Will load genomes and annotation only.
WARNING: 2021-05-21 18:03:55,750 - No extrinsic data found in config. Will load genomes and annotation only.
INFO: 2021-05-21 18:04:28,795 - Informed scheduler that task   RunCat_False_True_True_ff0ead534b   has status   DONE
INFO: 2021-05-21 18:04:28,795 - Done scheduling tasks
INFO: 2021-05-21 18:04:28,795 - Running Worker with 64 processes
INFO: 2021-05-21 18:04:28,796 - Worker Worker(salt=266894297, workers=64, host=bp1-login01.data.bp.acrc.priv, username=tk19812, pid=70070) was stopped. Shutting down Keep-Alive thread
INFO: 2021-05-21 18:04:28,797 - 
===== Luigi Execution Summary =====

Scheduled 1 tasks of which:
* 1 complete ones were encountered:
    - 1 RunCat(...)

Did not run any tasks
This progress looks :) because there were no failed tasks or missing dependencies

===== Luigi Execution Summary =====

F

mhaukness-ucsc commented 3 years ago

Ah, yeah it looks like it saw that everything was already finished running from your previous run. If you're willing to run things from scratch you could give it a new work-dir / out-dir and it would start from the beginning. You could try annotating only a small subset of the genomes so it doesn't take long to run, just while we are getting this figured out!

francicco commented 3 years ago

But then I guess I'd need a new genome alignment... right? F

mhaukness-ucsc commented 3 years ago

You don't need a new alignment, you can use the one you have! CAT can annotate a subset of the genomes present in the alignment with the --target-genomes parameter (so you could pass in something like --target-genomes='("Hmel","Eisa","Dpha")', or whichever other genomes you choose)
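When the genome list is built in a script, the quoting is easy to get wrong. Here is a small sketch that produces the same form as the example above; the helper names are hypothetical, and only the multi-genome form shown in this thread is covered.

```python
import shlex


def target_genomes_value(genomes):
    """Build the tuple-style value, e.g. ("Hmel","Eisa","Dpha")."""
    return "(" + ",".join(f'"{name}"' for name in genomes) + ")"


def target_genomes_flag(genomes):
    """Build the full flag, shell-quoted for pasting into a command line."""
    return "--target-genomes=" + shlex.quote(target_genomes_value(genomes))
```

When the command is launched through `subprocess.run` with an argument list rather than a shell string, the unquoted value from `target_genomes_value` is the one to use, since no shell is involved.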

francicco commented 3 years ago

Cool! Didn't know that F

francicco commented 3 years ago

I'm getting this:

ERROR: 2021-05-21 19:24:44,601 - Got exit code 1 (indicating failure) from job _toil_worker JobFunctionWrappingJob file:/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/63Nymphalidae.Cactus.Cactus.Eisa.withCAT.Eisa.TransMap.outDir/toil/chaining/jobStore kind-JobFunctionWrappingJob/W/instance-mpb_6bpk.
WARNING: 2021-05-21 19:24:44,601 - Job failed with exit value 1: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
WARNING: 2021-05-21 19:24:44,604 - The job seems to have left a log file, indicating failure: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
WARNING: 2021-05-21 19:24:44,604 - Log from job kind-JobFunctionWrappingJob/W/instance-mpb_6bpk follows:
=========>
        [2021-05-21T19:24:43+0100] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
        [2021-05-21T19:24:43+0100] [MainThread] [I] [toil] Running Toil version 5.0.0-f182c6420554b258632a40bfa47a8f69e56675e4 on host bp1-compute00194.data.bp.acrc.priv.
        [2021-05-21T19:24:43+0100] [MainThread] [I] [toil.worker] Working on job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
        [2021-05-21T19:24:43+0100] [MainThread] [I] [luigi-interface] Loaded ['luigi.cfg']
        [2021-05-21T19:24:44+0100] [MainThread] [I] [toil.worker] Loaded body Job('JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk) from description 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
        Traceback (most recent call last):
          File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/worker.py", line 380, in workerScript
            with fileStore.open(job):
          File "/sw/lang/anaconda.3.8-2020.07/lib/python3.8/contextlib.py", line 113, in __enter__
            return next(self.gen)
          File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/fileStores/nonCachingFileStore.py", line 54, in open
            self._removeDeadJobs(self.workDir)
          File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/fileStores/nonCachingFileStore.py", line 186, in _removeDeadJobs
            if not process_name_exists(nodeInfo, jobState['jobProcessName']):
          File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/lib/threading.py", line 280, in process_name_exists
            nameFD = os.open(nameFileName, os.O_RDONLY)
        FileNotFoundError: [Errno 2] No such file or directory: '/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/63Nymphalidae.Cactus.Cactus.Eisa.withCAT.Eisa.TransMap.outDir/node-ea937378-b401-4cca-b439-5306eb59a1f1-4bf74fb8-097a-4c81-8fa8-dbc02029071e/tmpbo_gizxw'
        [2021-05-21T19:24:44+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host bp1-compute00194.data.bp.acrc.priv
<=========

francicco commented 3 years ago

WARNING: 2021-05-22 10:02:43,026 - Will not run Task: FindDenovoParents for exRef or any dependencies due to error in complete() method:
Traceback (most recent call last):
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/luigi/worker.py", line 401, in check_complete
    is_complete = task.complete()
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/luigi/task.py", line 563, in complete
    outputs = flatten(self.output())
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/luigi/task.py", line 883, in flatten
    for result in iterator:
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1697, in output
    denovo_args = FindDenovoParents.get_args(pipeline_args, self.mode)
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1661, in get_args
    filtered_tm_gp_files = {genome: TransMap.get_args(pipeline_args, genome).filtered_tm_gp
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1661, in <dictcomp>
    filtered_tm_gp_files = {genome: TransMap.get_args(pipeline_args, genome).filtered_tm_gp
  File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1200, in get_args
    args.chain_file = Chaining.get_args(pipeline_args).chain_files[genome]
KeyError: 'Hher'
===== Luigi Execution Summary =====

Scheduled 70 tasks of which:
* 12 complete ones were encountered:
    - 1 BuildDb(...)
    - 1 Chaining(...)
    - 2 FilterTransMap(...)
    - 1 GenomeFiles(...)
    - 1 PrepareFiles(...)
    ...
* 21 ran successfully:
    - 3 AlignTranscriptDriverTask(...)
    - 1 AlignTranscripts(...)
    - 1 Augustus(...)
    - 3 AugustusDriverTask(...)
    - 3 EvaluateDriverTask(...)
    ...
* 1 failed scheduling:
    - 1 FindDenovoParents(...)
* 36 were left pending, among these:
    * 4 had dependencies whose scheduling failed:
        - 3 ConsensusDriverTask(...)
        - 1 HgmDriverTask(...)
    * 32 was not granted run permission by the scheduler:
        - 1 AssemblyHub(...)
        - 7 BgpTrack(...)
        - 1 Consensus(...)
        - 3 ConsensusTrack(...)
        - 1 CreateDirectoryStructure(...)
        ...

This progress looks :( because there were tasks whose scheduling failed

===== Luigi Execution Summary =====

mhaukness-ucsc commented 3 years ago

Not sure immediately what the problem is, what was your command for CAT? Was Hher one of the genomes you are annotating in your subset or not? Did you change the config file at all? Maybe try removing the annotations for the genomes you are not using from the config?
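Pruning the config by hand is error-prone with 63 genomes, so it could be scripted. The sketch below makes several assumptions: that the config is plain INI that Python's `configparser` can read, that the annotations live in a section named `ANNOTATION` with one `genome = path` entry each, and that the function name is made up. The reference genome must be included in `keep_genomes`.

```python
import configparser
import io


def prune_annotation_config(config_text, keep_genomes, section="ANNOTATION"):
    """Return the config text with only `keep_genomes` left in `section`."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep genome names case-sensitive
    parser.read_string(config_text)
    if parser.has_section(section):
        for genome in list(parser[section]):
            if genome not in keep_genomes:
                parser.remove_option(section, genome)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Writing the returned text to a new file, and pointing --config at it, leaves the original config untouched for the full run later.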

francicco commented 3 years ago

The command line was the one you suggested; Hher wasn't in the target genomes. I can try changing the conf file. F

francicco commented 3 years ago

Ok, Done.

I see some improvement, but still... [Hmel201001o:3742294-3764583]

Screenshot 2021-05-24 at 12 15 34

[Hmel201001o:3,952,947-3,960,869]

Screenshot 2021-05-24 at 12 19 08

[Hmel201001o:3,983,949-3,996,848]

Screenshot 2021-05-24 at 12 21 39

Indels are still there... F

mhaukness-ucsc commented 3 years ago

These results do look quite a bit better to me, at least in these examples. The indels in the transMap are getting cleaned up in the consensus CAT annotation. Do you think this looks good enough to continue on with the rest of the genomes?

francicco commented 3 years ago

Yes, it looks good enough. There's almost no gene left from the external_reference annotation. I'm currently running the rest of the genomes.

I don't quite understand when the indels will be removed, I'm in your hands about that.

What's left now? Do you want to implement a better selection between transMap and the external_reference?

Thanks a lot! F

francicco commented 3 years ago

In this plot you can see the level of gene completeness. The x-axis is the percentage of the target (Uniprot) hit that is aligned; 100% means the gene is putatively complete. The dashed lines are the external_reference, the continuous lines the CAT annotation. You can see that CAT generally does not select the best-annotated genes: its curves are lower than those of the external_reference.
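For reference, the per-hit percentage can be computed from the subject coordinates of an alignment. This sketch is not necessarily how the values in the plot were produced: it assumes the subject start, subject end and subject length are available for each hit (the length is not part of the default tabular BLAST columns), and the function name is hypothetical.

```python
def pct_hit_covered(sstart, send, slen):
    """Percentage of the target (subject) sequence spanned by the alignment.

    Coordinates are 1-based and inclusive; start and end may be given in
    either order.
    """
    lo, hi = sorted((sstart, send))
    return 100.0 * (hi - lo + 1) / slen
```

A 300-residue target aligned over positions 51 to 200 gives 50%, so a gene model covering only half of its best hit would land in the middle of the x-axis.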

Screenshot 2021-05-24 at 9 40 03 PM

Uniprot.AllspeciesCATcomparison.outfmt6.w_pct_hit_length.pdf

F

mhaukness-ucsc commented 3 years ago

I have some ideas for improving the selection between transMap and external reference that I'm working on that should improve the gene completeness. Essentially I want to add an option to tell CAT to prefer the external reference transcripts over transMap, and you would turn it on in cases like yours where the external reference is an annotation from the same species and the reference annotation being transMapped is from a different species. I'll let you know if/when I get something working that I'm happy with!

francicco commented 3 years ago

Wouldn't it be possible to apply some kind of score and select the best 2 or 3? F

mhaukness-ucsc commented 3 years ago

Kind of, the problem is making sure the parent gene/transcript assignments make sense. There can't be multiple things with the same transcriptID...

One other flag you can try is --denovo-allow-bad-annot-or-tm, which will add additional predictions that overlapped multiple genes. You can add this with the rebuild-consensus option (so you don't need to run the whole pipeline from scratch).