Open francicco opened 3 years ago
Yeah... maybe... :/ F
Hi @mhaukness-ucsc
I wanted to share this plot with you:
This plot shows the number of genes per species across the different iterations. Iteration 0 shows the number of genes before I started CAT. It seems that some species increased their annotated genes after the 1st iteration, while more or less all of them decreased their total after the 2nd. Some dropped significantly. The other thing is that the total numbers seem to converge.
I'm not sure if this is expected and a good thing. Any thoughts? Cheers F
Hmm, this is a bit unexpected to me, I don't think the number of genes should drop so low after the second iteration. How were you using the results of iteration 1 in the next? What did your config files for CAT look like for each round?
I converted the gff3 into a digestible gff3 format for CAT, checking them with validate_gff3, in the same way I generated the first iteration gff3 file. The conf file is just species = path/to/the/new/gff3. Nothing fancy. F
I'm not sure what's going on, maybe looking at your data would help. Would it be possible to find an example of a gene in one of your species that was present after the first iteration but lost in the second round? And then share browser screenshots of that region in the genome for both rounds? (With all of the tracks under the "Comparative Annotation Toolkit" section turned on? )
I took a few days off. As soon as I'm back I'll check this. Thanks a lot F
Here are some examples:
And many more. F
The other thing I noticed is that the gff3 output of CAT is sometimes not formatted correctly.
In this case, for example, the stop_codon feature is missing.
Hmel200216o CAT transcript 6148 8027 9140 + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcr
Hmel200216o CAT exon 6148 6341 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT CDS 6240 6341 . + 0 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT start_codon 6240 6242 . + 0 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcr
Hmel200216o CAT intron 6342 6417 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT exon 6418 6529 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT CDS 6418 6529 . + 0 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT intron 6530 6597 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT exon 6598 6791 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT CDS 6598 6791 . + 2 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT intron 6792 7013 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT exon 7014 7242 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT CDS 7014 7242 . + 0 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT intron 7243 7518 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT exon 7519 7639 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT CDS 7519 7639 . + 2 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT intron 7640 7742 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT exon 7743 8027 . + . source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
Hmel200216o CAT CDS 7743 7920 . + 1 source_transcript=Eisa.Eisa1800G430.1;source_transcript_name=Eisa.Eisa1800G430.1.zfpl1;source_gene=Eisa.Eisa1800G430;transcript_mode
This created a CDS error in one of my scripts. I'm not sure if this is the source of the problem. I'll fix it and rerun it.
What bothers me a bit is that the gff3 output of CAT can't be used as input in the following iterations, so I have to readjust it.
Cheers F
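To see how widespread the missing stop_codon records are, one option is to list every transcript that has a CDS but lacks a start_codon or stop_codon feature. This is a minimal Python sketch, not part of CAT. It assumes child features point to their transcript through a standard GFF3 `Parent` attribute; the sample lines above are truncated, so that attribute name is an assumption and can be changed through `parent_key`.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def transcripts_missing_codons(gff3_lines, parent_key="Parent"):
    """Return {transcript_id: [missing feature types]} for coding transcripts.

    parent_key is an assumption about how children reference their transcript.
    """
    feature_types = defaultdict(set)
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed records instead of guessing
        parents = parse_attributes(cols[8]).get(parent_key)
        if parents is None:
            continue
        # GFF3 allows several comma-separated parents per feature
        for parent in parents.split(","):
            feature_types[parent].add(cols[2])

    missing = {}
    for transcript_id, types in feature_types.items():
        if "CDS" not in types:
            continue  # non-coding transcripts are not expected to have codons
        absent = [t for t in ("start_codon", "stop_codon") if t not in types]
        if absent:
            missing[transcript_id] = absent
    return missing
```

Calling `transcripts_missing_codons(open("consensus.gff3"))` (file name hypothetical) gives a dictionary that a downstream script can use to skip or flag incomplete models rather than fail on them.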
Thanks for providing these examples! This is helpful for trying to improve the features of CAT. There are a few things going on here...
1) For the examples that have something in the Filtered transMap track but not the CAT Annotation, I think there is a bug. I encountered something similar in a run of my own a couple weeks ago, so I have opened an issue to track this (#259) and I will figure out what's going on.
2) Do you have the --filter-overlapping-genes flag on like before? If so it might be filtering out useful things, like when a tiny portion of a gene is caught by transMap and that is chosen over the full-length annotation from the external reference. One option could be to try rerunning the pipeline (at least the second iteration) with this flag off and see if it helps? It would have the same issues as before with the overlapping genes but it might save a good amount of transcripts. Then you could do a final filtering to get rid of the overlapping genes at the very end of all your rounds.
3) In general there are some things I want to improve about how an external reference is used. For your set specifically I think it would be good to prefer the long external reference annotations over the short transMaps (if you don't allow for overlapping genes). I'd also like it to track gene IDs better. One other thing I'd like to verify is if the short transMaps correspond to the long external reference. If you look at the first example, is the short transcript in the filtered transMap the same gene as the one in the external reference?
4) For the gff3, I'm guessing the transcript doesn't have a valid stop codon? So the start codon makes it into the gff3 but not the stop. I can see that there aren't equal numbers of start and stop codons in some other gff3s I have. I can't think of a good reason to change this in the gff3 -- maybe cases like this can be handled in the downstream script? However, I do agree that the output of CAT should be valid input -- I have another issue for this (#241) (oh and also #246) and I'll prioritize fixing this too. Have you observed any additional issues with the gff3? (Other than some records not having gene_name and transcript_name, like I mentioned in that issue.)
I'll let you know if/when I have fixes for the bugs mentioned!
Thanks a lot @mhaukness-ucsc!
I also have indels:
Which are really hard to handle. I don't know what to do with these. A better option would be to not produce them if the user doesn't want them. I could try to skip pseudo-introns (indels), assuming they are smaller than some threshold X (?).
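The idea of skipping pseudo-introns below a size threshold can be sketched as an interval merge over the exon coordinates. This is only an illustration in Python, not something CAT does: the function name and the default threshold of 10 bp are hypothetical, and coordinates are taken to be 1-based and inclusive as in GFF3.

```python
def merge_small_gaps(exons, max_gap=10):
    """Merge exons separated by a gap of at most max_gap bases.

    exons: iterable of (start, end) tuples, 1-based inclusive.
    max_gap: hypothetical threshold standing in for the "X" discussed above.
    """
    merged = []
    for start, end in sorted(exons):
        if merged:
            prev_start, prev_end = merged[-1]
            gap = start - prev_end - 1  # bases strictly between the two exons
            if gap <= max_gap:
                # Treat the gap as an indel and extend the previous exon
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return merged
```

Real introns in the example above (for instance 6342-6417, 76 bp) stay untouched with a small threshold, while a 3 bp gap would be absorbed into a single exon. Merging changes the transcript length, so the CDS and its phases would have to be recomputed afterwards.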
Also, some transcripts don't have a start_codon, and the beginning of the CDS coincides with the beginning of the transcript, which sometimes does not correspond to the first codon position of the real reading frame.
Yes, --filter-overlapping-genes is on.
How should I proceed? Thanks a lot for your precious help! F
I also have indels:
The intention of Augustus is to convert those kinds of gaps, caused by evolutionary changes, into valid gene models.
Yes, running Augustus would help improve these gene models. I know you've tried AugustusPB and AugustusCGP, did you ever try AugustusTM?
This looks like a case where the genome that was used to make your external reference was better than the current reference being used, but since you have filter-overlapping-genes on, the transMap gets chosen (even though the external reference has a better alignment).
In the meantime while I work on the other bugs mentioned, you could try turning that flag off, as well as running CAT with AugustusTM on (if you're able to get it working, it should be easier than the other Augustus types).
Ok, I'll run --AugustusTM, and see what happens.
I can add it to a finished job, right?
F
You may be able to add the --augustus flag to your command and have it use some of the intermediate work files, but it will have to run a lot of things over again.
--augustus is for AugustusTM?
yes!
The run seems to be over, but for some reason the Augustus track is not present in the hub.
Here is a screenshot for the locus Hmel200210o:16,084-30,771; the indels still seem to be there.
Here is a close-up:
I could run it from scratch, but it'll take 4,5 days... F
It looks like augustusTM did run; you can see the gene models it predicted in the augTM track. The predictions might not have been better than TransMap, so they did not make it into the final consensus set. You can check how many transcripts came from augustusTM by counting how many have augTM as a transcript mode, and this data is also visible in the output plot transcript_modes.pdf. Can you share that plot?
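Counting transcript modes directly from the consensus gff3 can be sketched in a few lines of Python. The sample records earlier in the thread are truncated right at the `transcript_mode` attribute, so the exact key is not certain; this sketch matches any attribute key starting with `transcript_mode` and assumes that several modes, if present, are comma-separated. Both points are assumptions.

```python
from collections import Counter


def count_transcript_modes(gff3_lines, key_prefix="transcript_mode"):
    """Count transcript-mode values on 'transcript' records of a GFF3 file.

    key_prefix is an assumption: the attribute name is cut off in the
    sample output, so any key beginning with this prefix is accepted.
    """
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue  # only transcript-level records are counted
        for field in cols[8].split(";"):
            key, _, value = field.partition("=")
            if key.strip().startswith(key_prefix) and value:
                # assumed: multiple modes are joined with commas
                counts.update(mode for mode in value.split(",") if mode)
    return counts
```

The resulting `Counter` should roughly mirror the bars of transcript_modes.pdf for one genome, for example `counts["augTM"]` for the transcripts that Augustus contributed to.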
The indels are still coming from the transMap (the new reference being used in this round is probably not as great of a match for this species as the last one). I think that your external reference should be incorporated into the final annotation. Is filter-overlapping-genes still on? If so, could you try this again without it and see if those predictions make it in?
I added the augTM manually.
Is there anywhere I can change the size of the labels?
filter-overlapping-genes is still on. I'll try without.
The indels are still coming from the transMap (the new reference being used in this round is probably not as great of a match for this species as the last one). I think that your external reference should be incorporated into the final annotation.
I didn't get this point F
Oh hmm, I don't know why the AugustusTM track wouldn't be showing up in the hub (the hub should rebuild during your run... try reloading it from scratch into the browser?) But at least you can see the amount of green TM transcripts in the plot. There is no easy way to change the labels but it would be good to auto-adjust the size of the labels based on the number of genomes you have, I'll try to work that into a future update.
As for the indels, I was trying to say that the indel is likely "real" relative to the annotation in whatever species you are using as a reference. So it shows up in the transMap alignment, and because filter-overlapping-genes is on, only one gene can make it into the annotation, which is the one with the indel. So my idea is that if that flag is turned off, things from the external reference will also be added to the annotation. You'd have the gene from transMap with the indel, and the gene from the external reference without. Then at the end of your rounds of annotation, all of the conflicts could be resolved to select only the best gene for each locus.
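The final conflict-resolution step could look something like the following greedy selection. It is a sketch under stated assumptions, not CAT's own filtering: each gene is a plain dict, the `score` field is a hypothetical quality measure (for example alignment coverage or a completeness score), and two genes are treated as conflicting only when they overlap on the same sequence and strand.

```python
def overlaps(a, b):
    """True if two gene records overlap on the same sequence and strand."""
    return (
        a["chrom"] == b["chrom"]
        and a["strand"] == b["strand"]
        and a["start"] <= b["end"]
        and b["start"] <= a["end"]
    )


def pick_best_per_locus(genes):
    """Greedy selection: visit genes from best to worst score and keep a
    gene only if it does not overlap one that was already kept."""
    kept = []
    for gene in sorted(genes, key=lambda g: g["score"], reverse=True):
        if not any(overlaps(gene, other) for other in kept):
            kept.append(gene)
    # Return in genomic order for easier downstream writing
    return sorted(kept, key=lambda g: (g["chrom"], g["start"]))
```

With this kind of rule a short transMap fragment only survives if nothing better-scoring covers the same locus, which is the behaviour wanted for the examples in this thread. How the score is defined decides everything, so it would need checking against a few known loci first.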
Here is the run without filter-overlapping-genes: link
and here is a screenshot:
The indels are still there:
and the new plot transcript_modes.pdf
One way to manage the indels is to remove them, generating a continuous exon, and annotate the CDS on it
This is the same indel treated in this way:
In black the external_reference, in light blue the reannotated CDS
Here is a zoomed-out view:
I just don't know if that would be biologically correct. F
Thanks for the browser link, I looked through it and it helped me get a better understanding of the types of situations that are popping up. This particular example does look like a tricky case, because the external reference doesn't have the indel, but it does have a different splice site in the exon before. Maybe that is causing the externalReference scores to be worse so they don't end up getting incorporated.
I'll trace through this one particular example to see what is happening with the scoring and why CAT chooses to not incorporate the predictions from the external reference. Maybe there will be a fix I can make to solve this. (One idea is to add an option to treat the existing annotations for the references that have them preferentially over TransMap?) I've made a couple small changes to the code on another branch, so I'll keep adding to it and have you test it out once I think it will actually improve your results.
It's hard to say what is correct in regards to the biology. Most likely this frameshift isn't real (it causes an in frame premature stop!), and maybe that's because the assembly has an error here. Maybe reannotating the CDS like you showed is the best option...
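Whether a reannotated CDS still carries an in-frame premature stop is easy to check once the spliced CDS sequence has been extracted. This is a minimal Python sketch, assuming the standard genetic code and a CDS given 5' to 3' on the coding strand; the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def premature_stops(cds, phase=0):
    """Return 0-based offsets of in-frame stop codons before the last codon.

    cds: spliced coding sequence, 5' to 3' on the coding strand.
    phase: number of leading bases to skip to reach the first full codon.
    """
    seq = cds.upper()[phase:]
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # The final codon is allowed (and expected) to be a stop
    return [
        phase + 3 * index
        for index, codon in enumerate(codons[:-1])
        if codon in STOP_CODONS
    ]
```

An empty list for the merged-exon model and a non-empty one for the transMap model with the indel would support the reading that the frameshift is an artifact, though it does not prove that the assembly is wrong at that position.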
This is another example where the transMap is favored over the reference
when it actually should not be. F
One thing that I think might be the source of some of these problems is the format of the external reference gff3 files. In a lot of these examples, it looks like you have different isoforms of the same gene. However, they all have different gene IDs (for example, Hmel.Hmel200210oG1.1 and Hmel.Hmel200210oG1.2). If they are from the same gene, the gene IDs should be the same, and only the transcript IDs should differ.
Is this the case in the original annotations for the subset of genomes? How did you make those? I think if those are changed so that isoforms of the same gene all share the same gene ID, it will help. (Unless they are paralogs in different locations, then the gene IDs should be different!)
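A quick way to find candidates for this problem is to strip the trailing isoform number from each gene ID and report the cases where several gene IDs collapse onto the same base ID. This is a Python sketch that assumes the naming pattern seen in this thread (a final `.N` suffix marks the isoform); other annotation sets may need a different pattern.

```python
import re
from collections import defaultdict

# Assumption: isoforms are numbered with a trailing ".<digits>" suffix,
# as in Hmel.Hmel200210oG1.1 and Hmel.Hmel200210oG1.2
ISOFORM_SUFFIX = re.compile(r"\.\d+$")


def base_gene_id(identifier):
    """Remove a trailing isoform number, if there is one."""
    return ISOFORM_SUFFIX.sub("", identifier)


def split_isoform_groups(gene_ids):
    """Return {base_id: [gene IDs]} for base IDs that appear under more
    than one gene ID, i.e. likely isoforms annotated as separate genes."""
    groups = defaultdict(set)
    for gene_id in gene_ids:
        groups[base_gene_id(gene_id)].add(gene_id)
    return {base: sorted(ids) for base, ids in groups.items() if len(ids) > 1}
```

The groups it reports are only candidates: paralogs at different locations should keep distinct gene IDs, so coordinates need to be checked before rewriting anything.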
Hey @francicco, I just added some of the changes discussed to the "enhancements" branch. Next time you run CAT, could you try switching to that branch and running CAT from there? I don't think this will fix all of the problems (especially if the gene IDs for the isoforms stay as they are now) but I think it will help with some of them. At the very least, the CAT gff3 should now be usable as input to your next runs without further modification.
Let me know if you need any help getting the branch to run, and if you encounter any errors along the way.
Some updates on my side:
This is how I'm currently executing CAT
luigi --module cat RunCat \
--hal=$OUTPUTDIR/$OUTHAL.hal --ref-genome=$CATREFERENCE --workers=$THREADS \
--config=$OUTPUTDIR/$CONFIG.$CATREFERENCE.CAT.conf --binary-mode=local \
--work-dir $CATOUTDIR --workDir $CATOUTDIR --filter-overlapping-genes \
--out-dir $CATOUTDIR --disableCaching --local-scheduler --rebuild-consensus \
--augustus --augustus-species Heliconius_melpomene2.5 $EXTRAOPTIONS \
--assembly-hub --rebuild-consensus --cleanWorkDir=never --augustus
And these are the number of loci per species after the first iteration:
It looks relatively good apart from one species, an outgroup for which the annotation is not mine. For that species the #loci goes from 23335 to 16188. I'm running the second iteration to check how the trend goes.
Let me know what I can test with the new branch.
Cheers F
Considering the BUSCO genes, and measuring only the missing genes, I compared the input with the CAT output of the first iteration: 44 out of the 63 species have more missing genes, only 14 improve (fewer missing genes), and 5 stay the same. This is not so good. :(
F
I'll now try the "enhancements" branch Thanks a lot F
The results are exactly the same. F
Hmm, that's a bit surprising. Do the new gff3 files work as valid input for CAT, at least? (Do they pass the validate_gff3 script under the programs directory?) If you have an assembly hub link, I can look at some of the same examples from before, too.
I'd try without including the filter-overlapping-genes flag. I also changed a part of the filter-transmap stage, so to see the effects of that change you would have to rerun from the beginning (not just rebuild-consensus). I'm not sure how much that would affect your results, but it helped increase the number of paralogs found in some of my runs on humans.
I'd try again with a command like this?
luigi --module cat RunCat \
--hal=$OUTPUTDIR/$OUTHAL.hal --ref-genome=$CATREFERENCE --workers=$THREADS \
--config=$OUTPUTDIR/$CONFIG.$CATREFERENCE.CAT.conf --binary-mode=local \
--work-dir $CATOUTDIR --workDir $CATOUTDIR \
--out-dir $CATOUTDIR --disableCaching --local-scheduler \
--augustus --augustus-species Heliconius_melpomene2.5 $EXTRAOPTIONS \
--assembly-hub --cleanWorkDir=never
I executed the command you sent me... but apparently it is not rewriting the consensus files...
WARNING: 2021-05-21 18:03:22,668 - No extrinsic data found in config. Will load genomes and annotation only.
WARNING: 2021-05-21 18:03:55,750 - No extrinsic data found in config. Will load genomes and annotation only.
INFO: 2021-05-21 18:04:28,795 - Informed scheduler that task RunCat_False_True_True_ff0ead534b has status DONE
INFO: 2021-05-21 18:04:28,795 - Done scheduling tasks
INFO: 2021-05-21 18:04:28,795 - Running Worker with 64 processes
INFO: 2021-05-21 18:04:28,796 - Worker Worker(salt=266894297, workers=64, host=bp1-login01.data.bp.acrc.priv, username=tk19812, pid=70070) was stopped. Shutting down Keep-Alive thread
INFO: 2021-05-21 18:04:28,797 -
===== Luigi Execution Summary =====
Scheduled 1 tasks of which:
* 1 complete ones were encountered:
- 1 RunCat(...)
Did not run any tasks
This progress looks :) because there were no failed tasks or missing dependencies
===== Luigi Execution Summary =====
F
Ah, yeah it looks like it saw that everything was already finished running from your previous run. If you're willing to run things from scratch you could give it a new work-dir / out-dir and it would start from the beginning. You could try annotating only a small subset of the genomes so it doesn't take long to run, just while we are getting this figured out!
But then I guess I'd need a new genome alignment... right? F
You don't need a new alignment, you can use the one you have! CAT can annotate a subset of the genomes present in the alignment with the --target-genomes parameter (so you could pass in something like --target-genomes='("Hmel","Eisa","Dpha")', or whichever other genomes you choose).
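When the subset changes often, the quoted tuple for --target-genomes can be generated instead of typed by hand. This is a small Python sketch; the helper name is hypothetical and it only reproduces the format shown above, without checking that the genome names exist in the alignment.

```python
import shlex


def format_target_genomes(genomes):
    """Build the tuple-style value, e.g. ("Hmel","Eisa","Dpha")."""
    return "(" + ",".join(f'"{name}"' for name in genomes) + ")"


def target_genomes_argument(genomes):
    """Return a shell-safe '--target-genomes=...' argument string."""
    return "--target-genomes=" + shlex.quote(format_target_genomes(genomes))


if __name__ == "__main__":
    # Genome names taken from the example above
    print(target_genomes_argument(["Hmel", "Eisa", "Dpha"]))
```

The printed string can be pasted into the luigi command or stored in a shell variable such as the $EXTRAOPTIONS used earlier in the thread.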
Cool! Didn't know that F
I'm getting this:
ERROR: 2021-05-21 19:24:44,601 - Got exit code 1 (indicating failure) from job _toil_worker JobFunctionWrappingJob file:/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/63Nymphalidae.Cactus.Cactus.Eisa.withCAT.Eisa.TransMap.outDir/toil/chaining/jobStore kind-JobFunctionWrappingJob/W/instance-mpb_6bpk.
WARNING: 2021-05-21 19:24:44,601 - Job failed with exit value 1: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
WARNING: 2021-05-21 19:24:44,604 - The job seems to have left a log file, indicating failure: 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
WARNING: 2021-05-21 19:24:44,604 - Log from job kind-JobFunctionWrappingJob/W/instance-mpb_6bpk follows:
=========>
[2021-05-21T19:24:43+0100] [MainThread] [I] [toil.worker] ---TOIL WORKER OUTPUT LOG---
[2021-05-21T19:24:43+0100] [MainThread] [I] [toil] Running Toil version 5.0.0-f182c6420554b258632a40bfa47a8f69e56675e4 on host bp1-compute00194.data.bp.acrc.priv.
[2021-05-21T19:24:43+0100] [MainThread] [I] [toil.worker] Working on job 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
[2021-05-21T19:24:43+0100] [MainThread] [I] [luigi-interface] Loaded ['luigi.cfg']
[2021-05-21T19:24:44+0100] [MainThread] [I] [toil.worker] Loaded body Job('JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk) from description 'JobFunctionWrappingJob' kind-JobFunctionWrappingJob/W/instance-mpb_6bpk
Traceback (most recent call last):
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/worker.py", line 380, in workerScript
with fileStore.open(job):
File "/sw/lang/anaconda.3.8-2020.07/lib/python3.8/contextlib.py", line 113, in __enter__
return next(self.gen)
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/fileStores/nonCachingFileStore.py", line 54, in open
self._removeDeadJobs(self.workDir)
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/fileStores/nonCachingFileStore.py", line 186, in _removeDeadJobs
if not process_name_exists(nodeInfo, jobState['jobProcessName']):
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/toil/lib/threading.py", line 280, in process_name_exists
nameFD = os.open(nameFileName, os.O_RDONLY)
FileNotFoundError: [Errno 2] No such file or directory: '/work/tk19812/HeliconiniiProject/HeliconGenomeAlignmentAnnotation/63Nymphalidae.Cactus.Cactus.Eisa.withCAT.Eisa.TransMap.outDir/node-ea937378-b401-4cca-b439-5306eb59a1f1-4bf74fb8-097a-4c81-8fa8-dbc02029071e/tmpbo_gizxw'
[2021-05-21T19:24:44+0100] [MainThread] [E] [toil.worker] Exiting the worker because of a failed job on host bp1-compute00194.data.bp.acrc.priv
<=========
WARNING: 2021-05-22 10:02:43,026 - Will not run Task: FindDenovoParents for exRef or any dependencies due to error in complete() method:
Traceback (most recent call last):
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/luigi/worker.py", line 401, in check_complete
is_complete = task.complete()
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/luigi/task.py", line 563, in complete
outputs = flatten(self.output())
File "/work/tk19812/software/Comparative-Annotation-Toolkit/venv/lib/python3.8/site-packages/luigi/task.py", line 883, in flatten
for result in iterator:
File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1697, in output
denovo_args = FindDenovoParents.get_args(pipeline_args, self.mode)
File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1661, in get_args
filtered_tm_gp_files = {genome: TransMap.get_args(pipeline_args, genome).filtered_tm_gp
File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1661, in <dictcomp>
filtered_tm_gp_files = {genome: TransMap.get_args(pipeline_args, genome).filtered_tm_gp
File "/work/tk19812/software/Comparative-Annotation-Toolkit/cat/__init__.py", line 1200, in get_args
args.chain_file = Chaining.get_args(pipeline_args).chain_files[genome]
KeyError: 'Hher'
===== Luigi Execution Summary =====
Scheduled 70 tasks of which:
* 12 complete ones were encountered:
- 1 BuildDb(...)
- 1 Chaining(...)
- 2 FilterTransMap(...)
- 1 GenomeFiles(...)
- 1 PrepareFiles(...)
...
* 21 ran successfully:
- 3 AlignTranscriptDriverTask(...)
- 1 AlignTranscripts(...)
- 1 Augustus(...)
- 3 AugustusDriverTask(...)
- 3 EvaluateDriverTask(...)
...
* 1 failed scheduling:
- 1 FindDenovoParents(...)
* 36 were left pending, among these:
* 4 had dependencies whose scheduling failed:
- 3 ConsensusDriverTask(...)
- 1 HgmDriverTask(...)
* 32 was not granted run permission by the scheduler:
- 1 AssemblyHub(...)
- 7 BgpTrack(...)
- 1 Consensus(...)
- 3 ConsensusTrack(...)
- 1 CreateDirectoryStructure(...)
...
This progress looks :( because there were tasks whose scheduling failed
===== Luigi Execution Summary =====
Not sure immediately what the problem is. What was your command for CAT? Was Hher one of the genomes you are annotating in your subset or not? Did you change the config file at all? Maybe try removing the annotations for the genomes you are not using from the config?
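Trimming the config down to the genomes being annotated can be done with a short filter. This is a Python sketch under the assumption that the config consists of `genome = path` lines, as described earlier in the thread; comment lines and section headers, if the file has any, are passed through unchanged.

```python
def filter_config(lines, keep):
    """Drop 'genome = path' entries whose genome is not in keep.

    lines: iterable of config lines.
    keep: set of genome names to retain (e.g. the --target-genomes subset
          plus the reference genome).
    """
    kept_lines = []
    for line in lines:
        stripped = line.strip()
        is_entry = "=" in stripped and not stripped.startswith(("#", "["))
        if is_entry:
            genome = stripped.split("=", 1)[0].strip()
            if genome not in keep:
                continue  # annotation for a genome that is not being used
        kept_lines.append(line)
    return kept_lines
```

Writing the result to a new file and pointing --config at it keeps the original config intact for the full run.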
The command line was the one you suggested; Hher wasn't in the target genomes.
I can try changing the conf file.
F
Ok, Done.
I see some improvement, but still... [Hmel201001o:3742294-3764583]
[Hmel201001o:3,952,947-3,960,869]
[Hmel201001o:3,983,949-3,996,848]
Indels are still there... F
These results do look quite a bit better to me, at least in these examples. The indels in the transMap are getting cleaned up in the consensus CAT annotation. Do you think this looks good enough to continue on with the rest of the genomes?
Yes, it looks good enough. There's almost no gene left from the external_reference annotation.
I'm currently running the rest of the genomes.
I don't quite understand when the indels will be removed, I'm in your hands about that.
What's left now? Do you want to implement a better selection between transMap and the external_reference?
Thanks a lot! F
In this plot you can see the level of gene completeness. The x-axis is the percentage of the target hit that is aligned against Uniprot (100% means the gene is putatively complete). The dashed line is the external_reference, the continuous line the CAT annotation.
You can see how CAT generally does not select the best annotated genes: the curves are lower compared with the external_reference.
Uniprot.AllspeciesCATcomparison.outfmt6.w_pct_hit_length.pdf
F
I have some ideas for improving the selection between transMap and external reference that I'm working on that should improve the gene completeness. Essentially I want to add an option to tell CAT to prefer the external reference transcripts over transMap, and you would turn it on in cases like yours where the external reference is an annotation from the same species and the reference annotation being transMapped is from a different species. I'll let you know if/when I get something working that I'm happy with!
Wouldn't it be possible to apply some kind of score and select the best 2 or 3? F
Kind of, the problem is making sure the parent gene/transcript assignments make sense. There can't be multiple things with the same transcriptID...
One other flag you can try is --denovo-allow-bad-annot-or-tm, which will add additional predictions that overlapped multiple genes. You can add this with the rebuild-consensus option (so you don't need to run the whole pipeline from scratch).
Hi,
I'm looking at the end product of CAT, run only using TransMap and excluding the indel classification r.extend(find_indels(tx, psl, aln_mode)). I'm now trying to build a table of orthologs and paralogs. The gp_info files should contain all the info related to it. I didn't find a detailed description of them. They generally seem quite intuitive, although I'm a bit confused. The 4th field, transcript_class, should be the important one for what I'm doing, because it generally contains ortholog, poor_alignment or possible_paralog, but sometimes it contains a gene name instead (e.g. Eisa.Eisa1Z00G52.1.cg14688). Maybe something went wrong?
The other thing is related to the indel classification. Does removing that step introduce any bias I should be aware of?
Thanks a lot. F
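For the transcript_class question, it may help to tabulate the values in that column and list the rows that fall outside the three expected classes. This is a Python sketch that assumes the gp_info file is tab-separated with a header row containing a `transcript_class` column; the file layout is not documented in this thread, so both points are assumptions. It also reports the field count of each suspicious row, since a row with a different number of fields than the header could point to shifted columns.

```python
import csv
from collections import Counter

# Classes mentioned in this thread; the full set used by CAT may be larger
EXPECTED_CLASSES = {"ortholog", "poor_alignment", "possible_paralog"}


def summarize_transcript_class(handle, column="transcript_class"):
    """Count values of one column and collect rows that look suspicious.

    Returns (counts, suspicious), where suspicious is a list of
    (line_number, value, n_fields) tuples.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    index = header.index(column)  # raises ValueError if the column is absent
    counts = Counter()
    suspicious = []
    for line_number, row in enumerate(reader, start=2):
        value = row[index] if index < len(row) else ""
        counts[value] += 1
        if value not in EXPECTED_CLASSES or len(row) != len(header):
            suspicious.append((line_number, value, len(row)))
    return counts, suspicious
```

Rows where a transcript name such as Eisa.Eisa1Z00G52.1.cg14688 appears in this column would show up in `suspicious`, together with their line number for inspection in the original file.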