Open lassancejm opened 6 years ago
Hmm, so if I am interpreting what is going on here correctly, this is a case of gene family collapse. This is issue #80. The genes Vmn1r, Vnm1r42, and Vmn1r43 are all mapping to this locus.
This is another issue I hope to address ASAP, as it is a big problem. In developing CAT, I thought far more about detecting new members of a gene family than losing members. There is a method in consensus finding that is supposed to collapse duplicate transcripts, but it does so in an exactly matching fashion by hashing their exonic intervals. All of these transcripts have slightly different intervals, and so did not get removed in this process.
Obviously, I need something smarter, that generally detects the case where paralog resolution leads to multiple genes with significant locus overlap. The trick here is that overlapping genes do exist in reality, and in high quality annotation sets. So I need to work on some heuristics to try and classify the two cases. Probably based on a combination of genomic and exonic overlap, similar to how the parent gene assignment step works for AugustusCGP and AugustusPB.
Glad to read it is on your radar.
I did not realize that this was overlapping with issue #80; closing this then.
Reopening this ticket, as I have noticed that instead of having several overlapping genes, I frequently get no entries in annotations produced after Dec 2017. I am unsure if this is caused by the changes you made to deal with this issue, as I also stopped systematically using augustusPB. Could you remind me which component of CAT makes use of the protein evidence?
Do you have an example? This is possibly related to my changes to transMap filtering that try to resolve this issue. I am not entirely sure what is wrong, as my changes should have led to the genes all being collapsed down into one in the case of overlaps, not lost entirely.
I did also make filtering generally more stringent, however, so it could be the case that your genes of interest got lost in that process instead. If you have a specific example and can send me the unfiltered transMap PSL lines that are relevant I can try to figure it out.
The protein evidence is used by AugustusCGP and AugustusTM/TMR (but RNA-seq would also have to be provided to trigger TMR).
I did not have the files anymore, so took a bit of time to regenerate them.
It seems that your intuition is correct: it may have to do with the transMap filtering. I used the example from above (1st post) and do not find the corresponding transcripts in the filtered PSL.
Here are the corresponding lines from the unfiltered file:
509 126 229 0 1 1 6 22 + ENSMUST00000174046.1-3 943 75 940 scaffold06157 1618 0 886 8 70,211,1,116,1,190,273,2 75,145,357,358,474,475,665,938 0,72,283,285,402,407,598,884
589 156 218 0 3 29 4 33 + ENSMUST00000089419.2-3 1098 106 1098 scaffold06157 1618 0 996 7 283,1,586,2,13,11,67 106,390,391,977,992,1005,1031 0,283,285,884,886,901,929
573 160 218 0 3 15 5 18 + ENSMUST00000089418.4-3 1069 103 1069 scaffold06157 1618 0 969 8 283,1,586,2,13,11,15,40 103,387,388,974,989,1002,1014,1029 0,283,285,884,886,901,913,929
Thanks for looking into this!
Hi,
Sorry for the delay. Just to confirm, those 3 transcripts were all removed by the default filtering, correct? I am seeing that they have identifiers suggesting multiple alignments (-3 suffixes). In order for me to re-create what the filtering step is doing, I need the other alignments. Can you send me either the full transMap PSL or do something like `grep -P '( ENSMUST00000174046.1|ENSMUST00000089419.2|ENSMUST00000089418.4)' to pull them out? Thanks.
The coverage of these alignments look good -- 87-91% or so. But identity is low, 81-85%. It may be the case that the globalNearBest algorithm is picking a lower coverage higher identity alignment that is then being thrown out.
These worked on the old version, right? The commit that is the CAT paper
version is f89a814
, but that lacks the gene family collapse parts that
you are interested in.
On Tue, Jul 3, 2018 at 12:46 PM lassancejm notifications@github.com wrote:
I did not have the files anymore, so took a bit of time to regenerate them.
It seems that your intuition is correct: it may have to do with the transMap filtering. I used the example from above (1st post) and do not find the corresponding transcripts in the filtered PSL.
Here are the corresponding lines from the unfiltered file:
509 126 229 0 1 1 6 22 + ENSMUST00000174046.1-3 943 75 940 scaffold06157 1618 0 886 8 70,211,1,116,1,190,273,2 75,145,357,358,474,475,665,938 0,72,283,285,402,407,598,884 589 156 218 0 3 29 4 33 + ENSMUST00000089419.2-3 1098 106 1098 scaffold06157 1618 0 996 7 283,1,586,2,13,11,67 106,390,391,977,992,1005,1031 0,283,285,884,886,901,929 573 160 218 0 3 15 5 18 + ENSMUST00000089418.4-3 1069 103 1069 scaffold06157 1618 0 969 8 283,1,586,2,13,11,15,40 103,387,388,974,989,1002,1014,1029 0,283,285,884,886,901,913,929
Thanks for looking into this!
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-402272301, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXZhQWZwrmLUp-H6E7a490nc58oaFks5uC8ongaJpZM4QcJgI .
-- Ian Fiddes, PhD
Hi Ian,
Finally getting some time to look into this issue again.
Here is the data you requested:
0 6 22 0 0 0 1 4 + ENSMUST00000089418.4-0 1069 936 964 chr3 161151335 104504627 104504659 2 21,7 936,957 104504627,104504652
0 6 22 0 0 0 1 4 + ENSMUST00000089418.4-1 1069 936 964 chr3 161151335 104525050 104525082 2 21,7 936,957 104525050,104525075
437 88 0 0 3 3 3 25 + ENSMUST00000089418.4-2 1069 0 528 chr3 161151335 104547252 104547802 7 5,13,25,21,320,1,140 0,6,20,45,66,387,388 104547252,104547257,104547270,104547315,104547340,104547660,104547662
573 160 218 0 3 15 5 18 + ENSMUST00000089418.4-3 1069 103 1069 scaffold06157 1618 0 969 8283,1,586,2,13,11,15,40 103,387,388,974,989,1002,1014,1029 0,283,285,884,886,901,913,929
617 190 244 0 6 18 7 42 + ENSMUST00000089418.4-4 1069 0 1069 scaffold00850 26277 12621 1371413 5,13,25,21,320,1,586,2,13,11,3,11,40 0,6,20,45,66,387,388,974,989,1002,1014,1018,1029 12621,12626,12639,12684,12709,13029,13031,13630,13632,13647,13659,13662,13674
643 193 215 0 6 18 7 42 - ENSMUST00000089418.4-5 1069 0 1069 scaffold01405 13890 12118 1321113 40,11,3,11,13,2,586,1,320,21,25,13,5 0,40,52,56,67,93,95,681,683,1003,1024,1050,1064 12118,12159,12170,12174,12187,12200,12215,12802,12803,13127,13168,13193,13206
671 184 142 0 4 16 5 40 + ENSMUST00000089418.4-6 1069 0 1013 chr3 161151335 104528756 104529793 10 5,13,25,21,320,1,586,2,13,11 0,6,20,45,66,387,388,974,989,1002 104528756,104528761,104528774,104528819,104528844,104529164,104529166,104529765,104529767,104529782
816 192 44 0 5 17 7 42 - ENSMUST00000089418.4-7 1069 0 1069 chr3 161151335 105786249 105787343 12 40,15,11,13,2,586,1,320,21,25,13,5 0,40,56,67,93,95,681,683,1003,1024,1050,1064 105786249,105786290,105786306,105786319,105786332,105786347,105786934,105786935,105787259,105787300,105787325,105787338
0 5 23 0 0 0 1 4 + ENSMUST00000089419.2-0 1098 939 967 chr3 161151335 104504627 104504659 2 21,7 939,960 104504627,104504652
1 6 23 0 1 937 2 14686 + ENSMUST00000089419.2-1 1098 0 967 chr3 161151335 104510366 104525082 3 2,21,7 0,939,960 104510366,104525050,104525075
434 94 0 0 3 3 4 50 + ENSMUST00000089419.2-2 1098 0 531 chr3 161151335 104547224 104547802 8 3,5,13,25,21,320,1,140 0,3,9,23,48,69,390,391 104547224,104547252,104547257,104547270,104547315,104547340,104547660,104547662
589 156 218 0 3 29 4 33 + ENSMUST00000089419.2-3 1098 106 1098 scaffold06157 1618 0 996 7283,1,586,2,13,11,67 106,390,391,977,992,1005,1031 0,283,285,884,886,901,929
640 184 243 0 5 31 7 81 + ENSMUST00000089419.2-4 1098 0 1098 scaffold00850 26277 12593 1374112 3,5,13,25,21,320,1,586,2,13,11,67 0,3,9,23,48,69,390,391,977,992,1005,1031 12593,12621,12626,12639,12684,12709,13029,13031,13630,13632,13647,13674
653 188 215 0 5 42 6 92 - ENSMUST00000089419.2-5 1098 0 1098 scaffold01405 13890 12091 1323911 67,13,2,586,1,320,21,25,13,5,3 0,93,119,121,707,709,1029,1050,1076,1090,1095 12091,12187,12200,12215,12802,12803,13127,13168,13193,13206,13236
674 175 140 0 4 16 5 63 + ENSMUST00000089419.2-6 1098 0 1005 chr3 161151335 104528728 104529780 10 3,5,13,25,21,320,1,586,2,13 0,3,9,23,48,69,390,391,977,992 104528728,104528756,104528761,104528774,104528819,104528844,104529164,104529166,104529765,104529767
834 189 44 0 5 31 7 78 - ENSMUST00000089419.2-7 1098 0 1098 chr3 161151335 105786222 105787367 12 67,11,13,2,586,1,320,21,25,13,5,3 0,82,93,119,121,707,709,1029,1050,1076,1090,1095 105786222,105786306,105786319,105786332,105786347,105786934,105786935,105787259,105787300,105787325,105787338,105787364
0 7 21 0 0 0 1 4 + ENSMUST00000174046.1-0 943 900 928 chr3 161151335 104504627 104504659 2 21,7 900,921 104504627,104504652
0 7 21 0 0 0 1 4 + ENSMUST00000174046.1-1 943 900 928 chr3 161151335 104525050 104525082 2 21,7 900,921 104525050,104525075
400 75 0 0 2 2 5 12 + ENSMUST00000174046.1-2 943 16 493 chr3 161151335 104547315 104547802 8 21,25,82,211,1,116,1,18 16,37,63,145,357,358,474,475 104547315,104547340,104547365,104547449,104547660,104547662,104547779,104547784
509 126 229 0 1 1 6 22 + ENSMUST00000174046.1-3 943 75 940 scaffold06157 1618 0 886 870,211,1,116,1,190,273,2 75,145,357,358,474,475,665,938 0,72,283,285,402,407,598,884
519 150 253 0 2 2 7 26 + ENSMUST00000174046.1-4 943 16 940 scaffold00850 26277 12684 1363210 21,25,82,211,1,116,1,190,273,2 16,37,63,145,357,358,474,475,665,938 12684,12709,12734,12818,13029,13031,13148,13153,13344,13630
544 156 222 0 2 2 7 26 - ENSMUST00000174046.1-5 943 16 940 scaffold01405 13890 12200 1314810 2,273,190,1,116,1,211,82,25,21 3,5,278,468,469,585,587,798,881,906 12200,12215,12489,12683,12685,12802,12803,13016,13098,13127
621 150 151 0 2 2 7 26 + ENSMUST00000174046.1-6 943 16 940 chr3 161151335 104528819 104529767 10 21,25,82,211,1,116,1,190,273,2 16,37,63,145,357,358,474,475,665,938 104528819,104528844,104528869,104528953,104529164,104529166,104529283,104529288,104529479,104529765
720 153 49 0 2 2 7 26 - ENSMUST00000174046.1-7 943 16 940 chr3 161151335 105786332 105787280 10 2,273,190,1,116,1,211,82,25,21 3,5,278,468,469,585,587,798,881,906 105786332,105786347,105786621,105786815,105786817,105786934,105786935,105787148,105787230,105787259
I should mention a few things:
I used the default filtering, but if that could be tweaked, I would be happy to give it a whirl.
Thanks for your help! JML
@ifiddes Re-pigging on this as I am re-launching CAT with my big dataset.
I am looking at this now. Something is weird with the copy-paste possibly. Some of those rows have 20 columns, some have 21. It seems like the 17th column (block_count
) is missing from all of the entries on chr3
, while all of the scaffold entries are 21. Is this an interleaving of the reference and target PSLs or something?
Anyways, to fully investigate this, it probably will be best if you send me the full transMap PSL, reference PSL, transMap genePred and the reference database. That way I can fully walk the code, keeping in mind the three problem transcripts.
Apparently, I don't have the corresponding file anymore. Re-launching CAT, so I can provide them to you. Sorry about that
Ok, I sent you a folder containing what you requested via a file transfer manager (on your gmail).
Thanks again for spending time on this!
CAT finished earlier today and I compared the raw number of genes in the 'complete' stat category. Basically, it has been dropping over time (between Nov/Dec2017 and now). So, I am thinking that some filtering rules might be too stringent now compared to previous releases. The identity plots indicated that there is on average 80-85% identity between the ref and my beasts.
If you want, you could try re-running with the paper commit version, f89a814. You would be able to keep the existing chaining and augustus_cgp directories, which are most of the compute. Everything else would have to be re-ran.
The major change between then and now is changes in how paralogous alignments are resolved, and the clean-up step that removes collapsed gene families (picks a best gene for a locus where multiple genes have their best mapping to the same region). I will take a look at your data and see if I can tease out the changes. I suspect that the logic I introduced between now and then, particularly for paralog resolution, is actually quite a bit worse than what I originally had for longer branch lengths. I think I over-fitted for the primates. I am considering adding in a flag that runs a version of the original approach to paralog filtering that relied more on synteny than on alignment metrics that will work better on longer branch lengths. I will need to still work in the gene family collapse stuff, because as you pointed out last year overlapping genes are problematic for downstream analysis. But I think the dataset you provided should give me a great testing ground for working this out, so thanks!
On Thu, Oct 4, 2018 at 10:55 AM lassancejm notifications@github.com wrote:
CAT finished earlier today and I compared the raw number of genes in the 'complete' stat category. Basically, it has been dropping over time (between Nov/Dec2017 and now). So, I am thinking that some filtering rules might be too stringent now compared to previous releases. The identity plots indicated that there is on average 80-85% identity between the ref and my beasts.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-427112157, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXd6ELmM5CjZEnJMeTJ24q6Sn9Qleks5uhkuagaJpZM4QcJgI .
-- Ian Fiddes, PhD
Thanks. I am suspicious that rolling back to an older version will cause errors with the new release of Augustus. What do you think? I am happy to share more with you, if you think that it helps improving CAT further.
Yes, particularly the change that happened this week. However, it wouldn't be too hard to make that fix on the older version. But, that will only affect augustusCGP, which you could just use the existing outputs for anyways (and will save you a ton of compute to do so). AugustusTMR will work fine, that code hasn't changed since early 2017.
I am still working over the data you sent me, but I am hoping within the next week or so I will have a handle on ideas of what to change to get a better result.
On Fri, Oct 5, 2018 at 11:20 AM lassancejm notifications@github.com wrote:
Thanks. I am suspicious that rolling back to an older version will cause errors with the new release of Augustus. What do you think? I am happy to share more with you, if you think that it helps improving CAT further.
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-427456064, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXb2eLOeTGAMfFSfm6McDe-f4OZ3Xks5uh6LygaJpZM4QcJgI .
-- Ian Fiddes, PhD
Can I also keep the hints database? That one also takes time to prep
Yes, that also will be unchanged. You can also keep the chaining folder, which takes some time as well. If you just keep those 3 folders (augustus_cgp, chaining, hints_database) the pipeline should rebuild the reference information, run transMap, filter transMap, run TMR, then go straight into pairwise alignments and consensus finding.
On Fri, Oct 5, 2018 at 11:41 AM lassancejm notifications@github.com wrote:
Can I also keep the hints database? That one also takes time to prep
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-427461912, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXZH_NAQYOdqwNODt1kUoT0XpdaUMks5uh6fLgaJpZM4QcJgI .
-- Ian Fiddes, PhD
I have figured out what the bug is! Now I need to devise a fix. Basically, when I am filtering out overlapping genes, if we have a transitive relationship between more than two clusters of genes, it can be possible that a cluster is tagged as a winner once and a loser in another time.
I am going to first try and fix this as-is before I go into the more complex process adding alternative filtering modes.
I am attaching a BED. The fix is on branch gene_filtering_bug
commit aa33662. If you want, instead of actually running everything again, take a look at this BED. It is the output of the bug fixed version.
DEERMOUSE.filtered.bugfix.bed.txt
Thanks! I am hoping to have time to give this a serious look next week.
Finally had to chance to dive into this and it seems to be working fine, for the most part. The good news is that the filtering seems to be working well. However, some entries do not make it in the consensus annotation, but I am not sure why.
In this example, although there is multiple lines of evidence, some annotations are not present in the uppermost track, which is the consensus. The bottom track is a consensus from a previous version of CAT (Nov 2017), without paralog collapse.
FYI, I replaced two scripts (__init__.py
and filter_transmap.py
) from the master branch by those in the branch gene_filtering_bug
.
Interesting. This is definitely not desired behavior. I am going to need the reference and target database to figure out why this happened.
On Thu, Dec 20, 2018 at 5:47 AM lassancejm notifications@github.com wrote:
Finally had to chance to dive into this and it seems to be working fine, for the most part. The good news is that the filtering seems to be working well. However, some entries do not make it in the consensus annotation, but I am not sure why.
In this example, although there is multiple lines of evidence, some annotations are not present in the uppermost track, which is the consensus.
FYI, I replaced two scripts (init.py and filter_transmap.py) from the master branch by those in the branch gene_filtering_bug.
[image: image] https://user-images.githubusercontent.com/14118139/50257108-cd584480-03c7-11e9-9a77-5341f2ebff82.png
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-449004471, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXf347iNAuBarT1TJw0KNPUWFoX2Pks5u65TXgaJpZM4QcJgI .
-- Ian Fiddes
I just sent you a link to these db. Let me know if you need anything else.
Here are the coordinates for the region shown above, in case it helps to have an example:
chr1 95573429 95657906
Thanks again for looking into this!
Yep, was wondering if you had a chance to look into the files I shared with you?
Sorry, I just had a chance to. Something appears to be going wrong in the pairwise alignments that BLAT is generating for these genes that are causing them to drop out.
Focusing on Olfr66, I found that transMap mapped it over with 78% coverage and 72% identity:
sqlite> select * from TransMapEvaluation where TranscriptId == 'ENSMUST00000216303.1';
GeneId|TranscriptId|AlignmentId|classifier|value
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|AlnExtendsOffContig|0.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|AlnPartialMap|1.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|AlnAbutsUnknownBases|0.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|PercentN|0.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapCoverage|78.312
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapIdentity|72.318
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapGoodness|78.1
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapOriginalIntronsPercent|40.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|Synteny|6.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|ValidStart|1.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|ValidStop|1.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|ProperOrf|0.0
But when I look in the BLAT alignments that are used to pick the best version, none of them are in the database:
sqlite> select * from mRNA_transMap_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
sqlite> select * from CDS_transMap_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
sqlite> select * from mRNA_augTMR_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
sqlite> select * from CDS_augTMR_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
So something went wrong in the align_transcripts module it seems. This is troubling because I rely on these BLAT alignments to give an even playing field to augustus transcripts vs. transMap transcripts in a given locus. Given that this hasn't been a problem before, I am wondering if BLAT failed for valid reasons here given that the transMap identity is so low.
I plotted the transMap alignment identity across the entire dataset:
Then compared this to CDS-only and full mRNA BLAT alignments across the entire dataset:
You can see that these genes would be on the tail of those distributions, but not a outlier. So, I am not entirely sure why BLAT failed entirely to return an alignment.
Questions: 1) If you extract that gene in your deermouse genome and BLAT it against mm10 on the browser, what happens? 2) If you send me the FASTA, reference genePred and transMap genePred I can try to reproduce what CAT is doing and see why command-line BLAT responds differently than the browser (which often happens actually, the browser version has fancy smart parameterization).
Failing all else, I could re-write this logic so that even transcripts that fail pairwise alignment can get through. In that case transMap would always be chosen over augustus, but that could be fine. When I implemented it the way it is now, my reasoning was that Cactus can sometimes produce very spurious alignments that would get dropped by the requirement that the transcript-level pairwise alignments succeed. However, your locus looks great from a syntenic point of view, and seems correct to me at first glance.
Hi Ian,
Thanks for looking into this issue. Apparently, a cleanup of the filesystem where I am storing the CAT intermediate files took place, so I have had to restart the pipeline.
Questions:
If you extract that gene in your deermouse genome and BLAT it against mm10 on the browser, what happens?
I ran blat against mm10 on the browser using the corresponding genomic region and am attaching the results. Basically, best hit is the region corresponding to Olfr66. Then, there are two hits corresponding to the neighboring paralogs (CDS only apparently).
blat.out.psl.txt olfr66-like.fasta.txt
If you send me the FASTA, reference genePred and transMap genePred I can try to reproduce what CAT is doing and see why command-line BLAT responds differently than the browser (which often happens actually, the browser version has fancy smart parameterization).
Happy to share whatever you need. Which fasta do you mean? The reference's?
Failing all else, I could re-write this logic so that even transcripts that fail pairwise alignment can get through. In that case transMap would always be chosen over augustus, but that could be fine. When I implemented it the way it is now, my reasoning was that Cactus can sometimes produce very spurious alignments that would get dropped by the requirement that the transcript-level pairwise alignments succeed. However, your locus looks great from a syntenic point of view, and seems correct to me at first glance.
I am hoping this does not have to happen.
I need the fasta for the deer mouse, as well as the genePreds so that I can extract the transcript sequences and do the commandline pairwise blat myself to see what the alignments look like. I can get mm10 myself, so just send the assembly and the reference and target (transMap) genePred files. Thanks!
Ian Fiddes
On Jan 14, 2019, at 8:37 AM, lassancejm notifications@github.com wrote:
Hi Ian,
Thanks for looking into this issue. Apparently, a cleanup of the filesystem where I am storing the CAT intermediate files took place, so I have had to restart the pipeline.
Questions:
If you extract that gene in your deermouse genome and BLAT it against mm10 on the browser, what happens?
I ran blat against mm10 on the browser using the corresponding genomic region and am attaching the results. Basically, best hit is the region corresponding to Olfr66. Then, there are two hits corresponding to the neighboring paralogs (CDS only apparently).
blat.out.psl.txt olfr66-like.fasta.txt
If you send me the FASTA, reference genePred and transMap genePred I can try to reproduce what CAT is doing and see why command-line BLAT responds differently than the browser (which often happens actually, the browser version has fancy smart parameterization).
Happy to share whatever you need. Which fasta do you mean? The reference's?
Failing all else, I could re-write this logic so that even transcripts that fail pairwise alignment can get through. In that case transMap would always be chosen over augustus, but that could be fine. When I implemented it the way it is now, my reasoning was that Cactus can sometimes produce very spurious alignments that would get dropped by the requirement that the transcript-level pairwise alignments succeed. However, your locus looks great from a syntenic point of view, and seems correct to me at first glance.
I am hoping this does not have to happen.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Link sent privately, thanks! JM
I have the data, so far I am not able to reproduce this problem by hand. I am getting a reasonable alignment out of BLAT when I use these...
Can you also send me the PSL in transcript_alignment
folder? transcript_alignment/DEERMOUSE.transMap.mRNA.psl
so that I can see if when I run it by hand I get the same outputs?
Unfortunately, I am still waiting for the transcript_alignment
folder and its content to be re-generated. Will share as soon as I have it. I apologize for the delay.
The folder has not been generated yet (CAT is running Augustus), is it expected? WHen should it be created and populated?
The alignment stage doesn’t run until after augustusTMR runs (because it aligns those as well).
If you don’t want to run it all again, you could run the pipeline without —Augustus. You can always come back later and run with it by just adding the —rebuild-consensus flag.
On Fri, Jan 18, 2019 at 7:32 AM lassancejm notifications@github.com wrote:
The folder has not been generated yet (CAT is running Augustus), is it expected? WHen should it be created and populated?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-455585844, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXRCumK37QmsE6dcJEI-wmhXMu1ddks5vEekFgaJpZM4QcJgI .
-- Ian Fiddes
I let the pipeline run to completion with Augustus step included and it finally finished yesterday. The good news is that the missing annotations are now included. The bad news is that I think I made you waste time on this. I am not sure what happened exactly, but it seems that having to restart CAT more or less from scratch solved the problem. I am really sorry about that.
Interesting! No worries on my end, I am glad it worked out, and slightly worried about inconsistent behavior. Which annotation set did the successful copies come from?
As far as I can tell, it is variable (transmap, augTM, augTMR, etc.), which is a good thing. I guess the lesson is that I should be more careful about the files that I can really keep around when restarting the pipeline (only thing I had left now where the hints and chaining files).
Ya CAT is not as robust to this as I wanted it to be. If I had to guess, the transcript alignment step did something weird in the past, and the tables lived on in your databases, and didn't get properly overwritten because they still had their completion flag and the logic that is supposed to trigger a restart isn't quite robust enough. I will take a look at the DB you sent me previously, the timestamps might be informative and help me figure out what went wrong.
Sounds good.
In attempt to incorporate external annotation, I ran gffcompare with one of the consensus output from CAT.
gffcompare warned that there was a number of redundant features. I pulled one of those examples to have your opinion. Not only is this going to artificially inflate the # of genes, but it will also lead to problem if I extract protein sequences to look at , say, the evolution of family size.
Thoughts about how to deal with these cases are welcome!