ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0
168 stars 48 forks source link

indesirable behavior/bug? : overlapping genes #83

Open lassancejm opened 6 years ago

lassancejm commented 6 years ago

In attempt to incorporate external annotation, I ran gffcompare with one of the consensus output from CAT.

gffcompare warned that there was a number of redundant features. I pulled one of those examples to have your opinion. Not only is this going to artificially inflate the # of genes, but it will also lead to problem if I extract protein sequences to look at , say, the evolution of family size.

Thoughts about how to deal with these cases are welcome!

scaffold06157   CAT gene    1   886 .   +   .   ID=BEAST_G0056651;Name=Vmn1r-ps34;gene_biotype=unprocessed_pseudogene;source_gene=ENSMUSG00000099754.1;source_gene_common_name=Vmn1r-ps34;transcript_modes=transMap
scaffold06157   CAT transcript  1   886 8300    +   .   ID=ENSMUST00000174046.1-3;Name=Vmn1r-ps34;Parent=BEAST_G0056651;alignment_id=ENSMUST00000174046.1-3;frameshift=nan;gene_biotype=unprocessed_pseudogene;gene_id=BEAST_G0056651;paralogy=ENSMUST00000174046.1-2,ENSMUST00000174046.1-5,ENSMUST00000174046.1-4,ENSMUST00000174046.1-7,ENSMUST00000174046.1-6;proper_orf=True;source_gene=ENSMUSG00000099754.1;source_gene_common_name=Vmn1r-ps34;source_transcript=ENSMUST00000174046.1;source_transcript_name=Vmn1r-ps34-201;transcript_biotype=unprocessed_pseudogene;transcript_class=ortholog;transcript_id=BEAST_T0131552;transcript_modes=transMap;valid_start=True;valid_stop=True
scaffold06157   CAT exon    1   886 .   +   .   ID=exon:ENSMUST00000174046.1-3:0;Name=Vmn1r-ps34;Parent=ENSMUST00000174046.1-3;alignment_id=ENSMUST00000174046.1-3;gene_biotype=unprocessed_pseudogene;gene_id=BEAST_G0056651;paralogy=ENSMUST00000174046.1-2,ENSMUST00000174046.1-5,ENSMUST00000174046.1-4,ENSMUST00000174046.1-7,ENSMUST00000174046.1-6;proper_orf=True;source_gene=ENSMUSG00000099754.1;source_gene_common_name=Vmn1r-ps34;source_transcript=ENSMUST00000174046.1;source_transcript_name=Vmn1r-ps34-201;transcript_biotype=unprocessed_pseudogene;transcript_class=ortholog;transcript_id=BEAST_T0131552;transcript_modes=transMap;valid_start=True;valid_stop=True
scaffold06157   CAT gene    1   996 .   +   .   ID=BEAST_G0056652;Name=Vmn1r42;gene_biotype=protein_coding;source_gene=ENSMUSG00000068232.2;source_gene_common_name=Vmn1r42;transcript_modes=augTMR,augTM
scaffold06157   CAT transcript  1   996 8290    +   .   ID=augTM-ENSMUST00000089419.2-3;Name=Vmn1r42;Parent=BEAST_G0056652;alignment_id=augTM-ENSMUST00000089419.2-3;frameshift=False;gene_biotype=protein_coding;gene_id=BEAST_G0056652;paralogy=ENSMUST00000089419.2-5,ENSMUST00000089419.2-4,ENSMUST00000089419.2-7,ENSMUST00000089419.2-6,ENSMUST00000089419.2-2;proper_orf=False;source_gene=ENSMUSG00000068232.2;source_gene_common_name=Vmn1r42;source_transcript=ENSMUST00000089419.2;source_transcript_name=Vmn1r42-201;transcript_biotype=protein_coding;transcript_class=ortholog;transcript_id=BEAST_T0131553;transcript_modes=augTM,augTMR;valid_start=False;valid_stop=False
scaffold06157   CAT exon    1   996 .   +   .   ID=exon:augTM-ENSMUST00000089419.2-3:0;Name=Vmn1r42;Parent=augTM-ENSMUST00000089419.2-3;alignment_id=augTM-ENSMUST00000089419.2-3;gene_biotype=protein_coding;gene_id=BEAST_G0056652;paralogy=ENSMUST00000089419.2-5,ENSMUST00000089419.2-4,ENSMUST00000089419.2-7,ENSMUST00000089419.2-6,ENSMUST00000089419.2-2;proper_orf=False;source_gene=ENSMUSG00000068232.2;source_gene_common_name=Vmn1r42;source_transcript=ENSMUST00000089419.2;source_transcript_name=Vmn1r42-201;transcript_biotype=protein_coding;transcript_class=ortholog;transcript_id=BEAST_T0131553;transcript_modes=augTM,augTMR;valid_start=False;valid_stop=False
scaffold06157   CAT CDS 1   905 .   +   2   ID=CDS:augTM-ENSMUST00000089419.2-3:0;Name=Vmn1r42;Parent=augTM-ENSMUST00000089419.2-3;alignment_id=augTM-ENSMUST00000089419.2-3;gene_biotype=protein_coding;gene_id=BEAST_G0056652;paralogy=ENSMUST00000089419.2-5,ENSMUST00000089419.2-4,ENSMUST00000089419.2-7,ENSMUST00000089419.2-6,ENSMUST00000089419.2-2;proper_orf=False;source_gene=ENSMUSG00000068232.2;source_gene_common_name=Vmn1r42;source_transcript=ENSMUST00000089419.2;source_transcript_name=Vmn1r42-201;transcript_biotype=protein_coding;transcript_class=ortholog;transcript_id=BEAST_T0131553;transcript_modes=augTM,augTMR;valid_start=False;valid_stop=False
scaffold06157   CAT gene    1   969 .   +   .   ID=BEAST_G0056653;Name=Vmn1r43;gene_biotype=protein_coding;source_gene=ENSMUSG00000068231.4;source_gene_common_name=Vmn1r43;transcript_modes=augTMR
scaffold06157   CAT transcript  1   969 8220    +   .   ID=augTMR-ENSMUST00000089418.4-3;Name=Vmn1r43;Parent=BEAST_G0056653;alignment_id=augTMR-ENSMUST00000089418.4-3;frameshift=False;gene_biotype=protein_coding;gene_id=BEAST_G0056653;paralogy=ENSMUST00000089418.4-6,ENSMUST00000089418.4-7,ENSMUST00000089418.4-4,ENSMUST00000089418.4-5,ENSMUST00000089418.4-2;proper_orf=False;source_gene=ENSMUSG00000068231.4;source_gene_common_name=Vmn1r43;source_transcript=ENSMUST00000089418.4;source_transcript_name=Vmn1r43-201;transcript_biotype=protein_coding;transcript_class=ortholog;transcript_id=BEAST_T0131554;transcript_modes=augTMR;valid_start=False;valid_stop=False
scaffold06157   CAT exon    1   969 .   +   .   ID=exon:augTMR-ENSMUST00000089418.4-3:0;Name=Vmn1r43;Parent=augTMR-ENSMUST00000089418.4-3;alignment_id=augTMR-ENSMUST00000089418.4-3;gene_biotype=protein_coding;gene_id=BEAST_G0056653;paralogy=ENSMUST00000089418.4-6,ENSMUST00000089418.4-7,ENSMUST00000089418.4-4,ENSMUST00000089418.4-5,ENSMUST00000089418.4-2;proper_orf=False;source_gene=ENSMUSG00000068231.4;source_gene_common_name=Vmn1r43;source_transcript=ENSMUST00000089418.4;source_transcript_name=Vmn1r43-201;transcript_biotype=protein_coding;transcript_class=ortholog;transcript_id=BEAST_T0131554;transcript_modes=augTMR;valid_start=False;valid_stop=False
scaffold06157   CAT CDS 1   905 .   +   2   ID=CDS:augTMR-ENSMUST00000089418.4-3:0;Name=Vmn1r43;Parent=augTMR-ENSMUST00000089418.4-3;alignment_id=augTMR-ENSMUST00000089418.4-3;gene_biotype=protein_coding;gene_id=BEAST_G0056653;paralogy=ENSMUST00000089418.4-6,ENSMUST00000089418.4-7,ENSMUST00000089418.4-4,ENSMUST00000089418.4-5,ENSMUST00000089418.4-2;proper_orf=False;source_gene=ENSMUSG00000068231.4;source_gene_common_name=Vmn1r43;source_transcript=ENSMUST00000089418.4;source_transcript_name=Vmn1r43-201;transcript_biotype=protein_coding;transcript_class=ortholog;transcript_id=BEAST_T0131554;transcript_modes=augTMR;valid_start=False;valid_stop=False
ifiddes commented 6 years ago

Hmm, so if I am interpreting what is going on here correctly, this is a case of gene family collapse. This is issue #80. The genes Vmn1r, Vnm1r42, and Vmn1r43 are all mapping to this locus.

This is another issue I hope to address ASAP, as it is a big problem. In developing CAT, I thought far more about detecting new members of a gene family than losing members. There is a method in consensus finding that is supposed to collapse duplicate transcripts, but it does so in an exactly matching fashion by hashing their exonic intervals. All of these transcripts have slightly different intervals, and so did not get removed in this process.

Obviously, I need something smarter, that generally detects the case where paralog resolution leads to multiple genes with significant locus overlap. The trick here is that overlapping genes do exist in reality, and in high quality annotation sets. So I need to work on some heuristics to try and classify the two cases. Probably based on a combination of genomic and exonic overlap, similar to how the parent gene assignment step works for AugustusCGP and AugustusPB.

lassancejm commented 6 years ago

Glad to read it is on your radar.

I did not realize that this was overlapping with issue #80; closing this then.

lassancejm commented 6 years ago

Reopening this ticket, as I have noticed that instead of having several overlapping genes, I frequently get no entries in annotations produced after Dec 2017. I am unsure if this is caused by the changes you made to deal with this issue, as I also stopped systematically using augustusPB. Could you remind me which component of CAT makes use of the protein evidence?

ifiddes-10x-zz commented 6 years ago

Do you have an example? This is possibly related to my changes to transMap filtering that try to resolve this issue. I am not entirely sure what is wrong, as my changes should have led to the genes all being collapsed down into one in the case of overlaps, not lost entirely.

I did also make filtering generally more stringent, however, so it could be the case that your genes of interest got lost in that process instead. If you have a specific example and can send me the unfiltered transMap PSL lines that are relevant I can try to figure it out.

The protein evidence is used by AugustusCGP and AugustusTM/TMR (but RNA-seq would also have to be provided to trigger TMR).

lassancejm commented 6 years ago

I did not have the files anymore, so took a bit of time to regenerate them.

It seems that your intuition is correct: it may have to do with the transMap filtering. I used the example from above (1st post) and do not find the corresponding transcripts in the filtered PSL.

Here are the corresponding lines from the unfiltered file:

509 126 229 0   1   1   6   22  +   ENSMUST00000174046.1-3  943 75  940 scaffold06157   1618    0   886 8   70,211,1,116,1,190,273,2    75,145,357,358,474,475,665,938  0,72,283,285,402,407,598,884
589 156 218 0   3   29  4   33  +   ENSMUST00000089419.2-3  1098    106 1098    scaffold06157   1618    0   996 7   283,1,586,2,13,11,67    106,390,391,977,992,1005,1031   0,283,285,884,886,901,929
573 160 218 0   3   15  5   18  +   ENSMUST00000089418.4-3  1069    103 1069    scaffold06157   1618    0   969 8   283,1,586,2,13,11,15,40 103,387,388,974,989,1002,1014,1029  0,283,285,884,886,901,913,929

Thanks for looking into this!

ifiddes commented 6 years ago

Hi,

Sorry for the delay. Just to confirm, those 3 transcripts were all removed by the default filtering, correct? I am seeing that they have identifiers suggesting multiple alignments (-3 suffixes). In order for me to re-create what the filtering step is doing, I need the other alignments. Can you send me either the full transMap PSL or do something like `grep -P '( ENSMUST00000174046.1|ENSMUST00000089419.2|ENSMUST00000089418.4)' to pull them out? Thanks.

The coverage of these alignments look good -- 87-91% or so. But identity is low, 81-85%. It may be the case that the globalNearBest algorithm is picking a lower coverage higher identity alignment that is then being thrown out.

These worked on the old version, right? The commit that is the CAT paper version is f89a814, but that lacks the gene family collapse parts that you are interested in.

On Tue, Jul 3, 2018 at 12:46 PM lassancejm notifications@github.com wrote:

I did not have the files anymore, so took a bit of time to regenerate them.

It seems that your intuition is correct: it may have to do with the transMap filtering. I used the example from above (1st post) and do not find the corresponding transcripts in the filtered PSL.

Here are the corresponding lines from the unfiltered file:

509 126 229 0 1 1 6 22 + ENSMUST00000174046.1-3 943 75 940 scaffold06157 1618 0 886 8 70,211,1,116,1,190,273,2 75,145,357,358,474,475,665,938 0,72,283,285,402,407,598,884 589 156 218 0 3 29 4 33 + ENSMUST00000089419.2-3 1098 106 1098 scaffold06157 1618 0 996 7 283,1,586,2,13,11,67 106,390,391,977,992,1005,1031 0,283,285,884,886,901,929 573 160 218 0 3 15 5 18 + ENSMUST00000089418.4-3 1069 103 1069 scaffold06157 1618 0 969 8 283,1,586,2,13,11,15,40 103,387,388,974,989,1002,1014,1029 0,283,285,884,886,901,913,929

Thanks for looking into this!

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-402272301, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXZhQWZwrmLUp-H6E7a490nc58oaFks5uC8ongaJpZM4QcJgI .

-- Ian Fiddes, PhD

lassancejm commented 6 years ago

Hi Ian,

Finally getting some time to look into this issue again.

Here is the data you requested:

0   6   22  0   0   0   1   4   +   ENSMUST00000089418.4-0  1069    936 964 chr3    161151335   104504627   104504659   2   21,7    936,957 104504627,104504652
0   6   22  0   0   0   1   4   +   ENSMUST00000089418.4-1  1069    936 964 chr3    161151335   104525050   104525082   2   21,7    936,957 104525050,104525075
437 88  0   0   3   3   3   25  +   ENSMUST00000089418.4-2  1069    0   528 chr3    161151335   104547252   104547802   7   5,13,25,21,320,1,140    0,6,20,45,66,387,388    104547252,104547257,104547270,104547315,104547340,104547660,104547662
573 160 218 0   3   15  5   18  +   ENSMUST00000089418.4-3  1069    103 1069    scaffold06157   1618    0   969 8283,1,586,2,13,11,15,40    103,387,388,974,989,1002,1014,1029  0,283,285,884,886,901,913,929
617 190 244 0   6   18  7   42  +   ENSMUST00000089418.4-4  1069    0   1069    scaffold00850   26277   12621   1371413 5,13,25,21,320,1,586,2,13,11,3,11,40    0,6,20,45,66,387,388,974,989,1002,1014,1018,1029    12621,12626,12639,12684,12709,13029,13031,13630,13632,13647,13659,13662,13674
643 193 215 0   6   18  7   42  -   ENSMUST00000089418.4-5  1069    0   1069    scaffold01405   13890   12118   1321113 40,11,3,11,13,2,586,1,320,21,25,13,5    0,40,52,56,67,93,95,681,683,1003,1024,1050,1064 12118,12159,12170,12174,12187,12200,12215,12802,12803,13127,13168,13193,13206
671 184 142 0   4   16  5   40  +   ENSMUST00000089418.4-6  1069    0   1013    chr3    161151335   104528756   104529793   10  5,13,25,21,320,1,586,2,13,11    0,6,20,45,66,387,388,974,989,1002   104528756,104528761,104528774,104528819,104528844,104529164,104529166,104529765,104529767,104529782
816 192 44  0   5   17  7   42  -   ENSMUST00000089418.4-7  1069    0   1069    chr3    161151335   105786249   105787343   12  40,15,11,13,2,586,1,320,21,25,13,5  0,40,56,67,93,95,681,683,1003,1024,1050,1064    105786249,105786290,105786306,105786319,105786332,105786347,105786934,105786935,105787259,105787300,105787325,105787338
0   5   23  0   0   0   1   4   +   ENSMUST00000089419.2-0  1098    939 967 chr3    161151335   104504627   104504659   2   21,7    939,960 104504627,104504652
1   6   23  0   1   937 2   14686   +   ENSMUST00000089419.2-1  1098    0   967 chr3    161151335   104510366   104525082   3   2,21,7  0,939,960   104510366,104525050,104525075
434 94  0   0   3   3   4   50  +   ENSMUST00000089419.2-2  1098    0   531 chr3    161151335   104547224   104547802   8   3,5,13,25,21,320,1,140  0,3,9,23,48,69,390,391  104547224,104547252,104547257,104547270,104547315,104547340,104547660,104547662
589 156 218 0   3   29  4   33  +   ENSMUST00000089419.2-3  1098    106 1098    scaffold06157   1618    0   996 7283,1,586,2,13,11,67   106,390,391,977,992,1005,1031   0,283,285,884,886,901,929
640 184 243 0   5   31  7   81  +   ENSMUST00000089419.2-4  1098    0   1098    scaffold00850   26277   12593   1374112 3,5,13,25,21,320,1,586,2,13,11,67   0,3,9,23,48,69,390,391,977,992,1005,1031    12593,12621,12626,12639,12684,12709,13029,13031,13630,13632,13647,13674
653 188 215 0   5   42  6   92  -   ENSMUST00000089419.2-5  1098    0   1098    scaffold01405   13890   12091   1323911 67,13,2,586,1,320,21,25,13,5,3  0,93,119,121,707,709,1029,1050,1076,1090,1095   12091,12187,12200,12215,12802,12803,13127,13168,13193,13206,13236
674 175 140 0   4   16  5   63  +   ENSMUST00000089419.2-6  1098    0   1005    chr3    161151335   104528728   104529780   10  3,5,13,25,21,320,1,586,2,13 0,3,9,23,48,69,390,391,977,992  104528728,104528756,104528761,104528774,104528819,104528844,104529164,104529166,104529765,104529767
834 189 44  0   5   31  7   78  -   ENSMUST00000089419.2-7  1098    0   1098    chr3    161151335   105786222   105787367   12  67,11,13,2,586,1,320,21,25,13,5,3   0,82,93,119,121,707,709,1029,1050,1076,1090,1095    105786222,105786306,105786319,105786332,105786347,105786934,105786935,105787259,105787300,105787325,105787338,105787364
0   7   21  0   0   0   1   4   +   ENSMUST00000174046.1-0  943 900 928 chr3    161151335   104504627   104504659   2   21,7    900,921 104504627,104504652
0   7   21  0   0   0   1   4   +   ENSMUST00000174046.1-1  943 900 928 chr3    161151335   104525050   104525082   2   21,7    900,921 104525050,104525075
400 75  0   0   2   2   5   12  +   ENSMUST00000174046.1-2  943 16  493 chr3    161151335   104547315   104547802   8   21,25,82,211,1,116,1,18 16,37,63,145,357,358,474,475    104547315,104547340,104547365,104547449,104547660,104547662,104547779,104547784
509 126 229 0   1   1   6   22  +   ENSMUST00000174046.1-3  943 75  940 scaffold06157   1618    0   886 870,211,1,116,1,190,273,2   75,145,357,358,474,475,665,938  0,72,283,285,402,407,598,884
519 150 253 0   2   2   7   26  +   ENSMUST00000174046.1-4  943 16  940 scaffold00850   26277   12684   1363210 21,25,82,211,1,116,1,190,273,2  16,37,63,145,357,358,474,475,665,938    12684,12709,12734,12818,13029,13031,13148,13153,13344,13630
544 156 222 0   2   2   7   26  -   ENSMUST00000174046.1-5  943 16  940 scaffold01405   13890   12200   1314810 2,273,190,1,116,1,211,82,25,21  3,5,278,468,469,585,587,798,881,906 12200,12215,12489,12683,12685,12802,12803,13016,13098,13127
621 150 151 0   2   2   7   26  +   ENSMUST00000174046.1-6  943 16  940 chr3    161151335   104528819   104529767   10  21,25,82,211,1,116,1,190,273,2  16,37,63,145,357,358,474,475,665,938    104528819,104528844,104528869,104528953,104529164,104529166,104529283,104529288,104529479,104529765
720 153 49  0   2   2   7   26  -   ENSMUST00000174046.1-7  943 16  940 chr3    161151335   105786332   105787280   10  2,273,190,1,116,1,211,82,25,21  3,5,278,468,469,585,587,798,881,906 105786332,105786347,105786621,105786815,105786817,105786934,105786935,105787148,105787230,105787259

I should mention a few things:

I used the default filtering, but if that could be tweaked, I would be happy to give it a whirl.

Thanks for your help! JML

lassancejm commented 6 years ago

@ifiddes Re-pigging on this as I am re-launching CAT with my big dataset.

ifiddes commented 6 years ago

I am looking at this now. Something is weird with the copy-paste possibly. Some of those rows have 20 columns, some have 21. It seems like the 17th column (block_count) is missing from all of the entries on chr3, while all of the scaffold entries are 21. Is this an interleaving of the reference and target PSLs or something?

Anyways, to fully investigate this, it probably will be best if you send me the full transMap PSL, reference PSL, transMap genePred and the reference database. That way I can fully walk the code, keeping in mind the three problem transcripts.

lassancejm commented 6 years ago

Apparently, I don't have the corresponding file anymore. Re-launching CAT, so I can provide them to you. Sorry about that

lassancejm commented 6 years ago

Ok, I sent you a folder containing what you requested via a file transfer manager (on your gmail).

Thanks again for spending time on this!

lassancejm commented 6 years ago

CAT finished earlier today and I compared the raw number of genes in the 'complete' stat category. Basically, it has been dropping over time (between Nov/Dec2017 and now). So, I am thinking that some filtering rules might be too stringent now compared to previous releases. The identity plots indicated that there is on average 80-85% identity between the ref and my beasts.

ifiddes commented 6 years ago

If you want, you could try re-running with the paper commit version, f89a814. You would be able to keep the existing chaining and augustus_cgp directories, which are most of the compute. Everything else would have to be re-ran.

The major change between then and now is changes in how paralogous alignments are resolved, and the clean-up step that removes collapsed gene families (picks a best gene for a locus where multiple genes have their best mapping to the same region). I will take a look at your data and see if I can tease out the changes. I suspect that the logic I introduced between now and then, particularly for paralog resolution, is actually quite a bit worse than what I originally had for longer branch lengths. I think I over-fitted for the primates. I am considering adding in a flag that runs a version of the original approach to paralog filtering that relied more on synteny than on alignment metrics that will work better on longer branch lengths. I will need to still work in the gene family collapse stuff, because as you pointed out last year overlapping genes are problematic for downstream analysis. But I think the dataset you provided should give me a great testing ground for working this out, so thanks!

On Thu, Oct 4, 2018 at 10:55 AM lassancejm notifications@github.com wrote:

CAT finished earlier today and I compared the raw number of genes in the 'complete' stat category. Basically, it has been dropping over time (between Nov/Dec2017 and now). So, I am thinking that some filtering rules might be too stringent now compared to previous releases. The identity plots indicated that there is on average 80-85% identity between the ref and my beasts.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-427112157, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXd6ELmM5CjZEnJMeTJ24q6Sn9Qleks5uhkuagaJpZM4QcJgI .

-- Ian Fiddes, PhD

lassancejm commented 6 years ago

Thanks. I am suspicious that rolling back to an older version will cause errors with the new release of Augustus. What do you think? I am happy to share more with you, if you think that it helps improving CAT further.

ifiddes commented 6 years ago

Yes, particularly the change that happened this week. However, it wouldn't be too hard to make that fix on the older version. But, that will only affect augustusCGP, which you could just use the existing outputs for anyways (and will save you a ton of compute to do so). AugustusTMR will work fine, that code hasn't changed since early 2017.

I am still working over the data you sent me, but I am hoping within the next week or so I will have a handle on ideas of what to change to get a better result.

On Fri, Oct 5, 2018 at 11:20 AM lassancejm notifications@github.com wrote:

Thanks. I am suspicious that rolling back to an older version will cause errors with the new release of Augustus. What do you think? I am happy to share more with you, if you think that it helps improving CAT further.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-427456064, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXb2eLOeTGAMfFSfm6McDe-f4OZ3Xks5uh6LygaJpZM4QcJgI .

-- Ian Fiddes, PhD

lassancejm commented 6 years ago

Can I also keep the hints database? That one also takes time to prep

ifiddes commented 6 years ago

Yes, that also will be unchanged. You can also keep the chaining folder, which takes some time as well. If you just keep those 3 folders (augustus_cgp, chaining, hints_database) the pipeline should rebuild the reference information, run transMap, filter transMap, run TMR, then go straight into pairwise alignments and consensus finding.

On Fri, Oct 5, 2018 at 11:41 AM lassancejm notifications@github.com wrote:

Can I also keep the hints database? That one also takes time to prep

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-427461912, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXZH_NAQYOdqwNODt1kUoT0XpdaUMks5uh6fLgaJpZM4QcJgI .

-- Ian Fiddes, PhD

ifiddes commented 6 years ago

I have figured out what the bug is! Now I need to devise a fix. Basically, when I am filtering out overlapping genes, if we have a transitive relationship between more than two clusters of genes, it can be possible that a cluster is tagged as a winner once and a loser in another time.

I am going to first try and fix this as-is before I go into the more complex process adding alternative filtering modes.

ifiddes commented 6 years ago

I am attaching a BED. The fix is on branch gene_filtering_bug commit aa33662. If you want, instead of actually running everything again, take a look at this BED. It is the output of the bug fixed version. DEERMOUSE.filtered.bugfix.bed.txt

lassancejm commented 6 years ago

Thanks! I am hoping to have time to give this a serious look next week.

lassancejm commented 5 years ago

Finally had to chance to dive into this and it seems to be working fine, for the most part. The good news is that the filtering seems to be working well. However, some entries do not make it in the consensus annotation, but I am not sure why.

In this example, although there is multiple lines of evidence, some annotations are not present in the uppermost track, which is the consensus. The bottom track is a consensus from a previous version of CAT (Nov 2017), without paralog collapse.

FYI, I replaced two scripts (__init__.py and filter_transmap.py) from the master branch by those in the branch gene_filtering_bug.

image

ifiddes commented 5 years ago

Interesting. This is definitely not desired behavior. I am going to need the reference and target database to figure out why this happened.

On Thu, Dec 20, 2018 at 5:47 AM lassancejm notifications@github.com wrote:

Finally had to chance to dive into this and it seems to be working fine, for the most part. The good news is that the filtering seems to be working well. However, some entries do not make it in the consensus annotation, but I am not sure why.

In this example, although there is multiple lines of evidence, some annotations are not present in the uppermost track, which is the consensus.

FYI, I replaced two scripts (init.py and filter_transmap.py) from the master branch by those in the branch gene_filtering_bug.

[image: image] https://user-images.githubusercontent.com/14118139/50257108-cd584480-03c7-11e9-9a77-5341f2ebff82.png

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-449004471, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXf347iNAuBarT1TJw0KNPUWFoX2Pks5u65TXgaJpZM4QcJgI .

-- Ian Fiddes

lassancejm commented 5 years ago

I just sent you a link to these db. Let me know if you need anything else.

lassancejm commented 5 years ago

Here are the coordinates for the region shown above, in case it helps to have an example:

chr1 95573429 95657906

Thanks again for looking into this!

lassancejm commented 5 years ago

Yep, was wondering if you had a chance to look into the files I shared with you?

ifiddes commented 5 years ago

Sorry, I just had a chance to. Something appears to be going wrong in the pairwise alignments that BLAT is generating for these genes that are causing them to drop out.

Focusing on Olfr66, I found that transMap mapped it over with 78% coverage and 72% identity:

sqlite> select * from TransMapEvaluation where TranscriptId == 'ENSMUST00000216303.1';
GeneId|TranscriptId|AlignmentId|classifier|value
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|AlnExtendsOffContig|0.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|AlnPartialMap|1.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|AlnAbutsUnknownBases|0.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|PercentN|0.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapCoverage|78.312
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapIdentity|72.318
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapGoodness|78.1
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|TransMapOriginalIntronsPercent|40.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|Synteny|6.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|ValidStart|1.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|ValidStop|1.0
ENSMUSG00000058200.2|ENSMUST00000216303.1|ENSMUST00000216303.1-1|ProperOrf|0.0

But when I look in the BLAT alignments that are used to pick the best version, none of them are in the database:

sqlite> select * from mRNA_transMap_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
sqlite> select * from CDS_transMap_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
sqlite> select * from mRNA_augTMR_Metrics where AlignmentId like '%ENSMUST00000216303.1%';
sqlite> select * from CDS_augTMR_Metrics where AlignmentId like '%ENSMUST00000216303.1%';

So something went wrong in the align_transcripts module it seems. This is troubling because I rely on these BLAT alignments to give an even playing field to augustus transcripts vs. transMap transcripts in a given locus. Given that this hasn't been a problem before, I am wondering if BLAT failed for valid reasons here given that the transMap identity is so low.

I plotted the transMap alignment identity across the entire dataset:

image

Then compared this to CDS-only and full mRNA BLAT alignments across the entire dataset:

image image

You can see that these genes would be on the tail of those distributions, but not a outlier. So, I am not entirely sure why BLAT failed entirely to return an alignment.

Questions: 1) If you extract that gene in your deermouse genome and BLAT it against mm10 on the browser, what happens? 2) If you send me the FASTA, reference genePred and transMap genePred I can try to reproduce what CAT is doing and see why command-line BLAT responds differently than the browser (which often happens actually, the browser version has fancy smart parameterization).

Failing all else, I could re-write this logic so that even transcripts that fail pairwise alignment can get through. In that case transMap would always be chosen over augustus, but that could be fine. When I implemented it the way it is now, my reasoning was that Cactus can sometimes produce very spurious alignments that would get dropped by the requirement that the transcript-level pairwise alignments succeed. However, your locus looks great from a syntenic point of view, and seems correct to me at first glance.

lassancejm commented 5 years ago

Hi Ian,

Thanks for looking into this issue. Apparently, a cleanup of the filesystem where I am storing the CAT intermediate files took place, so I have had to restart the pipeline.

Questions:

If you extract that gene in your deermouse genome and BLAT it against mm10 on the browser, what happens?

I ran blat against mm10 on the browser using the corresponding genomic region and am attaching the results. Basically, best hit is the region corresponding to Olfr66. Then, there are two hits corresponding to the neighboring paralogs (CDS only apparently).

image

blat.out.psl.txt olfr66-like.fasta.txt

If you send me the FASTA, reference genePred and transMap genePred I can try to reproduce what CAT is doing and see why command-line BLAT responds differently than the browser (which often happens actually, the browser version has fancy smart parameterization).

Happy to share whatever you need. Which fasta do you mean? The reference's?

Failing all else, I could re-write this logic so that even transcripts that fail pairwise alignment can get through. In that case transMap would always be chosen over augustus, but that could be fine. When I implemented it the way it is now, my reasoning was that Cactus can sometimes produce very spurious alignments that would get dropped by the requirement that the transcript-level pairwise alignments succeed. However, your locus looks great from a syntenic point of view, and seems correct to me at first glance.

I am hoping this does not have to happen.

ifiddes commented 5 years ago

I need the fasta for the deer mouse, as well as the genePreds so that I can extract the transcript sequences and do the commandline pairwise blat myself to see what the alignments look like. I can get mm10 myself, so just send the assembly and the reference and target (transMap) genePred files. Thanks!

Ian Fiddes

On Jan 14, 2019, at 8:37 AM, lassancejm notifications@github.com wrote:

Hi Ian,

Thanks for looking into this issue. Apparently, a cleanup of the filesystem where I am storing the CAT intermediate files took place, so I have had to restart the pipeline.

Questions:

If you extract that gene in your deermouse genome and BLAT it against mm10 on the browser, what happens?

I ran blat against mm10 on the browser using the corresponding genomic region and am attaching the results. Basically, best hit is the region corresponding to Olfr66. Then, there are two hits corresponding to the neighboring paralogs (CDS only apparently).

blat.out.psl.txt olfr66-like.fasta.txt

If you send me the FASTA, reference genePred and transMap genePred I can try to reproduce what CAT is doing and see why command-line BLAT responds differently than the browser (which often happens actually, the browser version has fancy smart parameterization).

Happy to share whatever you need. Which fasta do you mean? The reference's?

Failing all else, I could re-write this logic so that even transcripts that fail pairwise alignment can get through. In that case transMap would always be chosen over augustus, but that could be fine. When I implemented it the way it is now, my reasoning was that Cactus can sometimes produce very spurious alignments that would get dropped by the requirement that the transcript-level pairwise alignments succeed. However, your locus looks great from a syntenic point of view, and seems correct to me at first glance.

I am hoping this does not have to happen.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

lassancejm commented 5 years ago

Link sent privately, thanks! JM

ifiddes commented 5 years ago

I have the data, so far I am not able to reproduce this problem by hand. I am getting a reasonable alignment out of BLAT when I use these...

Can you also send me the PSL in transcript_alignment folder? transcript_alignment/DEERMOUSE.transMap.mRNA.psl so that I can see if when I run it by hand I get the same outputs?

lassancejm commented 5 years ago

Unfortunately, I am still waiting for the transcript_alignment folder and its content to be re-generated. Will share as soon as I have it. I apologize for the delay.

lassancejm commented 5 years ago

The folder has not been generated yet (CAT is running Augustus), is it expected? WHen should it be created and populated?

ifiddes commented 5 years ago

The alignment stage doesn’t run until after augustusTMR runs (because it aligns those as well).

If you don’t want to run it all again, you could run the pipeline without —Augustus. You can always come back later and run with it by just adding the —rebuild-consensus flag.

On Fri, Jan 18, 2019 at 7:32 AM lassancejm notifications@github.com wrote:

The folder has not been generated yet (CAT is running Augustus), is it expected? WHen should it be created and populated?

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/ComparativeGenomicsToolkit/Comparative-Annotation-Toolkit/issues/83#issuecomment-455585844, or mute the thread https://github.com/notifications/unsubscribe-auth/AHdLXRCumK37QmsE6dcJEI-wmhXMu1ddks5vEekFgaJpZM4QcJgI .

-- Ian Fiddes

lassancejm commented 5 years ago

I let the pipeline run to completion with Augustus step included and it finally finished yesterday. The good news is that the missing annotations are now included. The bad news is that I think I made you waste time on this. I am not sure what happened exactly, but it seems that having to restart CAT more or less from scratch solved the problem. I am really sorry about that.

ifiddes commented 5 years ago

Interesting! No worries on my end, I am glad it worked out, and slightly worried about inconsistent behavior. Which annotation set did the successful copies come from?

lassancejm commented 5 years ago

As far as I can tell, it is variable (transmap, augTM, augTMR, etc.), which is a good thing. I guess the lesson is that I should be more careful about the files that I can really keep around when restarting the pipeline (only thing I had left now where the hints and chaining files).

ifiddes commented 5 years ago

Ya CAT is not as robust to this as I wanted it to be. If I had to guess, the transcript alignment step did something weird in the past, and the tables lived on in your databases, and didn't get properly overwritten because they still had their completion flag and the logic that is supposed to trigger a restart isn't quite robust enough. I will take a look at the DB you sent me previously, the timestamps might be informative and help me figure out what went wrong.

lassancejm commented 5 years ago

Sounds good.