Closed iamcorrinne closed 7 years ago
by clustering | # in Gr | # in other | # in Gr | # in other | ||
---|---|---|---|---|---|---|
Gk gain | 211 | 4 | 12 | 2% | 6% | |
Kd gain | 394 | 6 | 64 | 2% | 16% |
by clustering | in itself | in itself | ||
---|---|---|---|---|
Gk loss | 2144 | 1465 | 68% | |
Kd loss | 1458 | 11 | 1% |
There are 793 sequences in G.kirkii.v1.fasta, and only two of those are shorter than 1kb. Those two sequences dropped from kirkii.filtered.fasta and not run through MAKER.
How does that compare to Kokia annotation?
Our kokia genome that we used, kokia_final_scaffolds.fa, has 130430 sequences. Of those 115,026 were shorted than 1kb.
For reference, I'm getting these numbers by running samtools faidx on the genome and counting the number of lines to get the total, and running cut -f1-2 index.fa.fai | awk '$2 < 1000' | wc -l to get the number of sequences shorter than 1kb.
Thanks, Adam. We are looking at these numbers right now and trying to understand them. The number of gains does not bother us too much; it is mostly within what we would expect, although the number of Kokia "gains" that are found in kirkii is a little on the high side (16%), which could be overlooked annotations in kirkii. Even still, this only narrows the rate of gene gain in Kokia vs kirkii from 2x to 1.5x. That's ok.
What we are really struggling with is the fact that 68% of gene "losses" in kirkii are showing up in the unannotated fraction of the kirkii genome. That seems huge! What could cause this? We come up with three possibilities:
Do you guys have any thoughts?
Our thoughts are that we should take the regions that match "losses" in the kirkii v kirkii gff and translate those into protein. Then we can compare the proteins to those found in G. raimondii. If the proteins appear truncated (i.e., premature stop codon) or have a frame-shift, then we know that it is something in the sequence that is causing the problem. If they look good, then there is something wonky in the annotation that we will have to explain....
This region, for example, is suspect to me
SEQ_595 kirkii mRNA 51895369 51896058 . - . ID=Gorai.010G155900.1.mrna1;Name=Gorai.010G155900.1;Parent=Gorai.010G155900.1.path1;coverage=100.0;identity=96.1;matches=664;mismatches=22;indels=5;unknowns=0
Look at the number of mismatches and indels -- it seems like this protein would not be retained well, and that is what we would like to know. Are these truly overlooked, or is there something in the sequence (biological or technical) that led to this region not being annotated?
Correction on Corrinne's table: For Kd loss, 872 genes were identified in Kd (59.8%). There were 11 genes found in the raimondii genome that share sequence similarity to these genes.
Do we have any more results on if these identifies regions with sequence similarity are actually missed annotations (have protein coding sequence) or are pseudogenes (early stop codon, frameshifts, etc)?
@jconover11 -- wait, I don't understand. You said 872 genes were Kd loss (first, why is that down from 1458) and then gave a 59.8%...? Can you recopy the table and correct it?
@maa146 are you guys able to look at the proteins that were found in these genomes (in the files Gk.Gk loss, Kd.Kd loss, Gk.Gr gain, Gk.Kd gain, Kd.Gk gain, and Kd.Gr gain) to see if these proteins look "good", or if they are malformed (by stop codons or frameshift mutations, etc).
by clustering | # in Gr | # in other | # in Gr | # in other | ||
---|---|---|---|---|---|---|
Gk gain | 211 | 4 | 12 | 2% | 6% | |
Kd gain | 394 | 6 | 64 | 2% | 16% |
by clustering | in itself | in itself | ||
---|---|---|---|---|
Gk loss | 2144 | 1465 | 68% | |
Kd loss | 1458 | 872 | 59.8% |
My apologies. @iamcorrinne, I'm not sure where 11 came from. It should be 872.
perl -ne 'if(/ID=(.*?).mrna/) {print $1."\n"}' Kokia_loss.kokia.gmap.mRNA.gff | sort | uniq | wc -l
872
Wow, so there are many that are considered "lost" in the genome where we are finding a suitable candidate. @adamthrash @maa146 : why do you think so many are "missed"? Do you think this is biology or technical errors?
If we can pull the coding sequences out and find their potential amino acid seqs, that could indicate if they are missed annotations or legitimate pseudogenes. I suppose these could also be part of gene families, which tend to be more plastic in their CNV. If they are not annotation errors, is there any way these could be genome-assembly issues?
@iamcorrinne, your suspicions were correct, at least for the one you pointed out. There are several problems with the Gorai.010G155900 alignment to kirkii. Below is the entire alignment record:
Type | Start | Stop | Length |
---|---|---|---|
gene | 51895369 | 51896058 | 689 |
mRNA | 51895369 | 51896058 | 689 |
exon | 51895369 | 51896058 | 689 |
CDS | 51895934 | 51895961 | 27 |
CDS | 51895916 | 51895932 | 16 |
When the entire exon is encoded it has internal stops:
>Gorai.010G155900.1.mrna1_1 gene=Gorai.010G155900.1 CDS=98-143
MAPREKTAAVKGTGNNNEVHFRGVRKRPWGRYAAEIRDPGKKESCLARNL*YGRGSSQSL
RRRR*GISRRQGQD*FSGPL*EPQQG**L*K*W*QQPP*P*PE*HRGILEPGACSHG*LF
AVRSQPRTR*GGHRLRTYGC*ISFPASFTGALYCGPQSFLL*TVRSA*CSERTPISTVGF
DHHDLHATFNGGVQSDSDSSSVVDLNHHEVKSRPLLNIDLNQPAVPEIA*
When just the CDSs are encoded it's an incomplete protein:
>Gorai.010G155900.1.mrna1 gene=Gorai.010G155900.1
LPRSEIQGKKSRVWL
So, either the 90% identity and 90% coverage is too lenient, or we need to think of a way to remove the incomplete/invalid proteins
A very basic check shows 477 out of the 1478 "missing" genes are either incomplete (did not start with Met and end with stop) or had internal stops. Below is the code:
gffread -g kirkii.gene_mask.fa -y - Kirkii_loss.kirkii.gmap.gff |
bp_seqconvert.pl --from fasta --to tab |
awk ' $2 !~ /^M[^.]*\.$/ { print $0 } ' | wc -l
thanks @maa146 . This is somewhat reassuring, although the new numbers are still a little on the high side. The new numbers are now
by clustering | # in Gr | # in other | # in Gr | # in other | ||
---|---|---|---|---|---|---|
Gk gain | 211 | 4 | 12 | 2% | 6% | |
Kd gain | 394 | 6 | 64 | 2% | 16% |
by clustering | in itself | in itself | ||
---|---|---|---|---|
Gk loss | 2144 | 1465 | 68% | |
Gk loss fixed | 2144 | 1001 | 46% | |
Kd loss | 1458 | 872 | 59.8% |
@maa146 , can you run this for all the comparisons quickly and post the results?
@jconover11 , I think we can just proceed with caveats.
Below are the number of incomplete/invalid proteins in each comparison:
Comparison | Invalid Count |
---|---|
Kirkii_loss.kokia | 14 |
Kirkii_gain.kokia | 10 |
Kokia_loss.kokia | 358 |
Kokia_gain.kokia | 8 |
Kirkii_loss.kirkii | 477 |
Kirkii_gain.kirkii | 10 |
Kokia_loss.kirkii | 82 |
Kokia_gain.kirkii | 31 |
Kirkii_loss.gorai | 15 |
Kirkii_gain.gorai | 4 |
Kokia_loss.gorai | 12 |
Kokia_gain.gorai | 11 |
Another thing we can check is the protein lengths and alignments since the test above doesn't check function.
by clustering | # in Gr | # in other | # in Gr | # in other | ||
---|---|---|---|---|---|---|
Gk gain | 211 | 4 | 12 | 2% | 6% | |
Gk gain fixed | 211 | 0 | 2 | 0% | 0.9% | |
Kd gain | 394 | 6 | 64 | 2% | 16% | |
Kd gain fixed | 394 | 0 | 33 | 0% | 8% |
by clustering | in itself | in itself | ||
---|---|---|---|---|
Gk loss | 2144 | 1465 | 68% | |
Gk loss fixed | 2144 | 988 | 46% | |
Kd loss | 1458 | 872 | 59.8% | |
Kd loss fixed | 1458 | 514 | 35.3% |
Okay, so here are our fixed numbers. Maybe we should compare protein similarity for the rest? Or just go with these numbers and list the caveats?
I'm confused what is meant by "fixed" here re: gene loss events. I see 3 bins we can classify inferred gene loss events in:
Have we differentiated between 2 and 3 yet? Or are these clumped into "fixed" in the table?
In a way, finding incomplete/invalid proteins is supporting our inferences; we found pseudogenes. Am I thinking about this wrong?
@maa146 When writing this up, I realized that this analysis may be somewhat flawed. I sent you the CDS sequence (no introns) for these genes, but we mapped that to the whole genome sequence. Hence, any genes that have introns that comprise >10% of the total gene length will be filtered out from our results and ignored from validation.
Do you think we should lower the %identity threshold (or, if possible, set a lower penalty for insertions specifically), or redo this analysis with the full gene sequence (introns included)? I don't know how divergent the introns would be, so that may complicate matters a bit if we are still relying on a %identity threshold.
@jconover11: I used gmap to align the CDSs to the genome. It recognizes splice sites and aligns accordingly. I sent the mRNA portion of the alignments because they contained all the pertinent data and some of the full alignments were very large. If needed I can upload the full alignments to the ftp site.
Great, I wasn't aware gmap operated that way. I think this should be fine, then. Was the 90% thresholds applied to the full alignments, or just the mRNA?
The alignments. So, not including the introns
are we good on this now, guys?
I think so. I have a summary table saved on my office computer, I'll push it whenever I get back on Monday.
Received from @jconover11 :
Hi Tony,
Corrinne and I have discussed the Kokia manuscript with Jonathan (It's so close to being done!), and he would like to look for signatures of gene deletion in Kokia and Gossypioides from ~10,000 G.raimondii genes that we didn't look at before. I've attached the CDS sequences here, and would like to run gmap against Kokia and Gossypioides.
To determine if these genes actually belong to other ortholgous groups, it would be nice to not mask the Kokia and Gossypioides genomes before you run gmap this time. And, we would like to validate that any hits we find do not have "good" translations to check for overlooked annotations (just like we did previously).
Let me know if I can help in any way.
Thanks, Justin
@iamcorrinne and @jconover11 :
The numbers I gave previously include alt. splicing of the same Gorai gene. Below is the corrected table:
Genome | # hits | w/ valid proteins | w/ overlapping annotations | valid and overlapping |
---|---|---|---|---|
kokia | 6035 | 3330 | 2009 | 1527 |
kirkii | 6229 | 3513 | 1983 | 1512 |
The lists of genes for each cell are in analysis/deleted_annotations/ column 2: $db.gmap.hit.lst column 3: $db.gmap.valid_protein.lst column 4: $db.gmap.intersect.lst column 5: $db.gmap.both.lst
Does the "# hits" also include multiple alt. splicing variants? And were alt. splice variants accounted for in the previous run you did for this?
Justin and I are looking over the results, and we have a question. I think that the scan is good, and doesn't need repeated. What we are interested in is what is going on with G. kirkii. It seems that there is a fair amount of "stuff" in Gk that was unannotated. For example, 16% of Kokia "gains" were found in the unannotated fraction of Gk, and 68% of Gk "losses" were recovered from the unannotated fraction.
First question: Adam, when you ran the Gk annotation, how much of the genome was left out of Maker? Everything < 1kb? How much did that total?