Scan genome for missing annotations

iamcorrinne commented 7 years ago

Justin and I are looking over the results, and we have a question. I think that the scan is good, and doesn't need repeated. What we are interested in is what is going on with G. kirkii. It seems that there is a fair amount of "stuff" in Gk that was unannotated. For example, 16% of Kokia "gains" were found in the unannotated fraction of Gk, and 68% of Gk "losses" were recovered from the unannotated fraction.

First question: Adam, when you ran the Gk annotation, how much of the genome was left out of Maker? Everything < 1kb? How much did that total?

iamcorrinne commented 7 years ago

	by clustering	# in Gr	# in other		# in Gr	# in other
Gk gain	211	4	12		2%	6%
Kd gain	394	6	64		2%	16%

	by clustering	in itself		in itself
Gk loss	2144	1465		68%
Kd loss	1458	11		1%

adamthrash commented 7 years ago

There are 793 sequences in G.kirkii.v1.fasta, and only two of those are shorter than 1kb. Those two sequences dropped from kirkii.filtered.fasta and not run through MAKER.

conJUSTover commented 7 years ago

How does that compare to Kokia annotation?

adamthrash commented 7 years ago

Our kokia genome that we used, kokia_final_scaffolds.fa, has 130430 sequences. Of those 115,026 were shorted than 1kb.

For reference, I'm getting these numbers by running samtools faidx on the genome and counting the number of lines to get the total, and running cut -f1-2 index.fa.fai | awk '$2 < 1000' | wc -l to get the number of sequences shorter than 1kb.

iamcorrinne commented 7 years ago

Thanks, Adam. We are looking at these numbers right now and trying to understand them. The number of gains does not bother us too much; it is mostly within what we would expect, although the number of Kokia "gains" that are found in kirkii is a little on the high side (16%), which could be overlooked annotations in kirkii. Even still, this only narrows the rate of gene gain in Kokia vs kirkii from 2x to 1.5x. That's ok.

What we are really struggling with is the fact that 68% of gene "losses" in kirkii are showing up in the unannotated fraction of the kirkii genome. That seems huge! What could cause this? We come up with three possibilities:

overlooked annotations in kirkii (what would be the reason for this? Differences in what is expressed in the RNA-seq libraries? Something else?)
something in the sequence that causes the proteins to appear malformed (i.e., if there are indels in the sequence that disrupt the protein)
1. malformed proteins could be sequencing error
2. malformed proteins could also be biology

Do you guys have any thoughts?

Our thoughts are that we should take the regions that match "losses" in the kirkii v kirkii gff and translate those into protein. Then we can compare the proteins to those found in G. raimondii. If the proteins appear truncated (i.e., premature stop codon) or have a frame-shift, then we know that it is something in the sequence that is causing the problem. If they look good, then there is something wonky in the annotation that we will have to explain....

iamcorrinne commented 7 years ago

This region, for example, is suspect to me

SEQ_595 kirkii mRNA 51895369 51896058 . - . ID=Gorai.010G155900.1.mrna1;Name=Gorai.010G155900.1;Parent=Gorai.010G155900.1.path1;coverage=100.0;identity=96.1;matches=664;mismatches=22;indels=5;unknowns=0

Look at the number of mismatches and indels -- it seems like this protein would not be retained well, and that is what we would like to know. Are these truly overlooked, or is there something in the sequence (biological or technical) that led to this region not being annotated?

conJUSTover commented 7 years ago

Correction on Corrinne's table: For Kd loss, 872 genes were identified in Kd (59.8%). There were 11 genes found in the raimondii genome that share sequence similarity to these genes.

conJUSTover commented 7 years ago

Do we have any more results on if these identifies regions with sequence similarity are actually missed annotations (have protein coding sequence) or are pseudogenes (early stop codon, frameshifts, etc)?

iamcorrinne commented 7 years ago

@jconover11 -- wait, I don't understand. You said 872 genes were Kd loss (first, why is that down from 1458) and then gave a 59.8%...? Can you recopy the table and correct it?

@maa146 are you guys able to look at the proteins that were found in these genomes (in the files Gk.Gk loss, Kd.Kd loss, Gk.Gr gain, Gk.Kd gain, Kd.Gk gain, and Kd.Gr gain) to see if these proteins look "good", or if they are malformed (by stop codons or frameshift mutations, etc).

conJUSTover commented 7 years ago

	by clustering	# in Gr	# in other		# in Gr	# in other
Gk gain	211	4	12		2%	6%
Kd gain	394	6	64		2%	16%

	by clustering	in itself		in itself
Gk loss	2144	1465		68%
Kd loss	1458	872		59.8%

conJUSTover commented 7 years ago

My apologies. @iamcorrinne, I'm not sure where 11 came from. It should be 872. perl -ne 'if(/ID=(.*?).mrna/) {print $1."\n"}' Kokia_loss.kokia.gmap.mRNA.gff | sort | uniq | wc -l 872

iamcorrinne commented 7 years ago

Wow, so there are many that are considered "lost" in the genome where we are finding a suitable candidate. @adamthrash @maa146 : why do you think so many are "missed"? Do you think this is biology or technical errors?

conJUSTover commented 7 years ago

If we can pull the coding sequences out and find their potential amino acid seqs, that could indicate if they are missed annotations or legitimate pseudogenes. I suppose these could also be part of gene families, which tend to be more plastic in their CNV. If they are not annotation errors, is there any way these could be genome-assembly issues?

maa146 commented 7 years ago

@iamcorrinne, your suspicions were correct, at least for the one you pointed out. There are several problems with the Gorai.010G155900 alignment to kirkii. Below is the entire alignment record:

Type	Start	Stop	Length
gene	51895369	51896058	689
mRNA	51895369	51896058	689
exon	51895369	51896058	689
CDS	51895934	51895961	27
CDS	51895916	51895932	16

When the entire exon is encoded it has internal stops:

>Gorai.010G155900.1.mrna1_1 gene=Gorai.010G155900.1 CDS=98-143
MAPREKTAAVKGTGNNNEVHFRGVRKRPWGRYAAEIRDPGKKESCLARNL*YGRGSSQSL
RRRR*GISRRQGQD*FSGPL*EPQQG**L*K*W*QQPP*P*PE*HRGILEPGACSHG*LF
AVRSQPRTR*GGHRLRTYGC*ISFPASFTGALYCGPQSFLL*TVRSA*CSERTPISTVGF
DHHDLHATFNGGVQSDSDSSSVVDLNHHEVKSRPLLNIDLNQPAVPEIA*

When just the CDSs are encoded it's an incomplete protein:

>Gorai.010G155900.1.mrna1 gene=Gorai.010G155900.1
LPRSEIQGKKSRVWL

So, either the 90% identity and 90% coverage is too lenient, or we need to think of a way to remove the incomplete/invalid proteins

maa146 commented 7 years ago

A very basic check shows 477 out of the 1478 "missing" genes are either incomplete (did not start with Met and end with stop) or had internal stops. Below is the code:

gffread -g kirkii.gene_mask.fa -y  -  Kirkii_loss.kirkii.gmap.gff | 
     bp_seqconvert.pl --from fasta --to tab | 
     awk ' $2 !~ /^M[^.]*\.$/ { print $0 } ' | wc -l

iamcorrinne commented 7 years ago

thanks @maa146 . This is somewhat reassuring, although the new numbers are still a little on the high side. The new numbers are now

	by clustering	# in Gr	# in other		# in Gr	# in other
Gk gain	211	4	12		2%	6%
Kd gain	394	6	64		2%	16%

	by clustering	in itself	in itself
Gk loss	2144	1465	68%
Gk loss fixed	2144	1001	46%
Kd loss	1458	872	59.8%

iamcorrinne commented 7 years ago

@maa146 , can you run this for all the comparisons quickly and post the results?

@jconover11 , I think we can just proceed with caveats.

maa146 commented 7 years ago

Below are the number of incomplete/invalid proteins in each comparison:

Comparison	Invalid Count
Kirkii_loss.kokia	14
Kirkii_gain.kokia	10
Kokia_loss.kokia	358
Kokia_gain.kokia	8
Kirkii_loss.kirkii	477
Kirkii_gain.kirkii	10
Kokia_loss.kirkii	82
Kokia_gain.kirkii	31
Kirkii_loss.gorai	15
Kirkii_gain.gorai	4
Kokia_loss.gorai	12
Kokia_gain.gorai	11

Another thing we can check is the protein lengths and alignments since the test above doesn't check function.

iamcorrinne commented 7 years ago

	by clustering	# in Gr	# in other	# in Gr	# in other
Gk gain	211	4	12	2%	6%
Gk gain fixed	211	0	2	0%	0.9%
Kd gain	394	6	64	2%	16%
Kd gain fixed	394	0	33	0%	8%

	by clustering	in itself	in itself
Gk loss	2144	1465	68%
Gk loss fixed	2144	988	46%
Kd loss	1458	872	59.8%
Kd loss fixed	1458	514	35.3%

Okay, so here are our fixed numbers. Maybe we should compare protein similarity for the rest? Or just go with these numbers and list the caveats?

conJUSTover commented 7 years ago

I'm confused what is meant by "fixed" here re: gene loss events. I see 3 bins we can classify inferred gene loss events in:

Inferred, but no sequence similarities found in masked genome
Inferred, but disrupted nucleotide sequence similarity found in masked genome
inferred, but missed annotation in masked genome

Have we differentiated between 2 and 3 yet? Or are these clumped into "fixed" in the table?

In a way, finding incomplete/invalid proteins is supporting our inferences; we found pseudogenes. Am I thinking about this wrong?

conJUSTover commented 7 years ago

@maa146 When writing this up, I realized that this analysis may be somewhat flawed. I sent you the CDS sequence (no introns) for these genes, but we mapped that to the whole genome sequence. Hence, any genes that have introns that comprise >10% of the total gene length will be filtered out from our results and ignored from validation.

Do you think we should lower the %identity threshold (or, if possible, set a lower penalty for insertions specifically), or redo this analysis with the full gene sequence (introns included)? I don't know how divergent the introns would be, so that may complicate matters a bit if we are still relying on a %identity threshold.

maa146 commented 7 years ago

@jconover11: I used gmap to align the CDSs to the genome. It recognizes splice sites and aligns accordingly. I sent the mRNA portion of the alignments because they contained all the pertinent data and some of the full alignments were very large. If needed I can upload the full alignments to the ftp site.

conJUSTover commented 7 years ago

Great, I wasn't aware gmap operated that way. I think this should be fine, then. Was the 90% thresholds applied to the full alignments, or just the mRNA?

maa146 commented 7 years ago

The alignments. So, not including the introns

iamcorrinne commented 7 years ago

are we good on this now, guys?

conJUSTover commented 7 years ago

I think so. I have a summary table saved on my office computer, I'll push it whenever I get back on Monday.

maa146 commented 7 years ago

Received from @jconover11 :

Hi Tony,

Corrinne and I have discussed the Kokia manuscript with Jonathan (It's so close to being done!), and he would like to look for signatures of gene deletion in Kokia and Gossypioides from ~10,000 G.raimondii genes that we didn't look at before. I've attached the CDS sequences here, and would like to run gmap against Kokia and Gossypioides.

To determine if these genes actually belong to other ortholgous groups, it would be nice to not mask the Kokia and Gossypioides genomes before you run gmap this time. And, we would like to validate that any hits we find do not have "good" translations to check for overlooked annotations (just like we did previously).

Let me know if I can help in any way.

Thanks, Justin

maa146 commented 7 years ago

@iamcorrinne and @jconover11 :

The numbers I gave previously include alt. splicing of the same Gorai gene. Below is the corrected table:

Genome	# hits	w/ valid proteins	w/ overlapping annotations	valid and overlapping
kokia	6035	3330	2009	1527
kirkii	6229	3513	1983	1512

The lists of genes for each cell are in analysis/deleted_annotations/ column 2: $db.gmap.hit.lst column 3: $db.gmap.valid_protein.lst column 4: $db.gmap.intersect.lst column 5: $db.gmap.both.lst

conJUSTover commented 7 years ago

Does the "# hits" also include multiple alt. splicing variants? And were alt. splice variants accounted for in the previous run you did for this?

Wendellab / KokiaKirkii

Scan genome for missing annotations #3