Comparing merged class against class found in Monnahan et_al

The manuscript Using multiple reference genomes to identify and resolve annotation inconsistencies is a recent publication by Monnahan et al from 2020. It details a clever approach which I appreciate which attempts to identify potentially split genes in the maize genome.

It does so by comparing multiple versions of maize genome annotations (B73 - Ph207 - W22) against one another, and compares them in a pairwise manner to identify genes with a "One gene hits many in another" genome approach. This can be somewhat seen in their figure 1.

The manuscript goes on to basically test whether these sets of genes should be merged - or kept as split genes. Asking the question - "which annotation is correct?" It does so by looking at RNA-seq data across multiple different tissue types which asking whether the 'split' genes involved are actually similarly expressed. This is summed in their metric M2f (Mean 2 fold expression difference) - Again figure 2 here:

It's definitely an interesting approach and one I appreciate. Something to note here though is that the cut off they're using - this "top 10 of m2f" metric is very stringent (which makes sense). Basically if the values of M2F for one of their calculated pairs doesn't fall on either side (the split or merged side) of the distribution, they do not "make a call" as it were. So this does mitigate conclusions they can make about some of the potential pairs they have. For instance - initially they identify 481 potential mergers in B73 - and end up identifying 96 potential mergers, 170 genes which should remain split, and 240 which were unable to be called (this adds up to 506??).

Anyways - with this all in mind, I'm interested in identifying the final set of mergers and splits, and seeing what proportion we agree/disagree on, and what we can say about these regions.

Pulling out Correct B73 Annotation Calls

I went ahead and dowloaded their file with the "Supported annotations according to our M2f procedure (Additonal File 10)" to my local computer directory here: /Users/feilab/Desktop/Monnahan_et_al/Monnahan_et_al_validated_merged_genes.txt

File looks like this: Screen Shot 2021-02-10 at 10 27 42 AM

The only way you're able to decipher the comparison being made here (this contains merge/split pairs from all combination of genomes) is by looking at the sequence ID name. B73 had the identifiers like: Zm00001d010484 where the 1d is the identification of B73.

With this in mind I'll need to filter all merged genes to those focusing on the B73 set. To do so, I ran a simple awk command looking for 1d in iether column 1 or column 2. Command" ❯ awk '$2 ~ /1d/ || $1 ~ /1d/ {print $0}' Monnahan_et_al_validated_merged_genes.txt | sort > Monnahan_et_al_validated_B73_genes.txt

When looking at the number of genes here you see something somewhat odd:

❯ awk '$2 ~ /1d/ || $1 ~ /1d/  {print $0}' Monnahan_et_al_validated_merged_genes.txt | sort | uniq | wc -l
    1443

1443 isn' t what I was expecting. But this could be a trimmed down number to the 1383 mentioned in the manuscript. Quote:

Considering these split-genes along with the merged genes to which they corresponded, our analysis concerns 1275, 1383, and 2125 genes in the W22, B73, and PH207 annotations, respectively, corresponding to roughly 3–5% of all genes contained in these annotations.

Which would make sense because looking at the file - when pulling B73 specific annotations quite a few have the following values

Merged Splits M2f Call
Zm00004b039508 Zm00001d025384,Zm00001d025385 NA NA
Zm00004b039512 Zm00001d025395,Zm00001d025397 NA NA
Zm00004b040088 Zm00001d026086,Zm00001d026087 NA NA
Zm00004b040104 Zm00001d026105,Zm00001d026106 NA NA
Zm00004b040200 Zm00001d026220,Zm00001d026221 NA NA

To get at the numbers of each annotation class they generated (NoCall, Merged, Split,NA) ran the following command and got the following values:

~/Desktop/Monnahan_et_al
❯ awk '{print $4}' Monnahan_et_al_validated_B73_genes.txt | sort | uniq -c
 407 Merged
 232 NA
 552 NoCall
 252 Split

Interesting... However this doesn't really tell us exactly how many of each merger/split class in B73 is actually supported. That "Merged" value above could correspond to a gene which is already Merged in B73 which is supported which is Split in Ph207. So I need to further process this.

Generate then the file of B73 true mergers as well as those splits in B73 which should stay as true splits, as well as those of No Calls and NAs Command:

awk '$2 ~ /1d/ && $4 == "Merged" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.merged.txt 

 awk '$2 ~ /1d/ && $4 == "Split" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.kept_split.txt

❯ awk '$2 ~ /1d/ && $4 == "NA" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.NA.txt

❯ awk '$2 ~ /1d/ && $4 == "NoCall" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.NoCall.txt

❯ wc -l  Monnahan_et_al_validated_B73_genes.*
     141 Monnahan_et_al_validated_B73_genes.NA.txt
     240 Monnahan_et_al_validated_B73_genes.NoCall.txt
     170 Monnahan_et_al_validated_B73_genes.kept_split.txt
      96 Monnahan_et_al_validated_B73_genes.merged.txt
    1443 Monnahan_et_al_validated_B73_genes.txt
    2090 total

So this is the set we actually want. This is the set of genes found in B73 which were thought to be needing either merger, or to remain a split. Excellent. From these we can start figuring out which of these intersect with my calls, and are maybe different, the same, or if my calls can potentially improve on their "No call" class.

Pulling My annotations

Generated my own set of passing genes using the following command: awk '{print $4}' merged_original.bed > Mendieta_et_al.passing.txt and put this in a sub directory named Mendieta. This is the list of all gene mergers which I found. I'll go ahead and grep these gene names later to see what proportion of identical genes is represented in this manuscript as compared to mine.

Find intersection

Grep the gene names which i have found in each of the subgroups.

parallel "rg -f Mendieta/Mendieta_et_al.passing.txt {} > {.}.Mendieta_intersect.txt" ::: Monnahan_et_al_validated_B73*

Count the number of intersections we have:

❯ wc -l *Mendieta_intersect.txt
       0 Monnahan_et_al_validated_B73_genes.NA.Mendieta_intersect.txt
      60 Monnahan_et_al_validated_B73_genes.NoCall.Mendieta_intersect.txt
      15 Monnahan_et_al_validated_B73_genes.kept_split.Mendieta_intersect.txt
      34 Monnahan_et_al_validated_B73_genes.merged.Mendieta_intersect.txt
     109 total

Interesting. We're able to make 60 more IDs in their no call section. But we're only capturing about 34 of their identified mergers. Begs the question why not more? Lets get the list of genes which we did NOT identify. We'll use a similar command as above but use "inverse grep"

parallel "rg -v -f Mendieta/Mendieta_et_al.passing.txt {} > {.}.Not_Mendieta_intersect.txt" ::: Monnahan_et_al_validated_B73*

With the output:

❯ wc -l *Mendieta_intersect.txt
     141 Monnahan_et_al_validated_B73_genes.NA.Not_Mendieta_intersect.txt
     180 Monnahan_et_al_validated_B73_genes.NoCall.Not_Mendieta_intersect.txt
     155 Monnahan_et_al_validated_B73_genes.kept_split.Not_Mendieta_intersect.txt
      62 Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.txt
     538 total

Now I'm going to have to go through the browser and start looking over some of these pairs to get a sense of what's going on.

Looking At the Merger Class they Have and I missed

Some of these don't look great. When looking at the ChIP-seq data - it's clear that some of these "Mergers" are really two distinct genes. Look below these two genes. ID'd on line:Zm00004b030968 Zm00001d038672,Zm00001d038673 0.551697212524484 Merged

Clearly these are two seperate genes based of unique aligned chip-seq reads.

Screen Shot 2021-02-10 at 11 58 49 AM

Again - clear these are seperate genes: Zm00004b030968 Zm00001d038672,Zm00001d038673 0.551697212524484 Merged

Screen Shot 2021-02-10 at 12 04 57 PM

Other regions it makes sense why we couldn't say anything about these loci. Poor mappability and very short genes appears to be one of the largest issues here:

Screen Shot 2021-02-10 at 12 00 53 PM

Screen Shot 2021-02-10 at 12 03 46 PM

So overall - I think it's clear that they're able to call regions as mergers which I can't due to mappability. This is excellent as my method will simply not be able to pick these apart. On the other hand, my method does present some strikingly clear evidence that some of these "mergers" should not be as such. Rather they're likely recenlty evolvled tandem duplicates. Their method since it's utilizing RNA-seq would basically make the assumption that since they're have similar expression - they are likely a single gene. This is... Not right.

Recent duplications are still duplications. Another problem area they have here... Screen Shot 2021-02-10 at 12 10 35 PM

And other regions which I find somewhat interesting... All super small genes with an abundance of H3K4me3 Screen Shot 2021-02-10 at 12 13 05 PM

The Actual Numbers

So, with this all said and done - what are the numbers reported? They are....\

In total we interscted with 34/96 or the predicted merged annotations in Monnahan et al.

Futher, we were able to add evidence to 60 merged regions from Monnahan et al which they were unable to due to evidence based cut offs.

And in total there were 15 disagreements between the Monnahan et all dataset and our dataset. Indicating that these may be in fact genes which neec to be split, but should not be.

So in total 143 of out annotations intersected.

Additionally of the renaming merged gene intersect our annotation results are either unable to intersect these regions as they don't have appropriate chip enrichment (K36me3 and K4me1 required) or they actually disagree with the functional Assay chip-seq presents. Probably won't mention this in the manuscript.

Looking at Mappability of non-caught gene class

Something I need to investigate quickly is the reason 64/96 genes were not caught. Based off the above information it really appears that these genes fall into low mappability regions. This is an easy thing to test considering I have the mappability values for all genes in the genome. This should basically be a quick grep command of those regions which intersected, and then taking the mean of these regions.

Command to grab just the names from the intersect regions: awk '{print $2}' Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.txt | awk 'BEGIN {FS=","} ; {print $1,"\n",$2}' | sed 's/ //g' > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.gene_names_only.txt

Then - using ripgrep - grab the associated gene names:

❯ rg -f Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.gene_names_only.txt /Users/feilab/Projects/03.ncRNA_project/03.ncRNA_project/02.Analysis/lncRNA_copy_files/2020-05-20_mappability_scores_all_genes/genome_annotation_mappability.values.bed > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.mappability_values.bed

Also need to generate a quick file to keep the genes in pair format for later analysis. What if consistenly one of the gene pairs is poorly mappable while the other isn't? We could lose this information.

❯ awk '{print $2}' Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.txt | awk 'BEGIN {FS=","} ; {print $1,$2} ; {OFS="\t"}' | sed 's/ //g'> Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.paired_gene_names_only.txt

Huh - interesting. When plotting the mappability of non-caught gene merger pairs, it appears that one is almost always less mappable then the other.

Chromatin intersections

Decided to brute force this as well and intersect all merger classes which I do NOT agree with the chromatin modifications I had in leaf. Command:

❯ bedtools intersect -a Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.mappability_values.bed -b /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure2/00.data/histone_mods/tis_leaf_mod_H3K* -wa | sort | uniq > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.chrom_interset.bed

And went through every pair, and marked on the supplamental whether I disagreed with the call, as they're likely tandem duplicates, or if I didnt' have enough information to make the call (Missing one of our mark types 'No call'). Appended this to the supplament.

SVG location for images:

Agree: 8:79,610,450-79,700,278 1:144,830,279-144,888,748

Disagree: 7:1397571..1414180 3:227493188..227499857 (6.67 Kb)

Additiona info from out methods: 1:123,373,180-123,465,479

Jome0169 / MendietaPablo_Annotation_Paper_scripts