Closed Jome0169 closed 3 years ago
Anyways - with this all in mind, I'm interested in identifying the final set of mergers and splits, and seeing what proportion we agree/disagree on, and what we can say about these regions.
I went ahead and downloaded their file with the "Supported annotations according to our M2f procedure (Additional File 10)" to my local computer directory here: /Users/feilab/Desktop/Monnahan_et_al/Monnahan_et_al_validated_merged_genes.txt
File looks like this:
The only way you're able to decipher the comparison being made here (this contains merge/split pairs from all combinations of genomes) is by looking at the sequence ID name. B73 has identifiers like Zm00001d010484, where the 1d identifies B73.
With this in mind I'll need to filter all merged genes to those focusing on the B73 set. To do so, I ran a simple awk command looking for 1d in either column 1 or column 2. Command:
❯ awk '$2 ~ /1d/ || $1 ~ /1d/ {print $0}' Monnahan_et_al_validated_merged_genes.txt | sort > Monnahan_et_al_validated_B73_genes.txt
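The /1d/ pattern matches anywhere in the field. A slightly stricter variant, sketched below, anchors on the full Zm00001d prefix and checks each comma-separated ID in column 2. The function name filter_b73 and the toy rows are made up for illustration; on the real file this should select the same rows as long as "1d" never occurs elsewhere in an ID.

```shell
# Hypothetical helper: keep rows where column 1, or any comma-separated ID in
# column 2, starts with the B73 prefix "Zm00001d" (stricter than /1d/).
filter_b73() {
    awk '
        {
            keep = ($1 ~ /^Zm00001d/) ? 1 : 0
            n = split($2, ids, ",")
            for (i = 1; i <= n; i++)
                if (ids[i] ~ /^Zm00001d/) keep = 1
            if (keep) print
        }
    ' "$1"
}

# Toy input in the same four-column layout (illustrative rows, not real data).
filter_dir=$(mktemp -d)
printf '%s\n' \
    'Zm00004b000001 Zm00001d000001,Zm00001d000002 NA NA' \
    'Zm00004b000002 Zm00008a000001,Zm00008a000002 0.5 Merged' \
    'Zm00001d000003 Zm00004b000004,Zm00004b000005 0.1 Split' \
    > "$filter_dir/toy_pairs.txt"

# Same sort | uniq step as the original pipeline.
filter_b73 "$filter_dir/toy_pairs.txt" | sort -u
```

On the real data this would be called as filter_b73 Monnahan_et_al_validated_merged_genes.txt | sort -u.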
When looking at the number of genes here you see something somewhat odd:
❯ awk '$2 ~ /1d/ || $1 ~ /1d/ {print $0}' Monnahan_et_al_validated_merged_genes.txt | sort | uniq | wc -l
1443
1443 isn't what I was expecting. But this could be trimmed down to the 1383 mentioned in the manuscript. Quote:
Considering these split-genes along with the merged genes to which they corresponded, our analysis concerns 1275, 1383, and 2125 genes in the W22, B73, and PH207 annotations, respectively, corresponding to roughly 3–5% of all genes contained in these annotations.
Which would make sense because, looking at the file, quite a few of the B73-specific annotations have NA values like the following:
Merged Splits M2f Call
Zm00004b039508 Zm00001d025384,Zm00001d025385 NA NA
Zm00004b039512 Zm00001d025395,Zm00001d025397 NA NA
Zm00004b040088 Zm00001d026086,Zm00001d026087 NA NA
Zm00004b040104 Zm00001d026105,Zm00001d026106 NA NA
Zm00004b040200 Zm00001d026220,Zm00001d026221 NA NA
To get at the numbers of each annotation class they generated (NoCall, Merged, Split, NA), I ran the following command and got the following values:
~/Desktop/Monnahan_et_al
❯ awk '{print $4}' Monnahan_et_al_validated_B73_genes.txt | sort | uniq -c
407 Merged
232 NA
552 NoCall
252 Split
Interesting... However, this doesn't really tell us exactly how many of each merger/split class in B73 are actually supported. That "Merged" value above could correspond to a gene which is already merged in B73 (and supported as such) but split in PH207. So I need to process this further.
Next, generate the file of B73 true mergers, the file of B73 splits which should stay as true splits, and the files of No Calls and NAs. Commands:
❯ awk '$2 ~ /1d/ && $4 == "Merged" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.merged.txt
❯ awk '$2 ~ /1d/ && $4 == "Split" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.kept_split.txt
❯ awk '$2 ~ /1d/ && $4 == "NA" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.NA.txt
❯ awk '$2 ~ /1d/ && $4 == "NoCall" {print $0}' Monnahan_et_al_validated_B73_genes.txt > Monnahan_et_al_validated_B73_genes.NoCall.txt
❯ wc -l Monnahan_et_al_validated_B73_genes.*
141 Monnahan_et_al_validated_B73_genes.NA.txt
240 Monnahan_et_al_validated_B73_genes.NoCall.txt
170 Monnahan_et_al_validated_B73_genes.kept_split.txt
96 Monnahan_et_al_validated_B73_genes.merged.txt
1443 Monnahan_et_al_validated_B73_genes.txt
2090 total
So this is the set we actually want. (Note the 2090 total double-counts: the glob also matches the 1443-line parent file, and the four subsets themselves sum to 647.) This is the set of gene pairs where the B73 genes are the split side, and were thought to need either merging or to remain split. Excellent. From these we can start figuring out which of these intersect with my calls, whether they differ or agree, and whether my calls can potentially improve on their "No call" class.
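The four per-class awk commands above can also be tallied in a single pass, which makes it easy to check the subset counts against each other. This is only a sketch: tally_b73_split_calls is a made-up name and the input rows are toy data in the same layout.

```shell
# Hypothetical helper: count rows per call class (column 4), restricted to
# rows where the Splits column (column 2) holds B73 IDs.
tally_b73_split_calls() {
    awk '$2 ~ /Zm00001d/ { n[$4]++ }
         END { for (c in n) print c "\t" n[c] }' "$1" | sort
}

# Toy input (illustrative rows, not real data). The last row has B73 in
# column 1, so it is deliberately excluded from the tally.
tally_dir=$(mktemp -d)
printf '%s\n' \
    'Zm00004b000001 Zm00001d000001,Zm00001d000002 0.9 Merged' \
    'Zm00008a000001 Zm00001d000003,Zm00001d000004 0.8 Merged' \
    'Zm00004b000002 Zm00001d000005,Zm00001d000006 NA NA' \
    'Zm00001d000007 Zm00008a000002,Zm00008a000003 0.9 Merged' \
    > "$tally_dir/toy_b73.txt"

tally_b73_split_calls "$tally_dir/toy_b73.txt"
```

Run against Monnahan_et_al_validated_B73_genes.txt, the four counts should match the wc -l values for the four subset files.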
Generated my own set of passing genes using the following command: awk '{print $4}' merged_original.bed > Mendieta_et_al.passing.txt and put this in a sub directory named Mendieta. This is the list of all gene mergers which I found. I'll go ahead and grep these gene names later to see what proportion of identical genes is represented in this manuscript as compared to mine.
Grep the gene names which I have found in each of the subgroups.
parallel "rg -f Mendieta/Mendieta_et_al.passing.txt {} > {.}.Mendieta_intersect.txt" ::: Monnahan_et_al_validated_B73*
Count the number of intersections we have:
❯ wc -l *Mendieta_intersect.txt
0 Monnahan_et_al_validated_B73_genes.NA.Mendieta_intersect.txt
60 Monnahan_et_al_validated_B73_genes.NoCall.Mendieta_intersect.txt
15 Monnahan_et_al_validated_B73_genes.kept_split.Mendieta_intersect.txt
34 Monnahan_et_al_validated_B73_genes.merged.Mendieta_intersect.txt
109 total
Interesting. We're able to make 60 more IDs in their no call section. But we're only capturing 34 of their 96 identified mergers. That raises the question: why not more? Let's get the list of genes which we did NOT identify. We'll use a similar command as above but with an "inverse grep":
parallel "rg -v -f Mendieta/Mendieta_et_al.passing.txt {} > {.}.Not_Mendieta_intersect.txt" ::: Monnahan_et_al_validated_B73*
With the output:
❯ wc -l *Not_Mendieta_intersect.txt
141 Monnahan_et_al_validated_B73_genes.NA.Not_Mendieta_intersect.txt
180 Monnahan_et_al_validated_B73_genes.NoCall.Not_Mendieta_intersect.txt
155 Monnahan_et_al_validated_B73_genes.kept_split.Not_Mendieta_intersect.txt
62 Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.txt
538 total
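Both the intersect and the inverse intersect above rely on rg -f, which treats every line of the ID list as a regular expression and matches substrings; a blank line in the list would match every row. ripgrep's -F and -w flags are one way to tighten that. Another, sketched here with plain awk, is an exact lookup of each ID. It assumes the ID list has one gene ID per line; intersect_ids and its optional third argument are hypothetical names, not part of the original workflow.

```shell
# Hypothetical helper: intersect_ids ID_LIST PAIR_FILE [INVERT]
# Prints rows of PAIR_FILE where column 1, or any comma-separated ID in
# column 2, exactly equals an ID in ID_LIST. With INVERT=1, prints the
# rows that do NOT match instead (the "inverse grep").
# Note: ID_LIST must not be empty, or the NR == FNR test misfires.
intersect_ids() {
    awk -v invert="${3:-0}" '
        NR == FNR { if ($1 != "") want[$1] = 1; next }
        {
            hit = ($1 in want) ? 1 : 0
            n = split($2, ids, ",")
            for (i = 1; i <= n; i++)
                if (ids[i] in want) hit = 1
            if (hit != invert) print
        }
    ' "$1" "$2"
}

# Toy demo using two rows quoted elsewhere in these notes.
isect_dir=$(mktemp -d)
printf '%s\n' 'Zm00001d038672' > "$isect_dir/passing.txt"
printf '%s\n' \
    'Zm00004b030968 Zm00001d038672,Zm00001d038673 0.551697212524484 Merged' \
    'Zm00004b039508 Zm00001d025384,Zm00001d025385 NA NA' \
    > "$isect_dir/pairs.txt"

intersect_ids "$isect_dir/passing.txt" "$isect_dir/pairs.txt"      # matching rows
intersect_ids "$isect_dir/passing.txt" "$isect_dir/pairs.txt" 1    # non-matching rows
```

For each subset file, the two outputs should add up to the line count of the input, as the counts above do (for example 34 + 62 = 96 for the merged set).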
Now I'm going to have to go through the browser and start looking over some of these pairs to get a sense of what's going on.
Some of these don't look great. When looking at the ChIP-seq data, it's clear that some of these "Mergers" are really two distinct genes. Look at the two genes below, ID'd on line: Zm00004b030968 Zm00001d038672,Zm00001d038673 0.551697212524484 Merged
Clearly these are two separate genes based on uniquely aligned ChIP-seq reads.
Again - clearly these are separate genes: Zm00004b030968 Zm00001d038672,Zm00001d038673 0.551697212524484 Merged
For other regions it makes sense why we couldn't say anything about these loci. Poor mappability and very short genes appear to be two of the largest issues here:
So overall - I think it's clear that they're able to call regions as mergers which I can't due to mappability. This is excellent, as my method will simply not be able to pick these apart. On the other hand, my method does present some strikingly clear evidence that some of these "mergers" should not be called as such. Rather, they're likely recently evolved tandem duplicates. Their method, since it's utilizing RNA-seq, would basically make the assumption that since the two genes have similar expression they are likely a single gene. This is... Not right.
Recent duplications are still duplications. Another problem area they have here...
And other regions which I find somewhat interesting... All super small genes with an abundance of H3K4me3
So, with this all said and done - what are the numbers reported? They are:
In total we intersected with 34/96 of the predicted merged annotations in Monnahan et al.
Further, we were able to add evidence to 60 merged regions from Monnahan et al. which they were unable to call due to evidence-based cutoffs.
And in total there were 15 disagreements between the Monnahan et al. dataset and our dataset, where they call the pair a supported split and we call a merger. These may in fact be genes which are currently split, but should not be.
So in total 109 of our annotations intersected (34 + 60 + 15, the wc -l total above).
Additionally, for the remaining merged genes that did not intersect, our annotation results either cannot address these regions as they don't have appropriate ChIP enrichment (K36me3 and K4me1 required), or they actually disagree with the functional evidence the ChIP-seq assay presents. Probably won't mention this in the manuscript.
Something I need to investigate quickly is the reason 62/96 genes were not caught. Based on the above information it really appears that these genes fall into low mappability regions. This is an easy thing to test considering I have the mappability values for all genes in the genome. This should basically be a quick grep for the regions which did not intersect, and then taking the mean mappability of these regions.
Command to grab just the names from the intersect regions:
awk '{print $2}' Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.txt | awk 'BEGIN {FS=","} ; {print $1,"\n",$2}' | sed 's/ //g' > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.gene_names_only.txt
Then - using ripgrep - grab the associated gene names:
❯ rg -f Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.gene_names_only.txt /Users/feilab/Projects/03.ncRNA_project/03.ncRNA_project/02.Analysis/lncRNA_copy_files/2020-05-20_mappability_scores_all_genes/genome_annotation_mappability.values.bed > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.mappability_values.bed
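Taking the mean of the grabbed values is then a one-line awk job. The sketch below assumes the mappability score sits in column 5 of the BED file, which is a guess about the file layout and may need adjusting; mean_column is a hypothetical helper name and the demo rows are toy data.

```shell
# Hypothetical helper: mean_column FILE COLUMN
# Prints the arithmetic mean of the given column, to 4 decimal places.
mean_column() {
    awk -v col="$2" '
        { sum += $col; n++ }
        END { if (n) printf "%.4f\n", sum / n }
    ' "$1"
}

# Toy BED with a made-up score in column 5 (assumed layout, not real data).
mean_dir=$(mktemp -d)
printf 'chr1\t100\t200\tgeneA\t0.5\nchr1\t300\t400\tgeneB\t1.0\n' \
    > "$mean_dir/toy_mappability.bed"

mean_column "$mean_dir/toy_mappability.bed" 5
```

Running the same function on the mappability values for the genes that did intersect would give the comparison point for the low-mappability hypothesis.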
Also need to generate a quick file to keep the genes in pair format for later analysis. What if one gene of each pair is consistently poorly mappable while the other isn't? We could lose this information. (OFS has to be set in BEGIN so that the first line is tab-separated too; no sed step is needed.)
❯ awk '{print $2}' Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.txt | awk 'BEGIN {FS=","; OFS="\t"} {print $1,$2}' > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.paired_gene_names_only.txt
Huh - interesting. When plotting the mappability of non-caught gene merger pairs, it appears that one is almost always less mappable than the other.
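That per-pair comparison can also be tabulated directly rather than read off the plot. The sketch below joins the paired gene names against the mappability BED and reports which member of each pair scores lower. It assumes gene names in column 4 and scores in column 5 of the BED (both guesses about the file layout); pair_mappability is a hypothetical helper, and the scores in the demo are made up.

```shell
# Hypothetical helper: pair_mappability BED PAIRS [NAME_COL] [SCORE_COL]
# For each gene pair, prints: geneA scoreA geneB scoreB lower_scoring_gene
# Genes missing from the BED get "NA"; ties report the first gene.
pair_mappability() {
    awk -v nc="${3:-4}" -v sc="${4:-5}" '
        BEGIN { OFS = "\t" }
        NR == FNR { score[$nc] = $sc; next }
        {
            a = ($1 in score) ? score[$1] : "NA"
            b = ($2 in score) ? score[$2] : "NA"
            low = "NA"
            if (a != "NA" && b != "NA")
                low = (a + 0 <= b + 0) ? $1 : $2
            print $1, a, $2, b, low
        }
    ' "$1" "$2"
}

# Toy demo: real gene IDs from these notes, but made-up scores.
pm_dir=$(mktemp -d)
printf 'chr6\t1\t10\tZm00001d038672\t0.2\nchr6\t20\t30\tZm00001d038673\t0.9\n' \
    > "$pm_dir/toy.bed"
printf 'Zm00001d038672\tZm00001d038673\n' > "$pm_dir/toy_pairs.txt"

pair_mappability "$pm_dir/toy.bed" "$pm_dir/toy_pairs.txt"
```

Counting how often column 5 names the first versus the second gene of a pair would show whether the less mappable member tends to be upstream or downstream.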
Decided to brute force this as well and intersect all merger calls which I do NOT agree with against the chromatin modifications I had in leaf. Command:
❯ bedtools intersect -a Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.mappability_values.bed -b /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure2/00.data/histone_mods/tis_leaf_mod_H3K* -wa | sort | uniq > Monnahan_et_al_validated_B73_genes.merged.Not_Mendieta_intersect.chrom_interset.bed
And went through every pair, and marked in the supplemental whether I disagreed with the call (as they're likely tandem duplicates) or whether I didn't have enough information to make the call (missing one of our mark types: 'No call'). Appended this to the supplement.
SVG location for images:
Agree: 8:79,610,450-79,700,278 1:144,830,279-144,888,748
Disagree: 7:1397571..1414180 3:227493188..227499857 (6.67 Kb)
Additional info from our methods: 1:123,373,180-123,465,479
The manuscript "Using multiple reference genomes to identify and resolve annotation inconsistencies" is a recent publication by Monnahan et al. from 2020. It details a clever approach, which I appreciate, that attempts to identify potentially split genes in the maize genome.
It does so by comparing multiple versions of maize genome annotations (B73 - Ph207 - W22) against one another, and compares them in a pairwise manner to identify genes with a "One gene hits many in another" genome approach. This can be somewhat seen in their figure 1.
The manuscript goes on to basically test whether these sets of genes should be merged or kept as split genes, asking the question: "which annotation is correct?" It does so by looking at RNA-seq data across multiple different tissue types and asking whether the 'split' genes involved are actually similarly expressed. This is summed up in their metric M2f (Mean 2 fold expression difference) - Again figure 2 here:
It's definitely an interesting approach and one I appreciate. Something to note here though is that the cutoff they're using - this "top 10 of m2f" metric - is very stringent (which makes sense). Basically, if the M2f value for one of their calculated pairs doesn't fall on either side (the split or merged side) of the distribution, they do not "make a call" as it were. So this does limit the conclusions they can make about some of the potential pairs they have. For instance - initially they identify 481 potential mergers in B73 - and end up identifying 96 potential mergers, 170 genes which should remain split, and 240 which were unable to be called (this adds up to 506??).