Jome0169 / MendietaPablo_Annotation_Paper_scripts

Scripts used for analysis and pipeline of Mendieta et al. 2020

Comparing Split Annotation Class from Gramene #3

Closed Jome0169 closed 3 years ago

Jome0169 commented 3 years ago

4/26/2021

One reviewer wants me to extend this analysis by looking at the split gene annotations found in Gramene. This is a set of annotations that were identified on Gramene as being split but, for some reason, have not been fixed in the main annotation.

ftp://ftp.gramene.org/../pub/gramene/CURRENT_RELEASE/split_genes/ <- Link to the FTP site provided by the reviewer. A lab mate helped me retrieve this data, and from there I have been "off to the races", as it were. I used an awk command to generate the BED file:

❯ awk -v OFS='\t' '{if ($7=="1") print $4,$5,$6,$3,"+"; else print $4,$5,$6,$3,"-"}' zea_mays_split_genes_gramene.txt | grep -v "chr_start" | uniq | bedtools sort -i - > zea_mays_split_genes_gramene.sorted.bed

From there I realized that this file holds all the genes in a single list, not each gene paired with its corresponding merged partner.
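For anyone who would rather read the awk step as Python, here is a rough sketch of the same conversion. It assumes what the awk command implies: whitespace-delimited input with the gene ID in column 3, chromosome, start, and end in columns 4 to 6, and strand in column 7, where "1" means forward. The function name `split_genes_to_bed` is made up for illustration, and the sort only approximates what `bedtools sort` does.

```python
def split_genes_to_bed(lines):
    """Convert Gramene split-gene rows into sorted, de-duplicated BED-like tuples.

    Assumed columns (1-based, as in the awk command):
    3 = gene ID, 4 = chromosome, 5 = start, 6 = end, 7 = strand ("1" is forward).
    """
    records = set()  # a set drops duplicate rows, like uniq on sorted input
    for line in lines:
        fields = line.split()
        # Skip the header (it contains "chr_start") and malformed short lines.
        if "chr_start" in line or len(fields) < 7:
            continue
        gene, chrom, start, end, strand = fields[2:7]
        sign = "+" if strand == "1" else "-"
        records.add((chrom, int(start), int(end), gene, sign))
    # Sort by chromosome name, then start, then end.
    return sorted(records)


if __name__ == "__main__":
    import sys

    # Reads the Gramene table on stdin and writes tab-separated BED rows.
    for record in split_genes_to_bed(sys.stdin):
        print("\t".join(str(value) for value in record))
```

Unlike the `uniq` in the shell pipeline, the set also removes duplicate rows that are not adjacent in the input.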

I used a bedtools closest command to pair up the genes. The two members of a merged pair should be equally distant from each other, so each should report the other as its closest gene. Genes that do not pair up this way are just printed for later analysis and use. I also made a quick Python script to help with this.

Bedtools command: ❯ bedtools closest -a zea_mays_split_genes_gramene.sorted.bed -b zea_mays_split_genes_gramene.sorted.bed -io -t first -d > closest_thing.txt

❯ python quick_fix.py closest_thing.txt > mostly_fixed.txt
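quick_fix.py itself is not shown here, so the following is only a sketch of one way the pairing could be done, not the actual script. It assumes closest_thing.txt has 11 tab-separated columns (the 5 BED columns of the -a gene, the 5 of its closest -b gene, then the distance from -d), which puts the two gene names at 0-based positions 3 and 8. The function name `pair_reciprocal` is made up for illustration.

```python
def pair_reciprocal(lines, name_a=3, name_b=8):
    """Split genes into reciprocal closest pairs and leftover genes.

    name_a / name_b are the assumed 0-based columns holding the query gene
    name and the name of its closest neighbour in the bedtools closest output.
    """
    nearest = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(name_a, name_b):
            continue  # skip blank or truncated lines
        nearest[fields[name_a]] = fields[name_b]

    pairs = set()
    unresolved = []
    for gene, partner in nearest.items():
        if nearest.get(partner) == gene:
            # Each gene names the other as closest: treat them as a pair.
            pairs.add(tuple(sorted((gene, partner))))
        else:
            # Third members of triplets, or genes with no neighbour, land here.
            unresolved.append(gene)
    return sorted(pairs), sorted(unresolved)


if __name__ == "__main__":
    import sys

    with open(sys.argv[1]) as handle:
        found_pairs, leftover = pair_reciprocal(handle)
    for first, second in found_pairs:
        print(first, second, sep="\t")
    for gene in leftover:
        print(gene, "UNPAIRED", sep="\t")
```

Genes flagged as unpaired would still need manual review, which matches the triplets that had to be fixed by hand.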

Opened the mostly_fixed.txt file in Excel and fixed the remaining triplets there. Copied and pasted the output from the Excel editing into Vim to remove weird formatting, then ran the file through bedtools sort. Saved the final file as Gramene_split.final.sorted.bed.

In total there are 78 Gramene merged gene pairs/triplets. Doing a simple bedtools intersect command, we find that we capture 31 of the 78, so about 40%. That's fine. A decent proportion of the rest probably do not fall within regions where we have data.
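As a quick check on the arithmetic, the capture rate works out to just under 40%. The helper name `capture_rate` is made up for illustration; the counts are the ones reported above.

```python
def capture_rate(captured, total):
    """Return the percentage of entries captured, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * captured / total, 1)


if __name__ == "__main__":
    # 31 captured out of 78 Gramene merged gene pairs/triplets.
    print(f"{capture_rate(31, 78)}%")  # prints "39.7%"
```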

Fixing GTF File display

https://github.com/daler/gffutils/issues/137