Jome0169 / MendietaPablo_Annotation_Paper_scripts

Scripts used for analysis and pipline of Mendieta et al 2020
1 stars 0 forks source link

Notes #1

Closed Jome0169 closed 3 years ago

Jome0169 commented 4 years ago

5/12/2019

Alternative Species

Downloaded the other species data in the directory /scratch/jpm73279/04.lncRNA/06.annotate_other_species_2020-05-11 using the script submit_fastq_dump_download.sh . I have now downloaded all the species Zefu has given me which is excellent. Now, I just need to modify the pipeline in such as way as to more pliable in regards to the inclusion of having replicates or not.

Getting the species into the correct formats I think will be a little irritating, as they are all going to require me to weed out scaffold level assemblies, as well as generate the separate lincRNA files for each genome. Initially I think I should start on maybe getting the most interesting genomes going first - focusing on sorghum, Asparagus, Eutreme, Gycine Max

Final Metaplots

I've generated what feels like maybe 1000 metaplots, but I like to think we're closing in on the final set of metaplots.

5/15/2019

Alternative Species Continuted

Generated the directory structure for alternative species, which looks like this: ├── 05.A_officialnalis │   ├── 00.data │   │   ├── bed_file │   │   ├── chip_seq │   │   │   ├── H3 │   │   │   ├── H3K27me3 │   │   │   ├── H3K36me3 │   │   │   ├── H3K4me1 │   │   │   ├── H3K4me3 │   │   │   ├── H3K56ac │   │   │   ├── H3K9 │   │   │   └── input │   │   ├── reference │   │   └── rna

Based off of the structure which I used for Z.mays. Probably overkill, but this is what is going to allow me to quickly implement this in other species

Jome0169 commented 4 years ago

5/19/2020

Generating metaplots of genes which are on and off

Genes which are on are those >1 tpm and are mappable for each of the tissue types - the list I'm working on for this set of files if found here - /scratch/jpm73279/04.lncRNA/02.Analysis/37.control_proportion_lncRNAs_expressed_captured/02.TPM_passing_mappability and the files are named RNA_B73_ear_TPM.surviving.txt RNA_B73_leaf_TPM.surviving.txt RNA_B73_root_TPM.surviving.txt

Script to generate on/off files

   1 set -euxo pipefail
  2 awk '$2 < 1 {print $1}' ${1} > ${2}_off_gene_list.txt
  3 grep -f ${2}_off_gene_list.txt mappable_genes.values.passing.bed > ${2}_off_genes.bed
  4
  5 awk '$2 >= 1 {print $1}' ${1} > ${2}_on_gene_list.txt
  6 grep -f ${2}_on_gene_list.txt mappable_genes.values.passing.bed > ${2}_on_genes.bed 

Run using the command: parallel 'bash generate_on_off_genes.sh {} {.}' ::: *.txt

Selected Alternative Species

Not all species are of equal quality, since this analysis is dependent on the quality of ChIP-seq data, I'm biasing the further analysis to genomes with good ChIP data. The species to be used are

Jome0169 commented 4 years ago

5/20/10

Working on Generating Upset Plots / Ven Diagrams

Upset plots and ven diagrams are something i'm going to work on and finalize today. The combination of these two figures is an attempt to show the readers that these data are biologically meaningful. Currently the overlaps are between those genes with a "promoter" region and a "gene body" region. I will more than likely need to expand this as well to show what proportion of promoter regions also overlap a silencing loci.

Got upset plots working, and they are looking NICE!!!! leaf_upset.pdf

UpsetR works far better than the original upset plots - so I'm a big fan overall. Now just to get the ven diagrams working a bit better. However, having issues with the overlap regions flipping depending on the size of the interesection. Attemtpting to fix this by manually ordering the ven diagram overlaps, but this is proving to be a pain in the ass. Related issue found here: https://stackoverflow.com/questions/23687803/manual-ordering-of-sets-in-r-venndiagram?answertab=active#tab-top

Eventually got this working, realized the way I was assigning the plot names was incorrect. Finally - scripts used to generate the below figure was:

upsetPlots: generate_upset_plots.R Ven Diagrams: generate_ven_diagrams.R Both scripts were working on data found in the directionry dir: /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure2

Now the rough draft is currently looking like

Figure_2_test

Bobs comments were as follows:

looks sharp.  are you going to explore why there are so many more k56ac/k4me3 peaks that k36me3/k4me1

although on the right side i don't know what the labels are.  i'm assuming overlap, k4, k56?

oh i see "initiation" and "elongation" i would change this to k4me1/k36me3 and k4me3/k56ac to be more accurate

it's interesting that so many "initiation" are lower expressed. i wonder if there are cell type specific so the k4me1/k36me3 enrichment is low.  you could make metaplots of k4me3/k56ac for the "elongation group" and k36me3/k4me1 for the "initation" group and compare to overlap groups.  it might show there is weak encirhment or maybe nothing at all
Jome0169 commented 4 years ago

5/21/2020

Metaplots for Elongation/Initiation only classes

Bob mentioned he wanted to see metaplots for these two classes - so to do this I appeneded a little function at the end of generate_upset.R that outputs two bed files of genes with only elongation modifications, or genes with only initiation modifications. Such ear_elongation_only.bed, ear_initiation_only.bed - I copied these bed files over to the cluster, into data_copy_over scp -r 02.isolated_markes jpm73279@xfer.gacrc.uga.edu:/home/jpm73279/data_copy_over/ , and linked these files to the bed file section of /scratch/jpm73279/04.lncRNA/02.Analysis/23.generate_all_metaplots - added them to the snakemake, and am currently running them to generate all the possible metaplot intersections we could possibly want.

On/Off Metaplot Generation

Generated metaplots, and found a weird trend in leaf, there's a chunk of "Off genes" which look very on, meaning they seem to have an enrichment for K36me3 and K4me1.
leaf_final_on_off.pdf

Diving into this group though, it has become apparent that the large portion of these loci are just off genes overlapping an "On gene" throwing off this analysis.

Example: http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=8%3A138011130..138020679&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CSchmitzlab_RNA_data_B73_leaf_rep1%2CSchmitzlab_RNA_data_B73_leaf_rep2&highlight=

In order to fix this I'm only going to "off genes" that are not intersecting an "on gene ". Modified the file generate_on_off_genes.sh to instead look like: (Note the bed tool section)

  1 set -euxo pipefail
  2 awk '$2 < 1 {print $1}' ${1} > ${2}_off_gene_list.txt
  3 grep -f ${2}_off_gene_list.txt mappable_genes.values.passing.bed > ${2}_off_genes.tmp.bed
  4
  5 awk '$2 >= 1 {print $1}' ${1} > ${2}_on_gene_list.txt
  6 grep -f ${2}_on_gene_list.txt mappable_genes.values.passing.bed > ${2}_on_genes.bed
  7
  8 bedtools intersect -a ${2}_off_genes.tmp.bed -b ${2}_on_genes.bed -v > ${2}_off_genes.bed
  9
 10 rm ${2}_off_genes.tmp.bed

Currently rerunning the generation of metaplot bin files

Calling H3K27me3 Peaks

Working dir: /scratch/jpm73279/04.lncRNA/02.Analysis/02.call_peaks Working off a modified version of my annotation script - instead of going through whole pipeline, will instead just go up until the peak calling section

Jome0169 commented 4 years ago

leaf_initiation_only_leaf_plot_draw_2.pdf

5/21/2020

Worked on generating Initiation Only and Elongation Only Metaplots

R script: etaplot_initiation_elongation_only.R Example of plots generated: leaf_elongation_only_leaf_plot_draw_2.pdf

Possible scaling issue - its clear to me that since I'm normalizing by gene, and extending them to be 5000 bp, this causes some of the metaplots to be fucked, especially for my smaller group of genes which are initiation only. It would make sense then that these loci have the entirety of their signal widened out, hence why it looks more like H3 and H3K4me3. It's just a byproduct of scaling. I guess I'll have to show bob that - but everytime shit like this pops up, it's quite a bit of work to fix

Jome0169 commented 4 years ago

5/25/2020 Happy Memorial Day!

Metaplot Elongation/Initiation only

Re-ran the metaplots using just the TSS, and it seemed to improve the results and remove that H3 Looking peak thing. I think it really clarified the analysis, and was a clear demonstration that the scaling issue is what was causing that abborant signal.

leaf_initiation_only_leaf_TSS_non_scaled.pdf

I generated these figures as well for the elongation class of genes which ONLY have elongation. leaf_elongation_only_leaf_TSS_non_scaled.pdf

Overall I think it's a pretty compelling figure showing annotation issues.

Upset plots multi split

Also updated the upset plots to show those regions which do NOT intersect our peaks, As well as made a couple other metaplots looking at what the individual intersection of mods looks like, versus what it looks like when we require genes to have both modifications to classify as being "Expressed"

Original - Elognation/Initiation defined as having ONE of the mods leaf_upset.pdf

Elongation/Initiation is defined as having BOTH mods leaf_upset.both_mods_req.pdf

Upset plot of ALL splits leaf_upset.split_mods.pdf

Jome0169 commented 4 years ago

Just talked to Bob, and he made a point that there's this weird amount of discongruity between the peak calls, and the relative number of K36me3/K4me1 overlaps, and a much smaller number of genes overlapping these based off the metaplot. So for instance,

leaf_combined ven Here we're finding 15,456 peaks of Elongation overlapping each other, but the upset we're only capturing 8,000 regions, opening up the question - wtf is happening to those other ~8000 peaks

leaf_upset.both_mods_req.pdf

Jome0169 commented 4 years ago
Tracing back Ven diagram error

Couple of things - realized that the Ven diagram issues of the values in the ven diagram plot not matching the values in the the number of genes overlapping was caused by working on old ven diagram data. Fixed this by re-downlaoding, and re-running the command generate_ven_counts.sh . So, now the intersect in the ven diagram makes sense.

HOWEVER - after really going back and looking at called peaks, I realized that one thing I was doing was pretty dumb - I was merging replicates, and not taking the intersect which is a big no no I am not realizing. Merging these loci makes it near on impossble to accuretly account for where things are coming from. So in Generate_updated_annotation.snake, I replaced

294 #cat {input.replicate1} {input.replicate2} | bedtools sort -i - | bedtools \
295 #merge -d 25 -c 4,5 -o collapse -i - | bedtools sort -i - > {output} ;

with

bedtols intersect -a {input.replicate1} -b {input.replicate2} | bedtools sort -i - > {output}
Jome0169 commented 4 years ago

5/26/2019

Fixed the ven diagrams mostly - re downloaded data from sapelo2, and realized the called peaks and intersections were based on an old batch of files, and not currently up to date. Fixed this by downloading the new files, rerunning the ven diagram script, and BOOM! Things look quite a bit better.

leaf_combined ven

Way higher number of peaks, which is good. Also see this improvement in the upset plot as well - leaf_upset.both_mods_req.pdf

However, there is one thing I need to fix - when generating these upset plots, and left_joining() the TPM file with the overlap file, there are quite a few genes which boot back NAs with their gene overlap information. Many of these genes are ncRNA genes, which makes sense considering this overlap was done for only protein coding genes, and not ncRNA genes. So, looking like I'll need to go back and fix this quickly at some point to make sure I'm getting ALL genes with greater than 1 TPM.

To Note as well: something weird has been happening with the number of overlaps between replicates, and the ven diagrams generated in the pipeline. For whatever reason, the overlap seen in the ven diagram does NOT equal the number of replicates .

On/Off Metaplots

Something I need to check on quickly is how I'm using mappability in these experiments. Currently, the way I'm using it is that the genes I'm considering are those that pass my mappability requirements, which require a map score >.65 mappability. This value is set in the

DIr: /scratch/jpm73279/04.lncRNA/02.Analysis/37.control_proportion_lncRNAs_expressed_captured Script: lncRNA_tpm_calculation.snake

all_genes_mappability

This seems to substantially increase the number of genes considered which is good. At some point today, I'll need to go back and log the number of genes surviving this filter - but broad guess based off the above chart is we can talk about >75% of the genes in the genome using this method.

Redoing this mappability made me re-run the metaplot generation of surviving genes on/off - step

1) Re-gen on/off class parallel 'bash generate_on_off_genes.sh {} {.}' ::: *.txt in Dir: /scratch/jpm73279/04.lncRNA/02.Analysis/37.control_proportion_lncRNAs_expressed_captured/02.TPM_passing_mappability

2) Re-gen metaplots Dir:/scratch/jpm73279/04.lncRNA/02.Analysis/23.generate_all_metaplots Command:qsub submit_metaplot_generation.sh

This also made me redo the upset plots which was pretty quick, as mentioned above re downloaded the TPM files from the mappability directory. Re-generated the "initiation only" class, as well as the "elongation only" class of elements as BED files, and will need to re-upload these guys to the cluster, and. then re-reun metaplots just like above.

Currently running - redid the write bed file output as well, so we should be good to go. Submitted job - ID # 2451502.sapelo2

Jome0169 commented 4 years ago

5/27/2020

On/Off Metaplots

Finally got these working last night - sorted them based off gene expression as well, so these should be good to go once they're re-generated this morning.

For the most part these plots look good - but something appears to be happening with the ROOT samples - doesn't look like the CHIP or ATAC-seq was nearly as good which is pretty irritating. I'm sure I will be asked about this by Bob once I show him the final figure. Speaking of which, it looks overall pretty solid.

Something I still need to do is adjust the padding on the metaplots - which will be a pain in the ass considering I'll have to redraw them for the 20000th time - and they take about 20 min to process, but whatever. All figures for Figure 2 as I'm calling it can be found here Dir: /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure2/imgs with the metaplots being in a sub directory here /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure2/imgs/metaplots. Scripts used to generate these files were generate_upset_plots.R , generate_ven_diagram.R metaplots_figures.R . Metaplot data dir: /Users/feilab/Projects/03.ncRNA_project/02.Analysis/lncRNA_copy_files/2020-01-06_metaplots_lncRNA_RNAOnly_vs_Myset and Upsetplot/Vendiagram data dir: /Users/feilab/Projects/03.ncRNA_project/02.Analysis/lncRNA_copy_files/2019-11-07_gene_lncRNA_TPM_overlap_check/

Draft 1:

Figure_2_merged

Jome0169 commented 4 years ago

5/29/2020

Talked to Bob this morning - overall figure is looking pretty solid. Something I still need to figure out is where all those extra Initiation Peak overlaps are going. Only about ~16000 of the ~28000 peaks are reported in expressed genes - asking the question - where are these extra peaks coming from? This could be an issue of being too accepting of this type of peak calling - or it could be that there are probably alot of poised promoters in the genome which is what I think it probably happening. However, I need to prove this.

To do this -- I'm calling peaks on H3K27me3 regions, and we're going to see what proportion of those regions which do NOT overlap with these regions are good to go.

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=1%3A130585421..130602780&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=1%3A165481637..165483372&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=1%3A156539521..156543860&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered%2CSchmitzlab_RNA_data_B73_leaf_rep1&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=1%3A174668341..174672680&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered%2CSchmitzlab_RNA_data_B73_leaf_rep1&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=1%3A222947186..222951470&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered%2CSchmitzlab_RNA_data_B73_leaf_rep1&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=9%3A110364741..110381880&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered%2CSchmitzlab_RNA_data_B73_leaf_rep1&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=9%3A60732501..60749640&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered%2CSchmitzlab_RNA_data_B73_leaf_rep1&highlight=

Quick test to see where many of these regions are falling

## Number of regions overlapping a gene turned OFF

bedtools intersect -a histone_mods/tis_leaf_mod_H3K56ac_group_narrow_merged.bed -b histone_mods/tis_leaf_mod_H3K4me3_group_narrow_merged.bed | bedtools intersect -a - -b histone_mods_intersecting_genes/genes_leaf_overlapping_H3K36me3_broad.true.off.bed| wc -l
**Output: 5420**

## Number of regions overlapping a gene which is ON 
bedtools intersect -a histone_mods/tis_leaf_mod_H3K56ac_group_narrow_merged.bed -b histone_mods/tis_leaf_mod_H3K4me3_group_narrow_merged.bed | bedtools intersect -a - -b histone_mods_intersecting_genes/genes_leaf_overlapping_H3K36me3_broad.true.bed | wc -l
**Output: 20932**

## Number of regions which are not overlapping ANY of these regions.
bedtools intersect -a histone_mods/tis_leaf_mod_H3K56ac_group_narrow_merged.bed -b histone_mods/tis_leaf_mod_H3K4me3_group_narrow_merged.bed | bedtools intersect -a - -b histone_mods_intersecting_genes/genes_leaf_overlapping_H3K36me3_broad.bed -v | wc -l
**Output: 6088** 

So, remembering back to this, one of this discordant issues I remember seeing here is there was a class of ncRNAs which I didn't factor into the overlapping genes analysis. This was due to me seperating the genes/lncRNAs for looking at overlap between them individually. So, I remember there being about ~6000 genes which were not factored into the downstream analsysis of overlapping a gene. So this might need to be fixed.

For instance, in the data used in the upset plot - here is the result of the number of genes with TPM values reported that do NOT have any hits in the corresponding histone files because they were weeded out.

#Genes with no histone mods corresponding
Observations: 6,616
Variables: 6
$ TPM      <dbl> 1.1151301, 14.0413865, 0.8250455, 0.0000000, 86.803…
$ gene_ID  <chr> "gene:Zm00001d027252", "gene:Zm00001d022649", "gene…
$ H3K36me3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ H3K4me1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ H3K4me3  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ H3K56ac  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

## Genes with greater than 1 TPM 
> final <- leaf_overlap_expression %>% 
+   filter(is.na(H3K36me3) == TRUE & TPM > 1) %>% 
+   glimpse()
Observations: 789
Variables: 6
$ TPM      <dbl> 1.115130, 14.041387, 86.803772, 5.087781, 19.316321…
$ gene_ID  <chr> "gene:Zm00001d027252", "gene:Zm00001d022649", "gene…
$ H3K36me3 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ H3K4me1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ H3K4me3  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ H3K56ac  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

So, after thinking about it something that might be an issue here is the fact that at times there ar tons of "array peaks" of these marks. Initiation marks are small, which makes calling them almost harder, because it seems like the false posotive rate is higher than you would think. One way I can assay this is by doing two quick bedtools scals -

bedtools intersect on gene set, and count the number of times a gene hits an initiaion peak - I bet there will be quite alof of them that hit 2+ peaks. And second, generate a histogram of cloeset using initiation mods versus themselves.

Merge marks - 
bedtools intersect -a histone_mods/tis_leaf_mod_H3K56ac_group_narrow_merged.bed -b histone_mods/tis_leaf_mod_H3K4me3_group_narrow_merged.bed | bedtools sort -i - | bedtools merge -i - > tis_leaf_init_marks_merged.sorted.bed

bedtools closest -a tis_leaf_init_marks_merged.sorted.bed -b tis_leaf_init_marks_merged.sorted.bed -io  -d > tis_leaf_init_marks_merged.dist_to_cl
osest.bed

Number of peaks intersection gene features
❯ bedtools intersect -a /Users/feilab/Programming/38.update_annotation/00.data/annotation/Zea_mays.AGPv4.38.chr.genes.bed -b tis_leaf_init_marks_merged.sorted.bed -c > Zea_mays.AGPv4.38.chr.genes.intersect_init.bed

Issue Area: http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=8%3A80229721..80246860&tracks=DNA%2Crnas%2Cgenes%2Ctransposons%2CB73_leaf%2Ckmer75_0_MM_genmap%2CChip_unique_B73_leaf_H3K27me3_rep3%2CChip_unique_B73_leaf_H3K27me3_rep4%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_input_rep1%2CChip_unique_B73_leaf_input_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2Cleaf_rep1_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K4me3_peaks_narrow_region.filtered%2Cleaf_rep2_H3K56ac_peaks_narrow_region.filtered%2Cleaf_rep1_H3K56ac_peaks_narrow_region.filtered%2CSchmitzlab_RNA_data_B73_leaf_rep1&highlight=

Jome0169 commented 4 years ago

Made a quick graph of the data above - looks like many genes are overlapping 2 of these regions which is ...WRONG... Gene_intersection_initiation_count

Also - most inititation modification data is SUPER close to each other, so need to re-tune and recall initiation peaks. Initiation_mods_distance_to_closest

Jome0169 commented 4 years ago

6/2/2020

Yesterday switched pipeline from using peakRanger to MACS2. Peakranger is just too unrelaiable, and it has issues with smaller scaffolds. So, when considering I will be building this pipeline out, doesn't make sense considering that many of the other species genomes this will be running on will not include 'good' assemblies. So, this has been changed in my pipeline.

That being said this now has come along with it's own issues. MACS2 is far worse at calling broad peaks, and calls hyper fragmented peaks. This is an issue when intersecting replicates, as replicates both have very different peak calls, dropping a high number of peaks.

leaf_combined ven leaf_upset split_mods

So, before calling this transition to MACS2 succesful, I'm going to need to do a similar analysis above on the replicates of the peaks calls to figure out the correct merge command to be done on the peaks called in replicates - in order to generate more realisitic large peaks instread of the hyper fragmented ones.

Command comparing closeness of peaks within itself (How fractured are peaks?)

~/Projects/03.ncRNA_project/02.Analysis/MACS2_merge_analysis
❯ parallel "bash closest_command_self.sh {} {.}" ::: histone_mods_replicates/*

Ear K4me1 ear_K4me1_reps_self Ear K36me3 ear_K36me3_reps_self Leaf K4me1 leaf_K4me1_reps_self Leaf K36me3 leaf_K36me3_reps_self

Command comparing closeness of replicate peak calls (How reproducible are fractured peaks?)

~/Projects/03.ncRNA_project/02.Analysis/MACS2_merge_analysis/histone_mods_replicates
❯ while IFS=" " read -r file_1 file_2 output_file ; do bash closest_command_replicate.sh ${file_1} ${file_2} ${output_file}; done < rep_pairs.txt

Leaf K4me1 leaf_K4me1_between_reps Ear K4me1 ear_K4me1_reps_self Leaf K56ac (just curious) leaf_K56ac_between_reps Leaf K36me3 leaf_K36me3_between_reps

So - from these analysis it's clear to me that we need to fuse/merge peaks within at least 1200 base pairs within broad calls. This will now need to be updated.

Jome0169 commented 4 years ago

Previous analysis was done in dir: /Users/feilab/Projects/03.ncRNA_project/02.Analysis/MACS2_merge_analysis

Attempting to get into the home stretch here, and be done with this fucking analysis. Jfc. Realized we were pretty good to go, until one of the ear replicates of H3K4me1 had upwards of 90000 peaks. Realized since I was merging my replicates instead of taking the intersect this was an issue. I have fixed this by going back and taking the intersect between replicates instead.

This is why I had to choose intersect instead of merger. A big part of this is also coming from the way I'm calling broad peaks in MACS2 by using the --nolambda paramter, which actually allows MACS2 to call broad peaks more accuretly.

ear_H3K4me1_broad_intersection

However, with this comes issues in t he Upset plot because of COURSE that's what happens - leaf_upset.both_mods_req.pdf

leaf_upset.split_mods.pdf

Issue now being the intersect is leaving out a large group of promoter regions, as well as a signifigant group of genes with no overlapping K4me1 overlap but the other three modifications. This. Is. So. Irritating. Feels like two step forwards one step back.

Jome0169 commented 4 years ago

▶ awk '$3 == "gene" {print $0}' Sviridis_500_v2.1.gene.gff3 | gff2bed | less -S > Sviridis_500_v2.1.gene.bed

6/8/2020

Alot of last week was incredibly frusturating because I was optimizing MACS2 for broad peak calling, which it was clearly underperforming at. At no point did I feel comfortable with the parameter choices I was making, as many of them were turning off what I thought to be critical functions of error checkign such as the local lambda function in MACS2. This morning, after almost zero tuning, I got epics2 working far better than MACS2 broad ever did, and I nkow this peak called is optimized for broad peak calls. We are switching to EPICS2. I really hope this is the FINAL time I have to switch peak callers, but I have a feeling this is the correct solution.

Link to EPIC2: https://github.com/biocore-ntnu/epic2 Command I got workign ASAP: epic2 --treatment leaf_H3K36me3_chrom10.bam --control leaf_input_chrom10.bam --chromsizes chrom_10_names.txt --false-discovery-rate-cutoff .1 --keep-duplicates --output epic2_fdr_10

Quick Example:

Screen Shot 2020-06-08 at 9 59 27 AM

NO MORE PEAK CALLER HORSE SHIT.

Jome0169 commented 4 years ago

6/9/2020

Updated the annotation script to use epic2 as mentioned yesterday. Less tuning, better performance. However, a classic issue has creaped up now of the ear H3K4me1 replicate being way worse than the original.

ear_ext_ven_plot ven

This was further validated when I looked at the replicate intersects. ear_H3K4me1_broad_intersection

I've never been able to retrace the original replicate I was using in winter sadly. So with this in mind, we're just going to not take interesections of ear_H3K4me1. There is currently no way around this, and I would rather just move forward and not spend more time sear

FInally got things sorted out - called peaks on K27me3 as well to intersect promoter regions which do NOT overlap K36me3 regions. These peaks were called in the directory:

dir: /scratch/jpm73279/04.lncRNA/02.Analysis/02.call_peaks snake_file: call_chip_seq_peaks.snake copied filed to: 2020-06-09_H3K27me3

Intermal snakemake file here is very simple, and just a call to epics2.

Jome0169 commented 4 years ago

Going forward - finding a ton of regions of enrichement for initiation, and low values of elongation? Small ass genes?

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=4%3A55207561..55225500&tracks=rnas%2Cgenes%2CChip_unique_B73_ear_H3K27me3_rep1%2CChip_unique_B73_ear_H3K27me3_rep2%2CChip_unique_B73_root_H3K27me3_rep4%2CChip_unique_B73_root_H3K27me3_rep2%2CChip_unique_B73_root_H3K27me3_rep1%2CChip_unique_B73_root_H3K27me3_rep3%2CBED_tis_ear_mod_H3K27me3_group_broad_merged_0%2CBED_tis_leaf_mod_H3K27me3_group_broad_merged_1%2CBED_tis_root_mod_H3K27me3_group_broad_merged_2%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CBED_tis_leaf_rep_rep2_mod_H3K4me1_peaks_broad_region.filtered_5%2CBED_tis_leaf_rep_rep1_mod_H3K4me1_peaks_broad_region.filtered_3%2CBED_tis_leaf_rep_rep1_mod_H3K36me3_peaks_broad_region.filtered_4%2CBED_tis_leaf_rep_rep2_mod_H3K36me3_peaks_broad_region.filtered_6%2CBED_tis_leaf_mod_H3K4me3_group_narrow_merged_7%2CBED_tis_leaf_mod_H3K56ac_group_narrow_merged_8%2CSchmitzlab_RNA_data_B73_leaf_rep1%2CSchmitzlab_RNA_data_B73_leaf_rep2&highlight=

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=6%3A37692121..37710060&tracks=rnas%2Cgenes%2CChip_unique_B73_ear_H3K27me3_rep1%2CChip_unique_B73_ear_H3K27me3_rep2%2CChip_unique_B73_root_H3K27me3_rep4%2CChip_unique_B73_root_H3K27me3_rep2%2CChip_unique_B73_root_H3K27me3_rep1%2CChip_unique_B73_root_H3K27me3_rep3%2CBED_tis_ear_mod_H3K27me3_group_broad_merged_0%2CBED_tis_leaf_mod_H3K27me3_group_broad_merged_1%2CBED_tis_root_mod_H3K27me3_group_broad_merged_2%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CBED_tis_leaf_rep_rep2_mod_H3K4me1_peaks_broad_region.filtered_5%2CBED_tis_leaf_rep_rep1_mod_H3K4me1_peaks_broad_region.filtered_3%2CBED_tis_leaf_rep_rep1_mod_H3K36me3_peaks_broad_region.filtered_4%2CBED_tis_leaf_rep_rep2_mod_H3K36me3_peaks_broad_region.filtered_6%2CBED_tis_leaf_mod_H3K4me3_group_narrow_merged_7%2CBED_tis_leaf_mod_H3K56ac_group_narrow_merged_8%2CSchmitzlab_RNA_data_B73_leaf_rep1%2CSchmitzlab_RNA_data_B73_leaf_rep2&highlight=

Jome0169 commented 4 years ago

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=8%3A105965901..106072300&tracks=rnas%2Cgenes%2CChip_unique_B73_ear_H3K27me3_rep1%2CChip_unique_B73_ear_H3K27me3_rep2%2CChip_unique_B73_root_H3K27me3_rep4%2CChip_unique_B73_root_H3K27me3_rep2%2CChip_unique_B73_root_H3K27me3_rep1%2CChip_unique_B73_root_H3K27me3_rep3%2CBED_tis_ear_mod_H3K27me3_group_broad_merged_0%2CBED_tis_leaf_mod_H3K27me3_group_broad_merged_1%2CBED_tis_root_mod_H3K27me3_group_broad_merged_2%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CBED_tis_leaf_rep_rep2_mod_H3K4me1_peaks_broad_region.filtered_5%2CBED_tis_leaf_rep_rep1_mod_H3K4me1_peaks_broad_region.filtered_3%2CBED_tis_leaf_rep_rep1_mod_H3K36me3_peaks_broad_region.filtered_4%2CBED_tis_leaf_rep_rep2_mod_H3K36me3_peaks_broad_region.filtered_6%2CBED_tis_leaf_mod_H3K4me3_group_narrow_merged_7%2CBED_tis_leaf_mod_H3K56ac_group_narrow_merged_8%2CSchmitzlab_RNA_data_B73_leaf_rep1%2CSchmitzlab_RNA_data_B73_leaf_rep2%2Ckmer75_0_MM_genmap&highlight=

6/10/2020

Sent these regions to Bob, and he mentioned it's clear that many of these regions have no K4me1 region. Need to do a quick analysis of how many of these regions overlap a K36me3 peak, and only K36me3 peak. See if this groups 'leans' one way or another. If it appears to have only one type of enrichment, it will more than likely be worth describing in a few words.

Script to generate these regions is called initiation_overlapping_K27me3.sh . A quick little script that does some course bedtools intersect to ID regions of intitiaion not overlapping genes, and not overlapping K27me3 peaks. Generates the following file list:

ear_initiation_intersecting_H3K27me3.bed
ear_initiation_intersetion.bed
ear_initiation_not_intersecting_H3K27me3.bed
leaf_initiation_intersecting_H3K27me3.bed
leaf_initiation_intersetion.bed
leaf_initiation_not_intersecting_H3K27me3.bed
root_initiation_intersecting_H3K27me3.bed
root_initiation_intersetion.bed
root_initiation_not_intersecting_H3K27me3.bed

From this I'll also need to generate a quick ven diagram of the number of initiation mods that overlap the K27me3 and initiation peaks. Granted these numbers appear to be relativly limited. For instance:

❯ wc -l *initiation_intersecting_H3K27me3*
     901 ear_initiation_intersecting_H3K27me3.bed
     761 leaf_initiation_intersecting_H3K27me3.bed
    1212 root_initiation_intersecting_H3K27me3.bed

This isn't a bulk of the extra initiation regions we're finding - rather the bulk appear to be in this weird category of overlapping K36me3 and not K4me1.

To futher subdivide this data of initiaion only overlapping a single elongation region, going to intersect them with peaks of K4me1 and K36me3. This will probably suffer from the issue though of current peak calls using EPIC2, more than likely biasing broad peak calls to large regions, and missing on these small regions of enrichment. But lets' try doing this the way it stands initially before rewriting the snakefile to take intersections between epic2 and a MACS2 call for narrow inititiation regions.

Generate script initiation_overlapping_elongation_1_type.sh to generate a file list of intiation marks NOT overlapping GENES or H3K27me3 and asking the question - do these regions overlap elognation marks, and if so, of what species?

FIles output:

    581 leaf_initiation_overlapping_K36me3_only.bed
     527 leaf_initiation_overlapping_K4me1_only.bed
     645 leaf_initiation_overlapping_both_elong_mods.bed
    2847 leaf_initiation_overlapping_neither_elongation.bed
    4600 total

Overall those overlapping a single species looks good. But those in only the "neither elongation" class look like they are overlapping K36me3 peaks. But it's clear that epic2 is biasing these peak calls to larger regions - so have to merge a epic2 and MACS2 narrow peak call. ---sigh ---

Jome0169 commented 4 years ago

Remodeled broad peak calling step AGAIN - this time threw in a narrow peak call from MACS2. However the issue still persists, the H3k36me3 peaks are just too weak to call - so instead what I'm going to do is take those regions which are not overlapping a K4me1 Or K36me3 peaks, and generate metaplots based off these regions. Going to use a reference point center call, and use a quickly modified "generate_metaplot_script" to hopefully knock this out very quickly.

Directory to generate metaplots for these groups overlapping initiation: dir: /scratch/jpm73279/04.lncRNA/02.Analysis/03.metaplots_initiation_only_regions . Making metaplots of

root_initiation_only_no_gene: 00.data/bed_files/root_initiation_overlapping_neither_elongation.bed
leaf_initiation_only_no_gene: 00.data/bed_file/leaf_initiation_overlapping_neither_elongation.bed
ear_initiation_only_no_gene: 00.data/bed_files/ear_initiation_overlapping_neither_elongation.bed

To hopefully show that yeah - these regions are kinda interesting.

Also - something I've realized is the number still aren't perfectly adding up in terms of the number of initiation peaks, and where they're going. Tried to explain this with K27me3, as well as those not intersecting, but we're missing some. For example -

leaf_combined_side_by_side ven

Ven diagram shows 26550 peaks

leaf_upset.both_mods_req.pdf

Upset plots shows in total 17,000 GENES overlapping initiation mods - meaning there is 9,000 peaks unaccounted for. Only an additional 761 peaks can be explained by K27me3 overlaps, and an additional 1600 of them can be explained by either overlapping a K36me3/K4me1 mod, and an additional 2847 of them can be explained by this weird solo group as labeled above. So in total we can explain the location of an additional ~4894 peaks. But we're still missing about ~4000 peaks.

My hunch is that multiple genes are overlapping more than 1 peak. Different isoforms kinda thing. Did this, and it appears this will make up the missing peaks. This will be a quick description, and I need to write some sort of little script to gather all this information seemlessly.

Jome0169 commented 4 years ago

6/11/2020

KMeans cluster analysis

Spent quite a bit of yesterday analyzing the class of regions which are enriched for initiation and nothing else. Generated metaplots for these regions, and used a Kmeans approach to see if we can subdivide them into any sort of meaninful group. And... Kinda? It's honestly not very clear.

leaf_initiation_only_leaf_plot_kmeans_cluster2.pdf leaf_initiation_only_leaf_plot_kmeans_cluster3.pdf leaf_initiation_only_leaf_plot_kmeans_cluster4.pdf

There look like there migh be an interesting class here, but I'm kinda unwilling to dive into this anymore, for fear of getting off track. I think there's a clear example of there are some regions in the genome which appear to be rather small, and have slight enrichment of K36me3. But the metaplots don't really show the lack of K4me1 as anything super convincing. I will leave this be for the time being

Finalizing Figure 2 Generation

Working on getting a final draft of figure 2 worked out.

Alex made the point also that when you upload an UPSET plot in R - illustrator sees these regions as a werd box filled with an 'x' - this is caused because illustrator thinks it's a text element that doesn't exist in a specific font. So i had to go in and manually change these regions.

Figure_2_Final

Jome0169 commented 4 years ago

6/16/2020

Update Annotation Fix

Realized that upon running my update.py script, a ton of the chromosomes are coming back with no annotation fixes. So, something is wrong with the script, and this needs to be dealt with.

So, I have been putting this off for quite some time, but I think I need to write unit test for all of these functions. This is going to require generating a test set, and activly mining this area for the missing annotation, and the annotations that need to be fixed. This is going to take some time, but will make me sleep better at night.

Metaplot update

Shit was wrong with the metaplots I was generating. Sorting by gene expression wasn't working. Realized that I wasn't normalizing in BAM compare. So, altered this and re-ran the metaplot generation script bamCoverage -p {threads} --normalizeUsing CPM -bs 2 -b {input} -o {output} . Something I'm also trying to figure out is why the sorting by experssion has been so weird. RE-calculated TPM values in the correct way, but still, odd overall.

Screenshot possabilities

Merger Class: 5:89,564,148-89,597,745 4:112849401..112915850 (66.45 Kb) 4:48307101..48373550 (66.45 Kb) 2:97601601..97734500 (132.9 Kb) <- Currently going with this one

Novel Class: 10:97521321..97547900 (26.58 Kb)

Jome0169 commented 4 years ago

Issue regions to reservay: 10:62682226..62688415 (6.19 Kb)

6/22/2020

Yo what up - happy monday me. Spent this morning finalizing the update.py script, and overall it's working pretty well. Hitting some issues on SOrghum, where some of the annotations are already overlapping slightly, which makes the final merger step a bit tricky. Maybe this could be altered by instead of doing a merger, doing a a self intersect - requiring reciprocal overlap. Something I will figure out later.

But right now, program runs, and runs pretty well on Zea mays - chromosome 10. Test set looks good and all, so copied it over - ran on sorghum, and things look fine overall. Also used screenshots to generate a presentation for tomorrow to show these potential collaborators.

Metaplot Issues

I'm still unsure what's going wrong with the metaplot - I've decided to just generate a metaplot of those loci which Don't overlap K36me3 peaks and those which

6/24/2020

Working on Update.py ear runs

For some reason, when run on the cluster - the Update.py script appears to fail on all ear samples? Completly unsure why this is at the moment, so I went ahead and downloaded the ear chromosome 10 to see if I can figure out what's going wrong with this ear_chrom_10. Hopefully should be a simple fix - copy it over once the cluster is back up and re-run this.

Update- Odd, this seems to work when I run this locally. Sadly the cluster is down, so I will be unable to address this until later today when the cluster returns. Something that is still worth outputting here is a verbose intermediate output command - where I would basically output all intermediate files of importance that would more easily allow me to hone in on the issues that this script sometimes encounters.

Metaplot Issues Cont'd

Metaplot is still suffering from the same issues as outlined above. I am trying to deal with this in two ways

1) Currently I am going to investigate whether increasing the "mappability" survival score had the intended effect of decreasing the number of Loci with > 1 TPM that didn't have any signal of H3K36me3 in the metaplot.

2) If this doesn't work, select genes by those which are overlapping H3K36me3 peaks, then do a similar sort to see if this fixes the issues.

Jome0169 commented 4 years ago

6/29/2020

Ear Update Issue

Realized that the issue ear tissue was having was that there were too many RNA-seq reads from ear tissues - maxing out at 82 GBs of RNA-seq reads. So going to subsample these reads, and re-run this quickly. Will subsample to the size of leaf tissue which is in total 92093946 reads.

Total number of reads in ear: 261526518. So just about .08% subsample which is

seqtk sample -s100 RNA_B73_tis_ear_all.fastq .3 > RNA_B73_tis_ear_subsample.fastq

jpm73279@sapelo2-sub5:00.data/rna/raw
▶ rm RNA_B73_tis_ear_all.fastq

jpm73279@sapelo2-sub5:00.data/rna/raw
▶ mv RNA_B73_tis_ear_subsample.fastq  RNA_B73_tis_ear_all.fastq

Metaplot Issue

Going back to the above mentioned metaplot issues, this has been resolved. Overall the main issues I realized which was occurring during the metaplot sorting was that the order was not actually being conserved in the expression ordering. This too me basically a full day to resolve irritatingly enough, and I was able to trace back the issue to the fact that I was sorting genes names lexicographical rather than in the order assigned to them by expression level.

So, I kept gene names during the merging of the on/off gene set, and huzzah - this worked. And only took me a super long time to fix.

Jome0169 commented 4 years ago

6/29/2020 Cont'd

Last week finally got the section headers taken care of in regards to images. The main thought here being that every annotation type which is going to be fixed is going to have example of what that data looks like on a genomic scale.

Screen Shot 2020-06-29 at 9 07 46 AM

This has now also established the colors we will be using for before and after annotation fix.

Before Color Hex: #3e4e9d After Color Hex: #888a8d

So, now the thought is that I will plot the density graph of annotations before and after underneath of the graph. However, I need to now decide whether to only use annotations that are supported by re-assembly with RNA-seq, or all assemblies. Right now I'm thinking those assemblies which actually worked, which would require running the scripts in /scratch/jpm73279/04.lncRNA/02.Analysis/34.generate_updated_annotations .

But before I'm able to do this, there are a few things that still need to be fixed

1) Merge issues need to be worked out. Right not I'm still merging gene features that should clearly not go together. This happens both before, and after the final merge, making me think something upstream is broken. Really wish the verbose option was working here.

2) Splitting output files has not been looked at in enough detail. I need to look at splits to make sure we're categorizing the file splits correctly. For instance Possible_novel_gene,gene:Zm00001d025749 would be categorized into the "merge" class of genes when it should be a probably extension, or two serperate genes due to the issues outlined above.

Jome0169 commented 4 years ago

Regions to check for merge command: 10:135665751..135708250

6/30/2020

Realized yesterday that a big issue of the final step is the merger. The merger presumes that no genes will be overlapping one another which is incorrect, and a horrible assumption on small genomes. To fix this, I have employed group_by functions for the sense and anti-sense genes in my script. Hopefully it should work pretty flawlessly

Screen Shot 2020-06-30 at 9 24 01 AM (42.5 Kb)

** Finished this update - thinking of calling the script pretty much good at this point. Going to re-run everything, for both maize, and sorghum, and hopefully get around to the re-running of the transcript assembly step of things.

Seperate Script

Something I realized was an issue is that the seperation script after the annotion update is incorrect. This is because this script, name seperate_into_groups.py has a hard coded element in it called gene: in which is presumes.

This is a quick fix that I already solved. Added a second argument that is the "basea_name" for all genes. For instance sorghum is 'sobi'. And damnnnnn - these are looking so clean now.

Going to quickly re-run these on everything else.

Annotations Supported Yes or No

So, a while ago I generate a script to look at the proportion of various annotation types which are backed up by either RNA-seq or ISO-seq data. This analysis in full can be found in here:

dir/scratch/jpm73279/04.lncRNA/02.Analysis/34.generate_updated_annotations script: Assemble_updated_annotations.snake outputs: If different annotation findings are backed up or not.

The way this script works is pretty basic. For the RNA-seq tissues (leaf,root,ear) assmebles the RNA-seq track using stringtie, and sees if any of the annotations are fully supported by a newly generate transcript.

For iso-seq, it just looks for a read that spans the entire length of the transcript. These are then read into the script Annotation_support_bar_chart.R. Which generates the below bar chart. Overall, I'm not loving what I'm seeing in regards to the number of unsupported hyper large regions there are. A huge bulk, like 90% are unsupported, making me think there's clearly an issue which needs to be addressed. But before going back and fiddaling with Update.py more, I want to look over the above snakescript and see if some of the choices I made are logical.

Screen Shot 2020-06-30 at 2 28 42 PM

Jome0169 commented 4 years ago

I'm not sure what to do about the above issues as the "hyper large' gene class is defined. These genes have a massive issue with potential missed peak calling - so I'm unsure how to go forward and do anything to fix them.

Screen Shot 2020-06-30 at 5 31 22 PM

So, the big issue with hyper large genes I think is that the extension downstream is generally incorrect. Often times these marks of elongation push pretty far after where the peak appears to naturally end.

Problem Area: 10 39109229 39113847

For example: Screen Shot 2020-07-01 at 8 24 24 AM

This is extended too far, even when there is no RNA_seq read support for that leftover sections.

Ended up correcting this by needing a certain number (20) RNA_seq reads supporting a left over region used in extension. Also updated code so that intersection/merger of overlapping promoters can be further classified into one of the "annotation classes" based off how much greater the promoter mark is than the current TSSs.

Re-running everything... again -_-

Jome0169 commented 4 years ago

Regenerating Ven Diagrams

Currently, the snakemake pipeline is designed to generate the ven diagrams for overlapping annotation types between tissues. This works reasonably well, but the ven diagram is nowhere near being good enough for publication. So - I am going to have to rewrite it as per usual.

Also - moving forward - going to write everything together after this most recent batch is finished running through Assemble_updated_annotations.snake . Even if things must be re-run, that's fine. But, for right now I would like to process these output files more heavily, stitch them together, and generate the figure which will go into the actualy paper.

Right now - the plan is that regions of identical annotation classes between tissues will be coutned before assembly, and the updated distribution charts will focus on those which passed the assembly phase. Unsure where to put the ISO-seq vs RNA-seq validation stuff. Maybe somewhere else for the time being?

After update - values look like this Screen Shot 2020-07-01 at 2 14 47 PM

Jome0169 commented 4 years ago

7/2/2020

Generate Ven Diagrams of Annotation overlap between tissue types

Worked yesterday afternoon and got the ven-diagrams for overlapping annotation types found between tissues purring. These files are stored in /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure3_5/imgs

Ven diagram counts are generated by the files generate_ven_diagrams_annotation_counts.py which calculates the counts shared between bed files, and generate_ven_counts.sh which runs the above python script for every set of fixed annotation types. Example of command run in generate_ven_counts.sh:

python generate_ven_diagrams_annotation_counts.py -bed1 \
00.data/counts_each_tis_bed/ear_annotation_novel_genes.bed \
00.data/counts_each_tis_bed/leaf_annotation_novel_genes.bed \
00.data/counts_each_tis_bed/root_annotation_novel_genes.bed \
-header_name ear leaf root > 01.generate_annotation_ven_diagrams/novel_counts.txt

Rscript to generate figures: generate_ven_diagram_annotation_overlaps.R found in dir: /Users/feilab/Projects/03.ncRNA_project/MendietaPablo_Annotation_Paper_scripts/R_scripts

Bed files used in these ven diagrams are the set NOT validated by transcriptome information. These bed files are originally from the dir: 2020-07-01_counts_annotations_each_tis .

generate_dist_chart_beds

Distribution Chart before/after annotation re-assembly

So, to get this going, I have to grab the

Well - I fucked up in a big god damn way. Messed up the way I was inputting files into the "validation" step using ISO-seq / RNA-seq. Tons of these annotations I'm calling are actually not validated at all. So that's fucking awesome.....

Screen Shot 2020-07-01 at 2 14 47 PM

Jome0169 commented 4 years ago

7/3/2020

Fixing supported/unsupported

Spent yesterday looking at the unsupported class. What this class actually looks like is below.

UpdateAnnotation_support

So overall a much larger group of unsupported loci, which is NOT what I wanted to see for obvious reasons. The reason I didn't catch this error sooner was because I was incorrectly reading in the list of supported and unsupported extensions/annotation classes in to the R script annotation_update_support_isoseq_RNAseq.R. After correcting this mistake, it's clear there are many more unsupported annotations.

However, spending time and going back to look at these calls using this command to generate the list of failing genes

cat WGS_passed_failed_list/major_extension_genes_failing_list.txt iso_seq_passed_failed_list/major_extension_genes_failing_list.txt | sort | uniq -c | awk '$1 == 2 {print $2}' > major_extension_failed_both.txt

python pull_list_bed.py major_extension_failed_both.txt 00.data/bed_files/major_extension_genes.renamed.bed > true_failed.bed

Many of the calls do look really supported, for example Screen Shot 2020-07-03 at 8 33 03 AM Screen Shot 2020-07-03 at 8 32 53 AM Screen Shot 2020-07-03 at 8 32 18 AM

Which makes me thing the real issue is how I'm calling a 'good' update and a 'bad' update. Originally I was expecting the corrected annotation to span the entire length of the bed interval called. This is incorrect as you can see from the screenshot above. So, I have to fix this, and go back to the assembly directory section /scratch/jpm73279/04.lncRNA/02.Analysis/34.generate_updated_annotations/Assemble_updated_annotations.snake and edit the way that analysis is done.

Somehow have to compare just the old annotations versus the new annotations and see if the new annotation is actually larger than the old annotation. Fun stuff, fun stuff.

Generating Density DIagrams

Also spent yesterday getting the density diagrams re-generated. The way this works is the script generate_dist_chart_beds.sh pulls the before/after annotation corrections that worked using the passing list from the directory 00.data/passing_failing_updates/WGS_passed_failed_list/. These are then read into the R script: generate_before_after_denisty_charts.R, and generates the diagrams below.

Screen Shot 2020-07-03 at 8 38 10 AM

So overall, looks nice, and will look even better after some cleaning up in illustrator. One issue to fix currently is the 'minor extension' class if failing for some reaoson. Unsure why this is.

Jome0169 commented 4 years ago

7/6/2020

Working on finishing a script which takes in the information about transcripts generated from WGS data, and compares the assemblies with updated parameters versus those that existed before annotation correction. If the updates assemblies are > the old assembleies, we pass those, and output their updated coordinates as outline by the assemblies.

Right now this script doesn't work on the 'merged' class, and it also appears to be dropping a few loci which didn't assemble a transcript at all.

This has now been fixed. This script works on both the iso_seq and the WGS data. Looks like it worked pretty well overall, the graph of supported features looks far more reasonable than it previously did. So, I will call that successful.

UpdateAnnotation_support

Now, when it comes to plotting the diestrubutions, I now have the issue of having both and ISO-seq BED files with updated coordinates, as well as a WGS updated coordinates file. I guess the only way I can think of doing this is if the LOCI in question has a transcript in either WGS or ISO-seq, use that length. If it doesn't have a length in either pass, and if it has both.... well... I'm less sure what to do about this. Mean length could work? I think that's what I will do.

One other issue here is tha in the distribution charts, I don't have any novel distribution chart as of yet. I think I will likely need to go back and correct Assemble_updated_annotations.snake to work on novel regions independently. Basically not using the compare_old_to_new annotations, as there are clearly not old annotations to compare to. I'll also need to make some sort of smaller call in R to pull a random list of genes from maybe a few different quantiles to see where these novel genes stack up in length in comparison to the other classes.

log2_extension_class log2_merged_class

Updates distribution charts. Look good. Just need to generate one for the novel annotations, and push that through. Also need to generate the before/after metaplots.

Also, now need to focus on generating the list of novel regions which have passed. These are going to require different thresholds than the previous annotations for reasons mentioned above.

Jome0169 commented 4 years ago

7/7/2020

Realized that the was I was naming novel regions, wasn't going to work due to the way in which I was generating a key file which required a unique input name. Right now all novel regions are named identically with Possible novel regions . So to get the actual bounds of these regions which are passing/supported, I had to rewrite both the rename script, as well as the compare novel regions/intersetion/update script.

novel rename script" rename_novel_beds.py compate_novel_region_assemblies: compare_novel_assemblies_vs_novel_region_annotations.py

python /Users/feilab/Projects/03.ncRNA_project/MendietaPablo_Annotation_Paper_scripts/python_scripts/rename_novel_beds.py -bed novel_genes.bed -base novel_region -o test.renamed.bed

python compare_novel_assemblies_vs_novel_region_annotations.py -key_file 00.data/test.novel.key.txt -new_annot 00.data/novel_genes.intersect.new.assemblies.bed -inter_type gtf -o test_NOVEL

Manage to fix this without having to write a ton of stuff by adding an if statment to the snakemake pipeline as pointed out below

Screen Shot 2020-07-07 at 10 13 41 AM

Easy peasy

Before/After metaplots

In the process of generating the before/after metaplots, came upon an irritating realization. It looks like the filter step I generated in the metaplot command requries eitehr leaf or root or ear to be in these file names. This is only problematic because these before/after annotation update files are NOT tissue specific anymore. These are the files I used i

Jome0169 commented 4 years ago

7/8/2020

Metaplots

Generated metaplots yesterday of the before/after annotation update. They look... odd? I'm having a hard time placing exactly what is happening here. But overall it looks like the annotations are "better" in regards to the K4me3 and the K56ac. But in regards to the elongation marks, it looks pretty odd right? It looks like there is less blead over in the TES, but there is still quite a bit of upstream elongation mods. There is also a clear issue with the scaling which is why K4me3/56ac take over the length of a gene body region. Probably a relatively small region overall.

Screen Shot 2020-07-07 at 3 37 05 PM

Going to spend some time working on seeing if generating plots focusing on just the TSS and TES if that looks better, and doesn't have the issues of stretching these regions too heavily.

Talked to bob, he says overall looks pretty good, possibly sort them by size which could work. But right now, going to work on getting the line width at the top increased, and move on from that .

Jome0169 commented 4 years ago

7/9/2020

Putting together figures

Increased line size in metaplots, and converted them to PDFs for actual import.editing in illustartor. Also

Finishing species runs

I've been having issues finishing the asparagus genome run, as when I try and finalize it during the gather_chrom step, I'm greated with the error Argument list too long . Appears that unix has a hard time dealing with ~20,000 arguments. Frusturating because I already chunked this down and filtered down scaffolds <20000 . So, looks like I'll just increase stringency to scaffoles greater than 100000 so this can actually complete its run.

Getting B dichanis genome running - pulling genes from GTF

awk '{print $3}' Bdistachyon_556_v3.2.gene.gff3 | sort | uniq

CDS
distachyon
five_prime_UTR
gene
mRNA
three_prime_UTR
Jome0169 commented 4 years ago

7/10/2020

Today I'm planning on finishing some figures up, and generating the final set of counts for the multiple species annotation stuff. Overall the stapled together figures are looking pretty nice.

mkdir -p 00.data/{bed_files,chip_seq/{H3K36me3,H3K4me1,H3K4me3,H3K56ac,input},rna/{raw},reference}

Jome0169 commented 4 years ago

7/13/2020

Something I didn't write very well about on Friday was I got the counts of different annotation classes found in different species down. Generated a script to read in the annotation count files, and run them. Currently named: annotation_counts_various_species.R. This figure works in the directory dir"/Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure_6 and expects the given type of input directory.

Screen Shot 2020-07-13 at 8 37 38 AM

Counts of annotation issues multi_species_counts

Looking at the counts of various annotations above, it's clear that asapargus has by far the worst genome. This makes sense given that it's unpublished, and doesn't look to be a very polished genome.

Jome0169 commented 4 years ago

Final figures looking sharrrp Final_extension Final_merger Final_novel

Jome0169 commented 4 years ago

7/14/2020

Working on generating a final figure 6. Currently I have the number of incorrect annotations of each class. One thought is doing number of genes on X-axis vs number of correction on Y axis. Doubt this will yiled anything interesting though. When in reality what I'm sure I'll find is that rather

▶ find . -path "*00.data/bed_fi*/*.bed" -exec sh -c "wc -l {}" \;
27433 ./03.P_vulgaris/00.data/bed_file/Pvulgaris_442_v2.1.gene.bed
52872 ./01.glycine_max/00.data/bed_files/Gmax_508_Wm82.a4.v1.gene.bed
27991 ./05.A_officialnalis/00.data/bed_files/AsparagusCHR_V1.1.gene.bed
32439 ./07.B_dichanis/00.data/bed_files/Bdistachyon_556_v3.2.gene.bed
35441 ./04.S_bicolor/00.data/bed_files/Sbicolorv5.1.gene.bed
38334 ./02.S_verdis/00.data/bed_file/Sviridis_500_v2.1.gene.bed

made a couple of graphs looking a host of stuff. Namely looking at genome_size_vs_annotation_error_all_counts genome_size_vs_annotation_error_by_type multi_species_counts

gene_number_vs_annotation_error_by_type gene_number_vs_annotation_error

Jome0169 commented 4 years ago

7/15/2020

Just talked to bob about what to do next as I am stuck on these figures. He made the point of making a figure about these super large genes in the maize genome - and see what their deal is.

So, some thoughts - pull out the top 1% of super large annotated genes and figure out what the deal is with them. Are they in single copy? Do they just have a ton of TE proliferations in them, and if so, which family? Across species analysis here might also be interesting.

Also a possibility is looking at hte TE-Iness of some of the novel gene class. Are these novel regions more TE like than normal genes, and is this unique? Issue being this will hit on a bunch of stuff already done by brandon gauts group in California.

Analysis on top 1% or top .01% of large genes in maize

1) - [x] Generate a finalized updated annotation with the bed files of all of my corrected loci, and all of the non-corrected loci. Make either a R script, or python script to pull out that top 1% and see what proportion of these genes are ones that I picked up on. 2) - [ ] Once these genes are isolated - go through, retrieve GTF file sequences for them (which is going to be a huge pain in the ass), and start comparing attributes such as a) - [ ] gene length with introns as compared to other genes b) - [ ]Length of gene with no introns compared

Are these significant differences from different quartiles of other large genes in maize?

3) - [ ] Possible check conservation? So - pull out these genes, take only the exons, and start blasting them to other species. Do we find these genes in sorghum? Yes? No?

Since I'm unsure if this will be its own figure, or if it will be an attachment to figure 6 (doesn't really seem very logical to me), I'm going to generate a new folder in my figure directory called dun dun dun DUNNNN Figure_7

Generating a distribution of annotations

Generate a quick set of annotations from the original V4.38 which were not modified using a simple bedtools command found in the script generate_non_updated_list.sh

❯ cat generate_non_updated_list.sh
bedtools intersect -a Zea_mays.AGPv4.38.chr.all.bed \
-b hyper_large_updated.bed \
major_extension_updated.bed \
merged_updated.bed \
minor_extension_updated.bed \
-v -s -wa > Zea_mays.AGPv4.38.un_modified_genes.bed

Loaded all the bed files into the script gene_size_analysis\ .R and generated this violin plot looking at the BED gene sizes (only start and stop) as compared to the annotations which were not altered. Distribution looks like this:

disiribution_of_annotation_lengths

Kinda dissapointing that some of the super large ones I was originally annotating didn't show up, or at least didn't appear to show up. Probably worth a cursory glance to see where these went. For instance I KNOW there was a gene that was looking like 190 KB on chromosome 10 that I consistenly looked at. Fairly confident thought that I may have weeded this out in the processes where I was setting distance criterion for when the algorithm was searching for promoter like sequences.

Confident however that it's not worth venturing back considering how much work we've done already, as well as this would require re-generating alot of my figures.

What proportion of the top 5% of largest genes come from me? Top 1%?

To do this, going to assign these genes based off of their genes lengths, and sort by the top 1%. Generate a bar chart of the proportion each annotation classes, and turns out, a large proportion of these ultra large genes I found are in that top 1% of genes by size.

proportion_of_top_1_percent

Jome0169 commented 4 years ago

Going over my top performers, I'm finding alot of worrying trends. A bunch of things look like they're just errors which have somehow squeezed by my error catching process. For instance,

Screen Shot 2020-07-16 at 11 02 27 AM Screen Shot 2020-07-16 at 11 03 07 AM

http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=9%3A96769598..96862756&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CBED_tis_ear_mod_H3K4me1_group_broad_merged_chrom_10_0%2CBED_tis_ear_mod_H3K4me3_group_narrow_merged_chrom_10_1%2CBED_tis_ear_mod_H3K56ac_group_narrow_merged_chrom_10_3%2CGTF_hyper_large_genes_assembly_merged.gtf_0%2CBED_tis_ear_mod_H3K36me3_group_broad_merged_chrom_10_2%2CGTF_major_extension_genes_assembly_merged.gtf_1%2CGTF_merged_genes_assembly_merged.gtf_2%2CGTF_minor_extension_genes_assembly_merged.gtf_3%2CGTF_novel_genes_assembly_merged.gtf_4%2CBED_top_001_percent_genes_5%2CBED_hyper_large_updated_6%2CBED_major_extension_updated_7%2CBED_minor_extension_updated_8&highlight=

All of which are super frusturating. So this tells me a few things - 1) the annotation script really isn't working as well as I was hoping it would by this point which is incredibly upsetting, and 2) My assembly/validation script is also not working for some reason. It's saying that a bunch of regions passed the GTF section which didn't in fact.

I think at this point I might need to fix the annotation script which is really irritating. The only way I can think of really doing that would be to either 1) go back to peak calling which would make me blow my brains out, or two, attempt to program in some type of utilization of bedgraphs or bigiwigs to check that when I'm extending a gene, there is support of elongation throughout the gene body, and I'm not just reaching into nothinginess. This is going to be a huge pain.

This is going to require two fixes - arguably three fixes.

Interesting - looks like the error catching in the passing/failing is actually working pretty well - or at very least it's doing a good job updating the length of the bed file for where it's backed up by the GTF file. So - maybe something is then incorrect in the way it's handeling either the ISO-seq intersect data in bed format - or I pulled the correct regions incorrectly.

Realized I was taking the incorrect set of "updated" bed files. In the distribution chart about I took the mean transcript size between ISO-seq and RNA-seq assemblies, and never wrote them to a bed file. So I was reading in the bed file of non - backed up updated regions.

SECTION 1 IS DONE. FIXED. GRAPH IS REGENERATED.

Jome0169 commented 4 years ago

7/18/2020

Had our meeting yesterday and got some recommendations about possible other analysis to do. People were saying I could do a bunch of evolution stuff that honestly just doesn't sound that interesting, and would take quite a bit of work. But bob has a good idea that I liked.

Taking the list of ultra large genes - look and see if they have a higher rate than normal of two/three ATAC-seq peaks within the gene. Do ultra large genes have some sort of regulatory aspect within them as compared to smaller genes? Should be pretty testable as well considering we have the ATAC-seq.

7/20/2020

ReRan everything in maize - but realized I was getting issues when re-assembling regions due to lines such as Possible_novel_gene_major_extended,gene:Zm00001d030572_major_extended_hyper_large being passed to the merged genes regions. So I need to fix how these regions are cateforized in the script seperate_into_groups.py and as well compare_old_new_annotations.py

Fix should be pretty simple - comprising a simple boolean check, and possibly a counter f the number of times "novel" pops up in the string list in seperate.py.

FIXED. EASY.

7/23/2020

Distribution charts have been fixed, and the coordinates have been changed. Ven diagrams have been updated.

Error has been calculated in the form of passing/failing ISO-seq/WGS assembly. Stil the largest issue is minor extesnsion genes, but that can't really be helped based off the way the algortihm works. Overall I'm happy with it, and going to leave it as it stands.

UpdateAnnotation_support

In the process of re-generatings metaplots found another error. Apparently the grep commands used in my scripts. For whatever reason the command grep -wf has just decided to not work like it was. Can't tell if this is an issue in the conda env or not. Going to replace it with a script.

  1 set -euxo pipefail
  2
  3 ### HYPER LARGE ####
  4 #pull passing list
  5 python pull_list_bed.py 02.generate_distribution_charts/hyper_large_genes.passed.txt \
  6 00.data/counts_each_tis_bed/leaf_annotation_hyper_large_genes.bed \
  7 00.data/counts_each_tis_bed/leaf_annotation_hyper_large_genes.bed > 03.metaplots_before_after/leaf_passing_after_hyper_large_genes.bed
  8
  9 #Reformat key file only passing KEYs. Silly but I had to do this quikc.
 10 python pull_keys_from_bed.py \
 11 03.metaplots_before_after/leaf_passing_after_hyper_large_genes.bed \
 12 02.generate_distribution_charts/hyper_large_genes.passed.txt > 03.metaplots_before_after/leaf_hyper_large_genes.key.txt
 13
 14
 15 #Pull original annotations
 16 python pull_passing_bed.py 03.metaplots_before_after/leaf_hyper_large_genes.key.txt \
 17 00.data/bed_files/Zea_mays.AGPv4.38.chr.all.bed \
 18 00.data/bed_files/hyper_large_genes.renamed.bed \
 19 03.metaplots_before_after/leaf_hyper_large_metaplots
 20

Ended up writing a bunch of little scripts that I at least know work. Damn grep - I swear it has the most variable behavrior. Anyways metaplots are re-running, and should be fairly quick once these are done. This means I will now be moving onto the "large" gene region analyais.

Analysis of Super Large Genes in the Maize Genome

So - now we finally get to the point where I'm able to start analyzing the super large genes in the maize genome. I think the first analysis will be the intron size of these regions, and do some inital descriptive stats on them. Asking questions like - are all introns equally big? Do these genes just encode huge ass proteins, rather than just having a super large intron?

Steps

Jome0169 commented 3 years ago

So - I’m not sure the internal regulatory analysis is going to work. Many of the genes which have these super large introns don’t seem to have much going on in regards to internal accessibility. A couple of examples 12:57 http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=10%3A39057251..39146400&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight= 12:57 http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=6%3A36044601..36133750&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight= 12:57 http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=10%3A74037301..74126450&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight= 12:58 http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=10%3A39054351..39143500&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight= 12:58 http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=2%3A233983853..234072370&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight= 12:59 And here is a region just showing that the ATAC-seq i’ve overlaid is actually good http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=2%3A16253539..16527846&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight= 1:02 I have noticed a trend where some do appear to have more downstream peaks. http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=5%3A88485921..88521580&tracks=rnas%2Cgenes%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2%2CRNA_B73_leaf_Aligned.sortedByCoord.out%2CB73Leaf1%2CB73Leaf2%2CB73Root1%2CB73Ear1%2CB73Leaf-Final%2CB73Ear-Final%2CB73Root-Final&highlight=

Jome0169 commented 3 years ago

09/02/2020

Been working on writing the results up for the last few weeks. One analysis I realized I didn't do was looking at the merged gene class. One value that still needs to be reported is the total number of genes in which occur in the average gene merger.

Location: /Users/feilab/Projects/03.ncRNA_project/03.Figures/Figure3_5/04.merged_gene_class_analysis R_script_name: merged_gene_class_analysis.R

Realized dealing with the string names was going to be a pain in the but in R - so instead made a quick python script to read in the bed file of merged gene names, take the names, clean them of any additional modification created from my updated.py script (such as _minor_extended), and output just a comma seperate list of the two names.

Script command: python ../merged_class_metrics.py merged_genes.bed > merged_genes.cleaned_names.txt

Output:

gene:Zm00001d031870,gene:Zm00001d031868
gene:Zm00001d031956,gene:Zm00001d031957
gene:Zm00001d031994,gene:Zm00001d031993
gene:Zm00001d032029,gene:Zm00001d032030
gene:Zm00001d032170,gene:Zm00001d032168

Final tally of numer of genes merged in each grouping;

summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 3
  number_merged_genes     n   freq
                <dbl> <int>  <dbl>
1                   2   357 0.983 
2                   3     6 0.0165

The last quick analysis I now want to do is look at the proprtion of gene pairs/triplets which have identical GO names or gene descriptions. Possibly pointing to the fact that these are all single genes.

command to isolate gene_IDs with descriptions: bioawk -c bed '{print $name, $blockcount}' Zea_mays.AGPv4.38.chr.all.bed > Zea_mays.AGPv4.38.gene_name.description.txt

gene:Zm00001d009275     ID=gene:Zm00001d009275;biotype=protein_coding;logic_name=maker_gene;gene_id=Zm00001d009275
gene:Zm00001d009276     ID=gene:Zm00001d009276;biotype=protein_coding;logic_name=maker_gene;gene_id=Zm00001d009276
gene:Zm00001d009277     description=Retrovirus-related Pol polyprotein LINE-1;ID=gene:Zm00001d009277;biotype=protein_coding;logic_name=maker_gene;gene_id=Zm00001d0092
77

Loaded all files into the R script: Turns out 30% of merged genes have identical descriptions in their BED filesm or what we would consider their "GO TERM"

# A tibble: 3 x 3
  identical_descriptions     n  freq
  <lgl>                  <int> <dbl>
1 FALSE                    154 0.424
2 TRUE                     108 0.298
3 NA                       101 0.278
Jome0169 commented 3 years ago

9/25/20

Generation of Toy Metaplots

Going to add a simple metaplot around my toy transcribed gene exmaple below. Red box denotes where the metaplot and profiles will go. To do this I'm just going to be sampling the top 10% of transcribed genes in the genome to show the trend around them. Just in an attemp to show that "yeah, this is a representative example of an expressed gene in the maize genome."

Screen Shot 2020-09-23 at 9 59 50 AM

Jome0169 commented 3 years ago

Generation of tandem duplicate BEd file:

python generate_tandem_dup_bed.py -tandem_dup 00.data/tandem_duplicates/maize_tandem_duplicates.csv -bed 00.data/bed_files/Zea_mays.AGPv4.38.chr.all.bed | bedtools sort -i - | less -S > 00.data/tandem_duplicates/tandem_duplicate_formatted.bed

Jome0169 commented 3 years ago

In order to further explore the issue of whether the merged class contains tandem duplicates and whether we're merging these, I did an all vs all protein blast of proteins in the maize genome.

From this I used the script compare_tande_duplicates_blast_dotplot which takes in a list of comma pairs on each line, as well as a fasta file - extracts each pairs sequence, runs a blast analysis, as well as generates a dot plot using the R script dot_plot_generation.R .

Running Script: python compare_tande_duplicates_blast_dotplot.py -file_list merged_genes.cleaned_names.txt -fasta ../00.data/fasta/Zea_mays.AGPv4.pep.all.fa -out mergers_pre_assembly_pep

Generally most dotplots show no evidenceo f tandem duplicatn:

geneZm00001d026367_geneZm00001d026366_comparison_dotplot geneZm00001d026391_geneZm00001d026390_comparison_dotplot geneZm00001d026395_geneZm00001d026396_comparison_dotplot geneZm00001d026558_geneZm00001d026557_comparison_dotplot geneZm00001d026670_geneZm00001d026671_comparison_dotplot

Jome0169 commented 3 years ago

Those taht generally show a high level of tandem duplication are almost always a pair of genes which are clearly different iso-forms, but annotated as a single gene. For instance: http://epigenome.genetics.uga.edu/JBrowse/?data=maize_v4&loc=2%3A216440516..216443530&tracks=rnas%2Cgenes%2CB73Leaf1%2CB73Leaf2%2CChip_unique_B73_leaf_H3K36me3_rep1%2CChip_unique_B73_leaf_H3K36me3_rep2%2CChip_unique_B73_leaf_H3K4me1_rep1%2CChip_unique_B73_leaf_H3K4me1_rep2%2CChip_unique_B73_leaf_H3K4me3_rep1%2CChip_unique_B73_leaf_H3K4me3_rep2%2CChip_unique_B73_leaf_H3K56ac_rep1%2CChip_unique_B73_leaf_H3K56ac_rep2&highlight=