RobertsLab / resources

https://robertslab.github.io/resources/
18 stars 11 forks source link

Data analysis method for general methylation trends #685

Closed yaaminiv closed 5 years ago

yaaminiv commented 5 years ago

R Markdown file

I characterized genomic locations for all CpGs in the C. virginica genome, all 5x CpGs, methylated CpGs, sparsely methylated CpGs, unmethylated CpGs, and DML. I'd like to take the information in this summary table and run a statistical analysis to determine if the distribution of CpGs in various categories are the same (ex. methylated CpGs vs. DML).

This has been done in previous papers by Claire and Mac with a chi-squared test. I used this method to see if my distributions were statistically different, but I'm actually not sure if chi-squared is the best way to test this. For example, I would like the distribution of methylated CpGs to serve as a background for the DML distribution in the genome, but when I use chisq.test my expected values don't reflect that "background" distribution. Looking at Claire's bioRXiv paper there was also some post-hoc test that was conducted, but I can't figure out what she did in this code.

@mgavery Do you remember why you used a chi-squared test to compare distributions in your paper? Maybe I'm just not interpreting the test correctly?

mgavery commented 5 years ago

I emailed with Loveday Conquest about the suitability of using the Chi-Square test. Her comment was: "Chisquare tests on count data are appropriate when the process is 'random events in space or time'. The events need to be independent. After some discussion she thought it was an appropriate test.

For the 'expected' counts, I was using all CG's in a particular category and 'observed' was the number of methylated CG in that category. I haven't used chisq.test to know if it's getting used correctly or not. I was doing calculations in excel ;)

Your summary sheet shows more methylatedCpG than totalCpG for some categories - how are you getting these numbers?

sr320 commented 5 years ago

Your summary sheet shows more methylatedCpG than totalCpG for some categories - how are you getting these numbers

I would not worry as much about the statistical test until you are super super confident that the numbers are accurate.

yaaminiv commented 5 years ago

Your summary sheet shows more methylatedCpG than totalCpG for some categories - how are you getting these numbers

@mgavery I used intersectBED to count the number of overlaps. Here are the Jupiter notebooks that explain how I got the information:

Intersection between CG motifs (totalCpG) and various genomic features

Intersection between methylated CpGs and various genomic features

I don't think I'm double counting any CpGs so I'm not sure why there's a discrepancy (but what do I know I'm generally wrong about this stuff)...?

sr320 commented 5 years ago

Let's try to troubleshoot...

Q1: How many CGs are in the genome? Q2: How do you know (HDYK) this is accurate (generally this is achieved by arriving at an answer with an independent method and/or visualization. Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet).

yaaminiv commented 5 years ago

Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet).

There are 60,201 genes and 60,195 mRNA-CG motif overlaps.

Screen Shot 2019-04-30 at 3 56 38 PM Screen Shot 2019-04-30 at 3 55 21 PM

Q1: How many CGs are in the genome?

I counted the lines of the CG motif file and got 14,458,703.

Screen Shot 2019-04-30 at 3 59 11 PM

Q2: How do you know (HDYK) this is accurate (generally this is achieved by arriving at an answer with an independent method and/or visualization.

I guess I don't know if this is accurate or not. I got the totalCpG category from the CG motifs track, but the all5xCpG and methylatedCpG information came from the concatenation file. Would the fact that the CG motifs are currently 2 bp and the methylatedCpG file has 1 bp loci impact how those intersections are counted?

Screen Shot 2019-04-30 at 4 01 36 PM Screen Shot 2019-04-30 at 4 01 13 PM
sr320 commented 5 years ago

Q2 - It would appear you are basing this on a file, maybe one you did not create? How would you determine # of CGs if file was not provided?

Q3 - What is the basis of that number of genes? - are you just counting lines in a file? Where did this come from? How can you validate? (thus again HDYK) Also I asked

how many of CGs overlap with these genes

the code is wrong (thus the answer) see https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html

-u | Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.
sr320 commented 5 years ago

Another hint and clarification genes are not the same as mRNA https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Crassostrea_virginica/100/ On Apr 30, 2019, 4:04 PM -0700, Yaamini Venkataraman notifications@github.com, wrote:

Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet). There are 60,201 genes and 60,195 mRNA-CG motif overlaps. Q1: How many CGs are in the genome? I counted the lines of the CG motif file and got 14,458,703. Q2: How do you know (HDYK) this is accurate (generally this is achieved by arriving at an answer with an independent method and/or visualization. I guess I don't know if this is accurate or not. I got the totalCpG category from the CG motifs track, but the all5xCpG and methylatedCpG information came from the concatenation file. Would the fact that the CG motifs are currently 2 bp and the methylatedCpG file has 1 bp loci impact how those intersections are counted? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

yaaminiv commented 5 years ago

genes are not the same as mRNA

I think I had it somewhere in my notes to just assume for that the mRNA file was similar to the gene file. Never mind!

Q3 - What is the basis of that number of genes? - are you just counting lines in a file? Where did this come from? How can you validate?

Based on the NCBI link, there are 39,493 genes.

Q2 - It would appear you are basing this on a file, maybe one you did not create? How would you determine # of CGs if file was not provided?

I am basing this on a file that I did not create. I could go through the C. virginica genome file and grep all CpGs then count...?

sr320 commented 5 years ago

Agree, I would grep (actually fgrep, faster if not using regular expressions) On Apr 30, 2019, 5:09 PM -0700, RobertsLab/resources reply@reply.github.com, wrote:

grep all CpGs then count.

kubu4 commented 5 years ago

Reminder, if a CG is split across lines of FastA, grep won't pick it up.

On Tue, Apr 30, 2019, 17:16 Steven Roberts notifications@github.com wrote:

Agree, I would grep (actually fgrep, faster if not using regular expressions) On Apr 30, 2019, 5:09 PM -0700, RobertsLab/resources < reply@reply.github.com>, wrote:

grep all CpGs then count.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/685#issuecomment-488159750, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCOCOBKQ3D75CPN43L6AHDPTDOMVANCNFSM4HJIQLCQ .

kubu4 commented 5 years ago

Also, make sure FastA headers don't contain any CGs.

On Tue, Apr 30, 2019, 17:23 Sam White samuel.j.white@gmail.com wrote:

Reminder, if a CG is split across lines of FastA, grep won't pick it up.

On Tue, Apr 30, 2019, 17:16 Steven Roberts notifications@github.com wrote:

Agree, I would grep (actually fgrep, faster if not using regular expressions) On Apr 30, 2019, 5:09 PM -0700, RobertsLab/resources < reply@reply.github.com>, wrote:

grep all CpGs then count.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/685#issuecomment-488159750, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCOCOBKQ3D75CPN43L6AHDPTDOMVANCNFSM4HJIQLCQ .

yaaminiv commented 5 years ago

@kubu4 How can I ensure that I pick up a CG that's split across lines? Also, does it make sense to then remove the headers before using fgrep?

sr320 commented 5 years ago

@yaaminiv -

How can I ensure that I pick up a CG that's split across lines? Also, does it make sense to then remove the headers before using fgrep?

not the point.

Just run fgrep on the file - and see what you get....

yaaminiv commented 5 years ago

@sr320 I get 14,277,725 which is different than the 14,458,703 from the CG motif file.

Screen Shot 2019-05-01 at 4 03 04 PM
sr320 commented 5 years ago

Seems very reasonable given the caveats Sam has mention.

For the sake of moving forward you can assume the CG motif numbers are correct. And go ahead and answer Q3.

It of course would be great in the future to independently verify CG motifs.

On May 1, 2019, 4:04 PM -0700, Yaamini Venkataraman notifications@github.com, wrote:

@sr320 I get 14,277,725 which is different than the 14,458,703 from the CG motif file. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

yaaminiv commented 5 years ago

Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet).

I know what intersectBed code I could use to answer this question, but how do I actually get the gene start and stop positions in a tabular format? On this NCBI page I see a list of 39489 genes (which is not the same as the 39,493 genes listed in the Annotation Report).

Is there a way to modify the FastA file itself that I'm not thinking of?

sr320 commented 5 years ago

This is not exactly super straightforward.... as you note

But NCBI does provide a GFF --- see https://d.pr/i/LXMZrn

Here is a bit

##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-date 14 September 2017
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565
NC_035780.1 RefSeq  region  1   65668440    .   +   .   ID=id0;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample
NC_035780.1 Gnomon  gene    13578   14594   .   +   .   ID=gene0;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
NC_035780.1 Gnomon  lnc_RNA 13578   14594   .   +   .   ID=rna0;Parent=gene0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon  exon    13578   13603   .   +   .   ID=id1;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon  exon    14237   14290   .   +   .   ID=id2;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon  exon    14557   14594   .   +   .   ID=id3;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon  gene    28961   33324   .   +   .   ID=gene1;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding
NC_035780.1 Gnomon  mRNA    28961   33324   .   +   .   ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon  exon    28961   29073   .   +   .   ID=id4;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon  exon    30524   31557   .   +   .   ID=id5;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon  exon    31736   31887   .   +   .   ID=id6;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon  exon    31977   32565   .   +   .   ID=id7;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon  exon    32959   33324   .   +   .   ID=id8;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon  CDS 30535   31557   .   +   0   ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon  CDS 31736   31887   .   +   0   ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon  CDS 31977   32565   .   +   1   ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon  CDS 32959   33204   .   +   0   ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon  gene    43111   66897   .   -   .   ID=gene2;Dbxref=GeneID:111110729;Name=LOC111110729;gbkey=Gene;gene=LOC111110729;gene_biotype=protein_coding
NC_035780.1 Gnomon  mRNA    43111   66897   .   -   .   ID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon  exon    66869   66897   .   -   .   ID=id9;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon  exon    64123   64334   .   -   .   ID=id10;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon  exon    43111   44358   .   -   .   ID=id11;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon  CDS 64123   64219   .   -   0   ID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X1;protein_id=XP_022303032.1
NC_035780.1 Gnomon  CDS 43262   44358   .   -   2   ID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X1;protein_id=XP_022303032.1
NC_035780.1 Gnomon  mRNA    43111   46506   .   -   .   ID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1
NC_035780.1 Gnomon  exon    45913   46506   .   -   .   ID=id12;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1
NC_035780.1 Gnomon  exon    43111   44358   .   -   .   ID=id13;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1
NC_035780.1 Gnomon  CDS 45913   45997   .   -   0   ID=cds2;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XP_022303041.1;Name=XP_022303041.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X2;protein_id=XP_022303041.1
NC_035780.1 Gnomon  CDS 43262   44358   .   -   2   ID=cds2;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XP_022303041.1;Name=XP_022303041.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X2;protein_id=XP_022303041.1
NC_035780.1 Gnomon  gene    85606   95254   .   -   .   ID=gene3;Dbxref=GeneID:111112434;Name=LOC111112434;gbkey=Gene;gene=LOC111112434;gene_biotype=protein_coding
NC_035780.1 Gnomon  mRNA    85606   95254   .   -   .   ID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
NC_035780.1 Gnomon  exon    94571   95254   .   -   .   ID=id14;Parent=rna4;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;gbkey=mRNA;gene=LOC111112434;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
NC_035780.1 Gnomon  exon    88423   88589   .   -   .   ID=id15;Parent=rna4;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;gbkey=mRNA;gene=LOC111112434;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1

A good bet would be to grep out Gnomon gene and see if this "looks reasonable".

yaaminiv commented 5 years ago

@sr320 I am unable to access the FTP site in my web browser even though I'm signed into NCBI. How do I access the GFF?

sr320 commented 5 years ago

You do not need to be signed in. What happens when you do this? https://d.pr/i/LXMZrn

yaaminiv commented 5 years ago

Screen Shot 2019-05-08 at 9 47 49 AM

Tried it with Chrome and it works...looks like it's a browser issue.

yaaminiv commented 5 years ago

How do I grep out Gnomon gene? I've tried the following code and still get Gnomon exon, Gnomon mRNA, etc.

!grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \
| grep "gene"  \
> C_virginica-3.0_Gnomon_gene.gff3

I need to somehow specify I only want lines with gene in the third column, so I tried this awk command, which still didn't work.

!grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \
| awk '$3 ~ gene' \
> C_virginica-3.0_Gnomon_gene.gff3
sr320 commented 5 years ago

url to jupyter nb? It is likely easier to collaborate in nb. On May 8, 2019, 10:11 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:

How do I grep out Gnomon gene? I've tried the following code and still get Gnomon exon, Gnomon mRNA, etc. !grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \ | grep "gene" \

C_virginica-3.0_Gnomon_gene.gff3 I need to somehow specify I only want lines with gene in the third column, so I tried this awk command, which still didn't work. !grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \ | awk '$3 ~ gene' \ C_virginica-3.0_Gnomon_gene.gff3 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

yaaminiv commented 5 years ago

https://github.com/fish546-2018/yaamini-virginica/blob/master/notebooks/2019-03-18-Characterizing-CpG-Methylation.ipynb

sr320 commented 5 years ago

grep "Gnomon gene" ?

yaaminiv commented 5 years ago

Tried that! It's looking for Gnomon gene in one column, which it can't find

Screen Shot 2019-05-08 at 10 19 50 AM

sr320 commented 5 years ago

try again - be sure there is a tab

To be extra safe you can just cut out of file and paste - this preserves tab

On May 8, 2019, 10:20 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:

tried that! didn't work! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

yaaminiv commented 5 years ago

preserving the tab worked. used:

!grep "Gnomon gene" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_gene.gff3

Counted 38,929 lines (genes), which is different from the 39,493 listed in the Annotation Report.

yaaminiv commented 5 years ago

Looked at the annotation report more closely...39,493 genes and pseudogenes listed. Subtracting the 667 pseudogenes from this total, I get 38,826 genes. So this number is pretty close to the 38,929 lines I counted.

yaaminiv commented 5 years ago

Counted 7,914,842 CG motif overlaps with the genome:

Screen Shot 2019-05-08 at 10 34 38 AM

This is less than the 14,458,703 I counted in the CG motif file, but more than the 4,304,257 CpGs with 5x coverage.

sr320 commented 5 years ago

but by genome you mean? On May 8, 2019, 10:40 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:

Counted 7,914,842 CG motif overlaps with the genome: — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

yaaminiv commented 5 years ago

but by genome you mean?

I mean the genes I isolated with grep (forgot to change that text)

sr320 commented 5 years ago

Next would be exon versus intron On May 8, 2019, 10:43 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:

but by genome you mean? I mean the genes I isolated with grep (forgot to change that text) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

yaaminiv commented 5 years ago

Next would be exon versus intron

CG motifs with exons and introns? Should I use the exon/intron GFF files that have already been created, and try extracting Gnmon exon from the file I downloaded this morning to compare?

sr320 commented 5 years ago

CG motifs with exons and introns?

What were you thinking?

Should I use the exon/intron GFF files that have already been created, and try extracting Gnmon exon from the file I downloaded this morning to compare?

You should derive an exon and intron track.

yaaminiv commented 5 years ago

CG motifs overlap with exons (track I downloaded): 2323389

CG motifs overlap with introns (track I downloaded): 5297975

You should derive an exon and intron track

CG motifs overlap with exons (track I created): 2330546

I couldn't derive an intron track using grep "Gnomon intron" since there is no intron listing in that third column. I could try subtracting the start and stop positions of exons from individual mRNA to generate an intron track, but would need some way of efficiently pairing the exons with mRNA in code

sr320 commented 5 years ago

What software would you use to create an intron file, or for that matter almost anything to do with genome feature tracks?

yaaminiv commented 5 years ago

@sr320 Created a new issue for trouble using subtractBed to generate an intron track. Will return to this thread once that issue is resolved.

yaaminiv commented 5 years ago

Jupyter notebook

Here's a table with various feature tracks and the number of times the CG motifs overlap with the feature:

Feature Number of Elements CG Motif Overlaps with Feature
Genes 38,929 7,914,842
Intergenic regions 34,557 6,545,363
mRNA 60,201 7,507,167
Exons 731,279 2,330,546
Introns 316,614 5,596,808
Coding sequences 645,335 1,728,032
Non-coding sequences 336,677 12,142,171
Untranslated regions of exons 182,752 602,551
lncRNA 4750 281,715
yaaminiv commented 5 years ago

Your summary sheet shows more methylatedCpG than totalCpG for some categories

Updated my summary table with overlap counts using my newly generated tracks, and I no longer have that problem!

Based on your discussion with Loveday Conquest @mgavery and my inability to find another test that could work, I think I'll stick with the chi-squared tests.