Closed yaaminiv closed 5 years ago
I emailed with Loveday Conquest about the suitability of using the Chi-Square test. Her comment was: "Chisquare tests on count data are appropriate when the process is 'random events in space or time'. The events need to be independent. After some discussion she thought it was an appropriate test.
For the 'expected' counts, I was using all CG's in a particular category and 'observed' was the number of methylated CG in that category. I haven't used chisq.test to know if it's getting used correctly or not. I was doing calculations in excel ;)
Your summary sheet shows more methylatedCpG than totalCpG for some categories - how are you getting these numbers?
Your summary sheet shows more methylatedCpG than totalCpG for some categories - how are you getting these numbers
I would not worry as much about the statistical test until you are super super confident that the numbers are accurate.
Your summary sheet shows more methylatedCpG than totalCpG for some categories - how are you getting these numbers
@mgavery I used intersectBED
to count the number of overlaps. Here are the Jupiter notebooks that explain how I got the information:
Intersection between CG motifs (totalCpG
) and various genomic features
Intersection between methylated CpGs and various genomic features
I don't think I'm double counting any CpGs so I'm not sure why there's a discrepancy (but what do I know I'm generally wrong about this stuff)...?
Let's try to troubleshoot...
Q1: How many CGs are in the genome? Q2: How do you know (HDYK) this is accurate (generally this is achieved by arriving at an answer with an independent method and/or visualization. Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet).
Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet).
There are 60,201 genes and 60,195 mRNA-CG motif overlaps.
Q1: How many CGs are in the genome?
I counted the lines of the CG motif file and got 14,458,703.
Q2: How do you know (HDYK) this is accurate (generally this is achieved by arriving at an answer with an independent method and/or visualization.
I guess I don't know if this is accurate or not. I got the totalCpG
category from the CG motifs track, but the all5xCpG
and methylatedCpG
information came from the concatenation file. Would the fact that the CG motifs are currently 2 bp and the methylatedCpG
file has 1 bp loci impact how those intersections are counted?
Q2 - It would appear you are basing this on a file, maybe one you did not create? How would you determine # of CGs if file was not provided?
Q3 - What is the basis of that number of genes? - are you just counting lines in a file? Where did this come from? How can you validate? (thus again HDYK) Also I asked
how many of CGs overlap with these genes
the code is wrong (thus the answer) see https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html
-u | Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r.
Another hint and clarification genes are not the same as mRNA https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Crassostrea_virginica/100/ On Apr 30, 2019, 4:04 PM -0700, Yaamini Venkataraman notifications@github.com, wrote:
Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet). There are 60,201 genes and 60,195 mRNA-CG motif overlaps. Q1: How many CGs are in the genome? I counted the lines of the CG motif file and got 14,458,703. Q2: How do you know (HDYK) this is accurate (generally this is achieved by arriving at an answer with an independent method and/or visualization. I guess I don't know if this is accurate or not. I got the totalCpG category from the CG motifs track, but the all5xCpG and methylatedCpG information came from the concatenation file. Would the fact that the CG motifs are currently 2 bp and the methylatedCpG file has 1 bp loci impact how those intersections are counted? — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
genes are not the same as mRNA
I think I had it somewhere in my notes to just assume for that the mRNA file was similar to the gene file. Never mind!
Q3 - What is the basis of that number of genes? - are you just counting lines in a file? Where did this come from? How can you validate?
Based on the NCBI link, there are 39,493 genes.
Q2 - It would appear you are basing this on a file, maybe one you did not create? How would you determine # of CGs if file was not provided?
I am basing this on a file that I did not create. I could go through the C. virginica genome file and grep
all CpGs then count...?
Agree, I would grep (actually fgrep, faster if not using regular expressions) On Apr 30, 2019, 5:09 PM -0700, RobertsLab/resources reply@reply.github.com, wrote:
grep all CpGs then count.
Reminder, if a CG is split across lines of FastA, grep won't pick it up.
On Tue, Apr 30, 2019, 17:16 Steven Roberts notifications@github.com wrote:
Agree, I would grep (actually fgrep, faster if not using regular expressions) On Apr 30, 2019, 5:09 PM -0700, RobertsLab/resources < reply@reply.github.com>, wrote:
grep all CpGs then count.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/685#issuecomment-488159750, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCOCOBKQ3D75CPN43L6AHDPTDOMVANCNFSM4HJIQLCQ .
Also, make sure FastA headers don't contain any CGs.
On Tue, Apr 30, 2019, 17:23 Sam White samuel.j.white@gmail.com wrote:
Reminder, if a CG is split across lines of FastA, grep won't pick it up.
On Tue, Apr 30, 2019, 17:16 Steven Roberts notifications@github.com wrote:
Agree, I would grep (actually fgrep, faster if not using regular expressions) On Apr 30, 2019, 5:09 PM -0700, RobertsLab/resources < reply@reply.github.com>, wrote:
grep all CpGs then count.
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/RobertsLab/resources/issues/685#issuecomment-488159750, or mute the thread https://github.com/notifications/unsubscribe-auth/ABCOCOBKQ3D75CPN43L6AHDPTDOMVANCNFSM4HJIQLCQ .
@kubu4 How can I ensure that I pick up a CG that's split across lines? Also, does it make sense to then remove the headers before using fgrep
?
@yaaminiv -
How can I ensure that I pick up a CG that's split across lines? Also, does it make sense to then remove the headers before using fgrep?
not the point.
Just run fgrep on the file - and see what you get....
@sr320 I get 14,277,725 which is different than the 14,458,703 from the CG motif file.
Seems very reasonable given the caveats Sam has mention.
For the sake of moving forward you can assume the CG motif numbers are correct. And go ahead and answer Q3.
It of course would be great in the future to independently verify CG motifs.
On May 1, 2019, 4:04 PM -0700, Yaamini Venkataraman notifications@github.com, wrote:
@sr320 I get 14,277,725 which is different than the 14,458,703 from the CG motif file. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Q3: How many genes are there and how many of CGs overlap with these genes (show code snippet).
I know what intersectBed
code I could use to answer this question, but how do I actually get the gene start and stop positions in a tabular format? On this NCBI page I see a list of 39489 genes (which is not the same as the 39,493 genes listed in the Annotation Report).
Is there a way to modify the FastA file itself that I'm not thinking of?
This is not exactly super straightforward.... as you note
But NCBI does provide a GFF --- see https://d.pr/i/LXMZrn
Here is a bit
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build C_virginica-3.0
#!genome-build-accession NCBI_Assembly:GCF_002022765.2
#!annotation-date 14 September 2017
#!annotation-source NCBI Crassostrea virginica Annotation Release 100
##sequence-region NC_035780.1 1 65668440
##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6565
NC_035780.1 RefSeq region 1 65668440 . + . ID=id0;Dbxref=taxon:6565;Name=1;chromosome=1;collection-date=22-Mar-2015;country=USA;gbkey=Src;genome=chromosome;isolate=RU13XGHG1-28;isolation-source=Rutgers Haskin Shellfish Research Laboratory inbred lines (NJ);mol_type=genomic DNA;tissue-type=whole sample
NC_035780.1 Gnomon gene 13578 14594 . + . ID=gene0;Dbxref=GeneID:111116054;Name=LOC111116054;gbkey=Gene;gene=LOC111116054;gene_biotype=lncRNA
NC_035780.1 Gnomon lnc_RNA 13578 14594 . + . ID=rna0;Parent=gene0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;Name=XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;model_evidence=Supporting evidence includes similarity to: 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 1 sample with support for all annotated introns;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon exon 13578 13603 . + . ID=id1;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon exon 14237 14290 . + . ID=id2;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon exon 14557 14594 . + . ID=id3;Parent=rna0;Dbxref=GeneID:111116054,Genbank:XR_002636969.1;gbkey=ncRNA;gene=LOC111116054;product=uncharacterized LOC111116054;transcript_id=XR_002636969.1
NC_035780.1 Gnomon gene 28961 33324 . + . ID=gene1;Dbxref=GeneID:111126949;Name=LOC111126949;gbkey=Gene;gene=LOC111126949;gene_biotype=protein_coding
NC_035780.1 Gnomon mRNA 28961 33324 . + . ID=rna1;Parent=gene1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;Name=XM_022471938.1;gbkey=mRNA;gene=LOC111126949;model_evidence=Supporting evidence includes similarity to: 3 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 21 samples with support for all annotated introns;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon exon 28961 29073 . + . ID=id4;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon exon 30524 31557 . + . ID=id5;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon exon 31736 31887 . + . ID=id6;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon exon 31977 32565 . + . ID=id7;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon exon 32959 33324 . + . ID=id8;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XM_022471938.1;gbkey=mRNA;gene=LOC111126949;product=UNC5C-like protein;transcript_id=XM_022471938.1
NC_035780.1 Gnomon CDS 30535 31557 . + 0 ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon CDS 31736 31887 . + 0 ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon CDS 31977 32565 . + 1 ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon CDS 32959 33204 . + 0 ID=cds0;Parent=rna1;Dbxref=GeneID:111126949,Genbank:XP_022327646.1;Name=XP_022327646.1;gbkey=CDS;gene=LOC111126949;product=UNC5C-like protein;protein_id=XP_022327646.1
NC_035780.1 Gnomon gene 43111 66897 . - . ID=gene2;Dbxref=GeneID:111110729;Name=LOC111110729;gbkey=Gene;gene=LOC111110729;gene_biotype=protein_coding
NC_035780.1 Gnomon mRNA 43111 66897 . - . ID=rna2;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;Name=XM_022447324.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon exon 66869 66897 . - . ID=id9;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon exon 64123 64334 . - . ID=id10;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon exon 43111 44358 . - . ID=id11;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XM_022447324.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X1;transcript_id=XM_022447324.1
NC_035780.1 Gnomon CDS 64123 64219 . - 0 ID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X1;protein_id=XP_022303032.1
NC_035780.1 Gnomon CDS 43262 44358 . - 2 ID=cds1;Parent=rna2;Dbxref=GeneID:111110729,Genbank:XP_022303032.1;Name=XP_022303032.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X1;protein_id=XP_022303032.1
NC_035780.1 Gnomon mRNA 43111 46506 . - . ID=rna3;Parent=gene2;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;Name=XM_022447333.1;gbkey=mRNA;gene=LOC111110729;model_evidence=Supporting evidence includes similarity to: 1 Protein%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 14 samples with support for all annotated introns;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1
NC_035780.1 Gnomon exon 45913 46506 . - . ID=id12;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1
NC_035780.1 Gnomon exon 43111 44358 . - . ID=id13;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XM_022447333.1;gbkey=mRNA;gene=LOC111110729;product=FMRFamide receptor-like%2C transcript variant X2;transcript_id=XM_022447333.1
NC_035780.1 Gnomon CDS 45913 45997 . - 0 ID=cds2;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XP_022303041.1;Name=XP_022303041.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X2;protein_id=XP_022303041.1
NC_035780.1 Gnomon CDS 43262 44358 . - 2 ID=cds2;Parent=rna3;Dbxref=GeneID:111110729,Genbank:XP_022303041.1;Name=XP_022303041.1;gbkey=CDS;gene=LOC111110729;product=FMRFamide receptor-like isoform X2;protein_id=XP_022303041.1
NC_035780.1 Gnomon gene 85606 95254 . - . ID=gene3;Dbxref=GeneID:111112434;Name=LOC111112434;gbkey=Gene;gene=LOC111112434;gene_biotype=protein_coding
NC_035780.1 Gnomon mRNA 85606 95254 . - . ID=rna4;Parent=gene3;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;Name=XM_022449924.1;gbkey=mRNA;gene=LOC111112434;model_evidence=Supporting evidence includes similarity to: 7 Proteins%2C and 100%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 13 samples with support for all annotated introns;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
NC_035780.1 Gnomon exon 94571 95254 . - . ID=id14;Parent=rna4;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;gbkey=mRNA;gene=LOC111112434;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
NC_035780.1 Gnomon exon 88423 88589 . - . ID=id15;Parent=rna4;Dbxref=GeneID:111112434,Genbank:XM_022449924.1;gbkey=mRNA;gene=LOC111112434;product=homeobox protein Hox-B7-like;transcript_id=XM_022449924.1
A good bet would be to grep
out Gnomon gene
and see if this "looks reasonable".
@sr320 I am unable to access the FTP site in my web browser even though I'm signed into NCBI. How do I access the GFF?
You do not need to be signed in. What happens when you do this? https://d.pr/i/LXMZrn
Tried it with Chrome and it works...looks like it's a browser issue.
How do I grep
out Gnomon gene
? I've tried the following code and still get Gnomon exon
, Gnomon mRNA
, etc.
!grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \
| grep "gene" \
> C_virginica-3.0_Gnomon_gene.gff3
I need to somehow specify I only want lines with gene
in the third column, so I tried this awk
command, which still didn't work.
!grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \
| awk '$3 ~ gene' \
> C_virginica-3.0_Gnomon_gene.gff3
url to jupyter nb? It is likely easier to collaborate in nb. On May 8, 2019, 10:11 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:
How do I grep out Gnomon gene? I've tried the following code and still get Gnomon exon, Gnomon mRNA, etc. !grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \ | grep "gene" \
C_virginica-3.0_Gnomon_gene.gff3 I need to somehow specify I only want lines with gene in the third column, so I tried this awk command, which still didn't work. !grep "Gnomon" ref_C_virginica-3.0_top_level.gff3 \ | awk '$3 ~ gene' \ C_virginica-3.0_Gnomon_gene.gff3 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
grep "Gnomon gene"
?
Tried that! It's looking for Gnomon gene
in one column, which it can't find
try again - be sure there is a tab
To be extra safe you can just cut out of file and paste - this preserves tab
On May 8, 2019, 10:20 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:
tried that! didn't work! — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
preserving the tab worked. used:
!grep "Gnomon gene" ref_C_virginica-3.0_top_level.gff3 > C_virginica-3.0_Gnomon_gene.gff3
Counted 38,929 lines (genes), which is different from the 39,493 listed in the Annotation Report.
Looked at the annotation report more closely...39,493 genes and pseudogenes listed. Subtracting the 667 pseudogenes from this total, I get 38,826 genes. So this number is pretty close to the 38,929 lines I counted.
Counted 7,914,842 CG motif overlaps with the genome:
This is less than the 14,458,703 I counted in the CG motif file, but more than the 4,304,257 CpGs with 5x coverage.
but by genome you mean? On May 8, 2019, 10:40 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:
Counted 7,914,842 CG motif overlaps with the genome: — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
but by genome you mean?
I mean the genes I isolated with grep
(forgot to change that text)
Next would be exon versus intron On May 8, 2019, 10:43 AM -0700, Yaamini Venkataraman notifications@github.com, wrote:
but by genome you mean? I mean the genes I isolated with grep (forgot to change that text) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.
Next would be exon versus intron
CG motifs with exons and introns? Should I use the exon/intron GFF files that have already been created, and try extracting Gnmon exon
from the file I downloaded this morning to compare?
CG motifs with exons and introns?
What were you thinking?
Should I use the exon/intron GFF files that have already been created, and try extracting Gnmon exon from the file I downloaded this morning to compare?
You should derive an exon and intron track.
CG motifs overlap with exons (track I downloaded): 2323389
CG motifs overlap with introns (track I downloaded): 5297975
You should derive an exon and intron track
CG motifs overlap with exons (track I created): 2330546
I couldn't derive an intron track using grep "Gnomon intron"
since there is no intron listing in that third column. I could try subtracting the start and stop positions of exons from individual mRNA to generate an intron track, but would need some way of efficiently pairing the exons with mRNA in code
What software would you use to create an intron file, or for that matter almost anything to do with genome feature tracks?
@sr320 Created a new issue for trouble using subtractBed
to generate an intron track. Will return to this thread once that issue is resolved.
Here's a table with various feature tracks and the number of times the CG motifs overlap with the feature:
Feature | Number of Elements | CG Motif Overlaps with Feature |
---|---|---|
Genes | 38,929 | 7,914,842 |
Intergenic regions | 34,557 | 6,545,363 |
mRNA | 60,201 | 7,507,167 |
Exons | 731,279 | 2,330,546 |
Introns | 316,614 | 5,596,808 |
Coding sequences | 645,335 | 1,728,032 |
Non-coding sequences | 336,677 | 12,142,171 |
Untranslated regions of exons | 182,752 | 602,551 |
lncRNA | 4750 | 281,715 |
Your summary sheet shows more methylatedCpG than totalCpG for some categories
Updated my summary table with overlap counts using my newly generated tracks, and I no longer have that problem!
Based on your discussion with Loveday Conquest @mgavery and my inability to find another test that could work, I think I'll stick with the chi-squared tests.
R Markdown file
I characterized genomic locations for all CpGs in the C. virginica genome, all 5x CpGs, methylated CpGs, sparsely methylated CpGs, unmethylated CpGs, and DML. I'd like to take the information in this summary table and run a statistical analysis to determine if the distribution of CpGs in various categories are the same (ex. methylated CpGs vs. DML).
This has been done in previous papers by Claire and Mac with a chi-squared test. I used this method to see if my distributions were statistically different, but I'm actually not sure if chi-squared is the best way to test this. For example, I would like the distribution of methylated CpGs to serve as a background for the DML distribution in the genome, but when I use
chisq.test
my expected values don't reflect that "background" distribution. Looking at Claire's bioRXiv paper there was also some post-hoc test that was conducted, but I can't figure out what she did in this code.@mgavery Do you remember why you used a chi-squared test to compare distributions in your paper? Maybe I'm just not interpreting the test correctly?