WGLab / PennCNV

Copy number variation detection from SNP arrays
http://penncnv.openbioinformatics.org

non-European populations GC-waves models and ASA chip #97

Open Captain-Pam opened 1 year ago

Captain-Pam commented 1 year ago

Hi Kai, thank you for your tool. I am trying to apply it in my work. I have microarray data for about 3000 individuals genotyped on Illumina ASA (Asian Screening Array) chips, and I am trying to correct GC waves in LRR/BAF using PennCNV.

1. Do I need to compile the GC model file myself from gc5Base.txt.gz (hg19)?
2. Can that file (gc5Base.txt.gz) also be applied to non-European populations?

I am looking forward to hearing from you soon.

kaichop commented 1 year ago

1. Yes, you need to compile the GC model file based on gc5Base.txt.gz from the PennCNV package (in the lib/ folder).

2. Yes, it depends only on the reference genome.

Captain-Pam commented 1 year ago

Thank you for your reply. I will apply it to ASA.

Captain-Pam commented 1 year ago

Hi Kai, I have another question about "genomic_wave.pl".
Does PennCNV correct GC waves separately for autosomal and mitochondrial probes? When using "genomic_wave.pl" I noticed the "--distance" parameter, which sets the minimum marker-marker distance for training the model (default 1 Mb). However, the mitochondrial chromosome is only 16569 bp long. What "--distance" setting is appropriate for mitochondria?

I am looking forward to hearing from you soon.

kaichop commented 1 year ago

I do not have experience with mitochondria. The current gc5Base file cannot be used for mitochondria because its values are at 5 kb resolution. Even if you want to adjust for GC, you have to compile a GC model yourself using a custom window, such as the 1 kb of sequence surrounding each marker in the mitochondria.

Captain-Pam commented 1 year ago

Hi Kai, thank you for your reply. If I understand correctly, gc5Base.txt.gz cannot be used to compile the mitochondrial GC model with "cal_gc_snp.pl". Instead, I need to retrain a GC model for mitochondria using the mitochondrial reference genome, similar to the reference model (chr 11) in the paper. Is there a specific method for this training? It seems very difficult to do.

I am looking forward to hearing from you soon.

kaichop commented 1 year ago

I suggest not doing any adjustment for mitochondria, since it is too small. But if you do want to adjust, you only need to compile a GC model file that lists the GC content of the region surrounding each marker. There is nothing to train.
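For readers wondering why nothing needs to be trained in advance: the GC model file only supplies one GC value per marker, and any GC-related trend can then be estimated from a sample's own LRR values. The Python sketch below illustrates that idea with a plain least-squares line. It is an illustrative assumption, not the procedure implemented in genomic_wave.pl, and the function names `fit_line` and `adjust_lrr` are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("all GC values are identical; no trend can be fitted")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def adjust_lrr(lrr, gc):
    """Remove the linear GC trend from LRR while keeping the mean LRR.

    Assumption: lrr and gc are parallel lists, one entry per marker,
    with gc taken from a per-marker GC model file.
    """
    slope, _ = fit_line(gc, lrr)
    mean_gc = sum(gc) / len(gc)
    # Subtract only the GC-dependent part, centred on the mean GC,
    # so the overall LRR level of the sample is left unchanged.
    return [value - slope * (g - mean_gc) for value, g in zip(lrr, gc)]


# Toy data: LRR rises by 0.01 per GC percentage point around a level of 0.1.
gc_values = [30.0, 40.0, 50.0, 60.0]
lrr_values = [0.1 + 0.01 * (g - 45.0) for g in gc_values]
adjusted = adjust_lrr(lrr_values, gc_values)
```

Centring on the mean GC keeps the average LRR unchanged, which matters if the average LRR over a set of probes is itself the quantity of interest.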

Captain-Pam commented 1 year ago

Hi Kai, thank you for your advice. I am mainly focused on mitochondrial copy number, and I don't know how much GC affects the mitochondrial copy number estimates. To summarize: if I want to perform GC-wave correction, I first compile the GC model of the mitochondria using "cal_gc_snp.pl" and then correct GC waves with "genomic_wave.pl", setting "--distance" to 1 kb. I am looking forward to hearing from you soon.

kaichop commented 1 year ago

No, you cannot compile the GC model using cal_gc_snp.pl, because it requires a specific input file that does not include mitochondrial information. You need to compile the GC model yourself by writing a script that calculates the GC content of a window around each marker. Because of the 16 kb size, though, I doubt that GC has much influence on the copy number estimates. So if it is too challenging, you can skip the adjustment and see how the results look first.

Captain-Pam commented 1 year ago

Hi Kai, thank you for your reply. The main goal of my work is to perform GC-wave correction of the LRR for all SNPs on the array and to estimate copy number using the corrected LRR of the mitochondrial probes. Why can't I compile the GC model? I found that the "gc5Base" file downloaded from the PennCNV GitHub repository contains mitochondrial information (chrM in hg19.gc5Base.txt). The mitochondrial entries are as follows (4 rows in total, i.e. four ~5 kb fragments):

585 chrM 0 5120 chrM.566093 5 1024 579463291 /gbdb/hg19/wib/gc5Base.wib 0 100 1024 45520 2531200
585 chrM 5120 10240 chrM.566094 5 1024 579464315 /gbdb/hg19/wib/gc5Base.wib 0 100 1024 45900 2605200
585 chrM 10240 15360 chrM.566095 5 1024 579465339 /gbdb/hg19/wib/gc5Base.wib 0 100 1024 45160 2511200
585 chrM 15360 16570 chrM.566096 5 242 579466363 /gbdb/hg19/wib/gc5Base.wib 0 100 242 10840 603200

So I can use this file to compile the GC model for autosomes and mitochondria using "cal_gc_snp.pl".

Considering that mitochondria are smaller, it should also be possible to train autosomes (1Mb) and mitochondria (1kb) separately.
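To make the resolution issue concrete, here is a minimal Python sketch that parses the four chrM rows quoted above and derives the mean GC percentage of each fragment. It assumes the columns follow the UCSC wiggle summary-table layout, in which the twelfth and thirteenth fields are the count of valid values and the sum of the values; the helper names are hypothetical.

```python
# The four chrM rows quoted in this comment.
ROWS = """\
585 chrM 0 5120 chrM.566093 5 1024 579463291 /gbdb/hg19/wib/gc5Base.wib 0 100 1024 45520 2531200
585 chrM 5120 10240 chrM.566094 5 1024 579464315 /gbdb/hg19/wib/gc5Base.wib 0 100 1024 45900 2605200
585 chrM 10240 15360 chrM.566095 5 1024 579465339 /gbdb/hg19/wib/gc5Base.wib 0 100 1024 45160 2511200
585 chrM 15360 16570 chrM.566096 5 242 579466363 /gbdb/hg19/wib/gc5Base.wib 0 100 242 10840 603200
"""


def parse_row(line):
    """Pick out the fields needed for a per-fragment GC average.

    Assumed column order (UCSC wiggle table): bin, chrom, chromStart,
    chromEnd, name, span, count, offset, file, lowerLimit, dataRange,
    validCount, sumData, sumSquares.
    """
    f = line.split()
    return {
        "chrom": f[1],
        "start": int(f[2]),
        "end": int(f[3]),
        "valid_count": int(f[11]),
        "sum_data": float(f[12]),
    }


def mean_gc_percent(row):
    """Average GC percentage over the fragment."""
    return row["sum_data"] / row["valid_count"]


def fragment_gc_for_position(pos, fragments):
    """GC percentage of the fragment containing pos (0-based, half-open)."""
    for frag in fragments:
        if frag["start"] <= pos < frag["end"]:
            return mean_gc_percent(frag)
    return None


fragments = [parse_row(line) for line in ROWS.splitlines() if line.strip()]
for frag in fragments:
    print(frag["start"], frag["end"], round(mean_gc_percent(frag), 2))
```

Every marker that falls inside the same fragment receives the same value, so this summary can yield at most four distinct GC values across the whole mitochondrial genome.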

kaichop commented 1 year ago

I meant that you have to write your own script to compile the GC statistics, because the current gc5Base file is actually at 5 kb resolution, not 5 bp resolution. You can write a script that takes the mitochondrial sequence and calculates the GC content for each 1 kb window; it should also account for the circular shape, since the current calculation always assumes a linear genome. (You do not need to train any model; it should work directly as long as the GC content information is in the same gcmodel file.) Furthermore, chrM is not necessarily the same mitochondrial sequence that you may be using. I explained this in question 46 of https://annovar.openbioinformatics.org/en/latest/misc/faq/. So it is best to use the exact mitochondrial reference sequence that your SNP array uses.
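The per-marker calculation described here could be sketched in Python as follows. This is a minimal sketch under stated assumptions: the sequence is already loaded as a string, marker positions are 1-based, and the window is centred on the marker. The function names and the 1000-base default are illustrative, not part of PennCNV.

```python
def circular_window(seq, center, half_width):
    """Bases within half_width of the 1-based position `center`.

    Indices wrap around the ends of the sequence, so markers near
    position 1 or near the last base still get a full-size window.
    """
    n = len(seq)
    idx = center - 1  # convert to a 0-based index
    return "".join(
        seq[(idx + offset) % n]
        for offset in range(-half_width, half_width + 1)
    )


def gc_fraction(window):
    """(G+C)/(G+C+A+T); ambiguous bases such as N are ignored."""
    window = window.upper()
    gc = window.count("G") + window.count("C")
    at = window.count("A") + window.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def marker_gc(seq, position, window_size=1000):
    """GC fraction of a window of about window_size bases around a marker."""
    return gc_fraction(circular_window(seq, position, window_size // 2))


# Toy circular sequence: a marker at position 1 also sees the last two bases.
toy = "GGGGAAAAAA"
print(circular_window(toy, 1, 2))
print(marker_gc(toy, 1, window_size=4))
```

The window actually spans `2 * (window_size // 2) + 1` bases because the marker's own base is included. The result is a fraction between 0 and 1; multiply by 100 if the file it goes into stores percentages.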

Captain-Pam commented 1 year ago

Hi Kai, thank you for your time. I think I understand what you mean. For autosomes, the compiled GC model still uses 1 Mb. For mitochondria, because gc5Base comes in ~5 kb fragments, I cannot use it to compile the GC model. Therefore I need to calculate the GC percentage ((G+C)/(G+C+A+T)) of the 1 kb around each SNP myself. The file is similar to a gcmodel file, like this:

Name Chr Position GC
rs3333 chrMT 123 0.34

Finally, I merge the two files into one and use "genomic_wave.pl" to correct the GC-waves for LRR.
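The merge step can be sketched in Python as below. It assumes both files use the four-column layout shown in the example (Name, Chr, Position, GC) with a single header line; the function name `merge_gcmodels` and the sample rows are hypothetical. Two things the sketch does not check and that are worth verifying by hand: that both files express GC on the same scale (fraction or percentage), and that the chromosome label for mitochondrial markers matches the one used in the signal files.

```python
HEADER = ["Name", "Chr", "Position", "GC"]


def merge_gcmodels(primary_text, extra_text):
    """Concatenate two gcmodel-style tables under one header.

    If a marker name appears more than once, the first occurrence wins.
    """
    seen = set()
    out = ["\t".join(HEADER)]
    for text in (primary_text, extra_text):
        for line in text.splitlines():
            fields = line.split()
            if not fields or fields[0] == HEADER[0]:
                continue  # skip blank lines and repeated headers
            if len(fields) != len(HEADER):
                raise ValueError(f"unexpected column count: {line!r}")
            if fields[0] in seen:
                continue
            seen.add(fields[0])
            out.append("\t".join(fields))
    return "\n".join(out) + "\n"


# Hypothetical contents of an autosomal file and a mitochondrial file.
autosomal = "Name\tChr\tPosition\tGC\nrs1\t1\t100\t41.5\n"
mitochondrial = "Name\tChr\tPosition\tGC\nrs3333\tMT\t123\t34.0\nrs1\t1\t100\t41.5\n"
merged = merge_gcmodels(autosomal, mitochondrial)
print(merged, end="")
```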

My ASA array's probe file has a "source" column that lists different sources, such as dbSNP, rCRS, and NCBI. I think the source of the probes does not matter, so I'll just use rCRS as the reference. Thank you very much for your kind help!