NCBI-Hackathons / Scan2CNV

MIT License

PennCNV GC model file #16

ekarlins closed 7 years ago

ekarlins commented 7 years ago

Write code that uses the script "cal_gc_snp.pl", which comes with PennCNV, to generate a GC model file.

PennCNV is installed on our NCI cluster (CCAD/cgemsiii), so it's probably easiest to just run these tests there. On the cluster this is how you can see the help page for this script:

module load PennCNV/2015-v1.0.3
cal_gc_snp.pl -h

Please put working code for generating a GC model file in a .sh file in the "scripts" directory of this repo. Test the code by submitting the bash script to the cluster using qsub. Once you are confident that the code works, point us to the .sh file and close this ticket.

The GC model file may be specific to the genome build. If that makes it too specific for our pipeline, we may omit this step or just mention it as an option.

ekarlins commented 7 years ago

The GC model file is specific to genome build. This requires downloading a file from the UCSC browser. Using this file is optional for PennCNV. Our pipeline should allow PennCNV to be run using a GC model file, but we will rely on the user to generate this file.

See details below about how to generate this file. We may want to point users to this documentation.
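Going by the help text quoted below, the preparation steps might look roughly like the following sketch. The helper name `sort_gc_table` is hypothetical, and the sort keys are an assumption: they treat columns 2 and 3 of the UCSC table as chromosome and start position, which is one reading of the "sorted" requirement in the help text. The URL and file names are the hg18 examples from that help text, so they would need to change for other builds.

```shell
# Sketch: prepare the sorted GC table that cal_gc_snp.pl expects.
# sort_gc_table is a hypothetical helper name, not part of PennCNV.
sort_gc_table() {
    # $1: decompressed gc5Base.txt, $2: sorted output file.
    # Sort by chromosome (column 2), then numerically by start (column 3).
    # LC_ALL=C only makes the ordering reproducible across machines.
    LC_ALL=C sort -k 2,2 -k 3,3n "$1" > "$2"
}

# Possible end-to-end run (needs network access and PennCNV on the PATH):
#   wget http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/gc5Base.txt.gz
#   gunzip gc5Base.txt.gz
#   sort_gc_table gc5Base.txt gc5Base.txt.sorted
#   cal_gc_snp.pl gc5Base.txt.sorted signalfile -output file.gcmodel
```

The function only wraps the sort step, so it can be tried on a few lines of the table before committing to the full download.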

cal_gc_snp.pl -h
Usage: cal_gc_snp.pl [arguments]

 Optional arguments:
        -v, --verbose                   use verbose output
        -h, --help                      print help message
        -m, --man                       print complete documentation
            --numwindow <int>           number of sliding window (default=100, or 500kb on each side)
            --backgroundgc <float>      backgroud GC frequency (default=0.42)
            --output <file>             write output to this file

 Function: calculate GC content surrounding each marker within specified sliding 
 window, using the UCSC GC annotation file (for example, 
 http://hgdownload.cse.ucsc.edu/goldenPath/hg18/database/gc5Base.txt.gz for 
 human NCBI36 genome assembly) that is also sorted

 Example: cal_gc_snp.pl gc5Base.txt.sorted signalfile -output file.gcmodel

Options:

--help  print a brief help message and exit

--man   print the complete manual of how to use the program

--verbose
        use verbose output

--numwindow
        the number of non-overlapping sliding window on each side of the
        SNP.

--backgroundgc
        background GC level (for genomic regions without base
        information). By default it is 0.42 for human genome.

--output
        specify the output file name
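
As a rough illustration of how these options combine on one command line, here is a small wrapper sketch. `run_cal_gc_snp` and the `DRY_RUN` variable are hypothetical names introduced for this example. The defaults of 100 windows and a background GC of 0.42 are the ones stated in the help text above, and the positional arguments follow the order shown in its example.

```shell
# Sketch: call cal_gc_snp.pl with explicit values for the documented options.
# run_cal_gc_snp and DRY_RUN are hypothetical, not part of PennCNV.
run_cal_gc_snp() {
    sorted_gc="$1"            # sorted UCSC GC annotation file
    signal_file="$2"          # signal file listing the markers
    out_file="$3"             # GC model file to write
    num_window="${4:-100}"    # windows on each side of the SNP (help default)
    background_gc="${5:-0.42}" # background GC level (help default for human)

    # When DRY_RUN is set and non-empty, the command is echoed, not executed.
    ${DRY_RUN:+echo} cal_gc_snp.pl "$sorted_gc" "$signal_file" \
        --numwindow "$num_window" \
        --backgroundgc "$background_gc" \
        --output "$out_file"
}

# Example (prints the command without running it):
#   DRY_RUN=1 run_cal_gc_snp gc5Base.txt.sorted signalfile file.gcmodel 200
```

Printing the command first is a cheap way to check the arguments before submitting the job to the cluster.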