WGLab / PennCNV

Copy number variation detection from SNP arrays
http://penncnv.openbioinformatics.org

Creating individual log and BAF input files #81

Closed sophie-03 closed 2 years ago

sophie-03 commented 2 years ago

I am trying to run PennCNV on a large set of samples (500,000 individuals). My data is currently in the form of two files: one containing all of the log allele frequency data for all samples and one containing all the BAF data for all samples. As I understand, the format for log and BAF data required by PennCNV is a separate file for each sample, containing both log allele and BAF data. With so many samples, creating these individual files is using too much computing power to be feasible for me.

Is it possible to input data without creating a separate file for each sample? Or is there an 'easy' way to create the individual files that is less computer-intensive? I'm currently using a bash script that contains a for loop to cut and paste the columns from the original files into a new file for each sample.

kaichop commented 2 years ago

If the markers are indeed in the same order in the two files, then you can simply "cut" and "paste" to generate files on the fly and run the analysis; this is probably what you are already doing. In practice, if you have, say, 50 different computing nodes, it makes sense to first split the data into 50 x 2 separate files (each with 1000 samples) and run the procedure above on each node, which should be much faster due to the reduced file reading.
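A minimal sketch of this cut/paste split in shell, under stated assumptions: the file names `lrr_all.txt`/`baf_all.txt` and the tiny demo matrices are hypothetical stand-ins for the real combined files, with column 1 holding marker names, columns 2 onward holding one sample each, and markers in identical order in both files. The per-sample header follows PennCNV's usual signal-file convention (`Name`, `<sample>.Log R Ratio`, `<sample>.B Allele Freq`).

```shell
#!/bin/sh
set -eu

# Stand-in combined matrices (hypothetical file names and values):
# column 1 = marker name, columns 2.. = one sample per column.
printf 'Name\tS1\tS2\nrs1\t0.10\t0.20\nrs2\t0.30\t0.40\n' > lrr_all.txt
printf 'Name\tS1\tS2\nrs1\t0.50\t0.60\nrs2\t0.70\t0.80\n' > baf_all.txt

col=2
for s in $(head -n1 lrr_all.txt | cut -f2-); do
    # Header in the per-sample format PennCNV expects.
    printf 'Name\t%s.Log R Ratio\t%s.B Allele Freq\n' "$s" "$s" > "$s.txt"
    # Marker name + LRR column from one matrix, matching BAF column
    # from the other, pasted side by side (markers already aligned).
    tail -n +2 lrr_all.txt | cut -f1,"$col" > .lrr_col.tmp
    tail -n +2 baf_all.txt | cut -f"$col"   > .baf_col.tmp
    paste .lrr_col.tmp .baf_col.tmp >> "$s.txt"
    col=$((col + 1))
done
rm -f .lrr_col.tmp .baf_col.tmp
```

Each resulting `S1.txt`, `S2.txt`, ... can be passed to `detect_cnv.pl` (and deleted afterward, so the files exist only transiently). For a cluster run, the same idea applies after pre-splitting the columns into per-node chunks with `cut`, so no job has to re-read the full 500,000-column matrices.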
