WGLab / PennCNV

Copy number variation detection from SNP arrays
http://penncnv.openbioinformatics.org

Output file sometimes huge, sometimes small, regardless of number of samples??? #107

Open Jolleboll opened 1 year ago

Jolleboll commented 1 year ago

Hello there!

I am running a home-made pipeline where I start with .idat files that I run through MOCHA: https://github.com/freeseek/mocha

This gives me .bcf files that I parse with my own Python script to create .pfb files and sample input files for PennCNV.

My question is this: sometimes the .log file is big and sometimes it's small, and the .tsv file ranges from very small to HUGE, seemingly regardless of how many samples I used. See the table below. These are all Illumina human exome arrays run at different times over the last 15 or so years.

.tsv size    .log size    number of samples
3.2M         15M          955
19M          93M          800
5.9G         19M          1050
32M          1.4M         185
5.1M         15M          680
14G          70M          1325
223M         397M         7637
946M         14M          500
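To make the "regardless of number of samples" point concrete, here is a minimal sketch that normalizes each .tsv size by its sample count. The numbers are copied from the table above; the helper names are hypothetical, and I am assuming the size suffixes are binary units (M = 2^20 bytes, G = 2^30 bytes), as printed by tools like `ls -lh`.

```python
# Unit multipliers: an assumption, the table does not say whether M/G are binary or decimal.
UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}

def to_bytes(size: str) -> float:
    """Convert a human-readable size such as '3.2M' to bytes."""
    return float(size[:-1]) * UNITS[size[-1]]

def kib_per_sample(size: str, n_samples: int) -> float:
    """Average .tsv size per sample, in KiB."""
    return to_bytes(size) / n_samples / 1024

# (.tsv size, number of samples), copied from the table above.
runs = [
    ("3.2M", 955), ("19M", 800), ("5.9G", 1050), ("32M", 185),
    ("5.1M", 680), ("14G", 1325), ("223M", 7637), ("946M", 500),
]

for size, n in runs:
    print(f"{size:>5}  {n:>5} samples  {kib_per_sample(size, n):10.1f} KiB/sample")
```

Under that assumption, the per-sample .tsv size spans more than three orders of magnitude across the batches, so the differences cannot be explained by sample count alone.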

As far as I understand, a big .tsv file implies many CNV calls, and you wrote in another issue that this implies low-quality data. My smallest .tsv file is 3.2M. Should I consider it "botched", or "devoid of false positives"?

When I manually compare the biggest and smallest .tsv files, I notice that the bigger file has an enormous number of cn=0 calls, and that its average "numsnp" per call is much lower. Is this what you mean by "low quality data"? I know for a fact that some of the arrays were much sparser than others, but I was only recently hired and know few details of how these idats came to be.
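Instead of eyeballing the files, the comparison could be done per sample with something like the sketch below. It assumes my .tsv files contain the usual PennCNV call lines (`chr1:100-200 numsnp=6 length=4,040 state2,cn=1 sample.txt startsnp=... endsnp=...`); the function name and the file name in the usage note are hypothetical, and lines that do not match that shape are skipped.

```python
import re
from collections import defaultdict

# Assumed call-line shape: "... numsnp=N length=L stateS,cn=C <sample file> ..."
CALL = re.compile(r"numsnp=(\d+)\s+length=([\d,]+)\s+state\d+,cn=(\d+)\s+(\S+)")

def summarize_calls(lines):
    """Per sample: number of calls, fraction of cn=0 calls, mean numsnp per call."""
    totals = defaultdict(lambda: {"calls": 0, "cn0": 0, "numsnp": 0})
    for line in lines:
        match = CALL.search(line)
        if match is None:
            continue  # not a call line (header, log message, blank line)
        numsnp, _length, cn, sample = match.groups()
        entry = totals[sample]
        entry["calls"] += 1
        entry["cn0"] += int(cn == "0")
        entry["numsnp"] += int(numsnp)
    return {
        sample: {
            "calls": t["calls"],
            "cn0_fraction": t["cn0"] / t["calls"],
            "mean_numsnp": t["numsnp"] / t["calls"],
        }
        for sample, t in totals.items()
    }
```

For example, `with open("batch1.tsv") as fh: summary = summarize_calls(fh)` gives one entry per sample, so samples with thousands of calls, a high cn=0 fraction, or a very low mean numsnp stand out immediately.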

Thank you so much in advance, I have no one else to ask, everyone trusts me to get this right :))