Hello there!
I am running a home-made pipeline where I start with .idat files that I run through MOCHA: https://github.com/freeseek/mocha
This gives me .bcf files that I parse with my own Python script to create .pfb files and sample input files for PennCNV.
My question: the .log file is sometimes big and sometimes small, and the .tsv file is sometimes very small and sometimes HUGE, seemingly regardless of how many samples I used. See the table below. These are all Illumina human exome arrays run at different times over the last 15 or so years.
As far as I understand, a big .tsv file implies many CNV calls, and you wrote in another issue that this implies low-quality data. My smallest .tsv file is 19M. Should I consider it "botched" or "devoid of false positives"?
When I manually compare the biggest and smallest .tsv files, I notice that the bigger file has an enormous number of cn=0 calls and a much lower average "numsnp". Is this what you mean by "low quality data"? I know for a fact that some of the arrays were much sparser than others, but I was newly employed and know few details of how these idats came to be.
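The manual comparison described above could be scripted roughly as follows. This is only a minimal sketch: it assumes each call line contains `numsnp=<int>` and `cn=<int>` tokens, as in PennCNV's usual call output, and the function name `summarize_calls` and the example lines are made up for illustration. If the .tsv files use a different layout, the regular expressions would need adjusting.

```python
import re
from collections import Counter

# Assumed token layout (PennCNV-style call lines); adjust if the files differ.
NUMSNP_RE = re.compile(r"numsnp=(\d+)")
CN_RE = re.compile(r"cn=(\d+)")


def summarize_calls(lines):
    """Tally copy-number states and the mean numsnp over an iterable of call lines."""
    cn_counts = Counter()
    numsnp_total = 0
    n_calls = 0
    for line in lines:
        numsnp = NUMSNP_RE.search(line)
        cn = CN_RE.search(line)
        if not (numsnp and cn):
            continue  # skip headers or lines that do not match the assumed format
        cn_counts[int(cn.group(1))] += 1
        numsnp_total += int(numsnp.group(1))
        n_calls += 1
    mean_numsnp = numsnp_total / n_calls if n_calls else float("nan")
    return {
        "n_calls": n_calls,
        "cn_counts": dict(cn_counts),
        "mean_numsnp": mean_numsnp,
    }


# Made-up example lines, not real data.
example = [
    "chr1:100-200 numsnp=4 length=101 state1,cn=0 sampleA.txt startsnp=rs1 endsnp=rs2",
    "chr2:300-900 numsnp=10 length=601 state5,cn=3 sampleA.txt startsnp=rs3 endsnp=rs4",
]
summary = summarize_calls(example)
print(summary)
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, which makes it easy to put the cn=0 share and mean numsnp of each .tsv file side by side.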
Thank you so much in advance, I have no one else to ask, everyone trusts me to get this right :))