BoevaLab / FREEC

Control-FREEC: Copy number and genotype annotation in whole genome and whole exome sequencing data

_CNVs file versus _ratio.txt file versus capture regions BED file #29

Open igordot opened 7 years ago

igordot commented 7 years ago

I am trying to understand the difference between the regions in _CNVs and _ratio.txt and capture regions BED files. From what I understand, _CNVs will have all the regions with alterations after merging neighboring regions. That would make it a subset of _ratio.txt, but I see regions in _CNVs that aren't present in _ratio.txt. Is that expected?

I also compared _ratio.txt to the capture regions BED file. They seem to be identical, but _ratio.txt is heavily filtered (more than half the regions are filtered). The filtering seems to be based on the matched normal (all _ratio.txt files using the same matched normal have the same length). The regions with CopyNumber set to -1 do not make it to _CNVs, since there is insufficient data there. What is the difference between -1 regions and completely missing regions and why are so many missing? I am looking at the BAMs at some of the missing regions and they seem okay.

valeu commented 7 years ago

> I see regions in _CNVs that aren't present in _ratio.txt. Is that expected?

When the mappability in a window or in several neighboring windows is low, the ratio file can contain '-1' values. But FREEC can usually make a guess about such regions. For instance, if there is a gain on the left and on the right, these 'unknown' regions will also be assigned a gain status and will appear as such in the CNVs file.
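To make the idea concrete, here is a toy sketch of that kind of gap filling. It is not FREEC's actual code: the function name `fill_unknown` is hypothetical, and it assumes the simplest possible rule, namely that a run of '-1' windows inherits the copy number of its neighbors only when the window on the left and the window on the right agree exactly.

```python
def fill_unknown(copy_numbers, unknown=-1):
    """Toy illustration only (not FREEC's algorithm): fill runs of unknown
    windows when the known windows on both sides have the same copy number."""
    filled = list(copy_numbers)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] != unknown:
            i += 1
            continue
        # Find the end of this run of unknown windows: the run is [i, j).
        j = i
        while j < n and filled[j] == unknown:
            j += 1
        # Fill only if the run has a known neighbor on each side and they agree.
        if i > 0 and j < n and filled[i - 1] == filled[j]:
            for k in range(i, j):
                filled[k] = filled[i - 1]
        i = j
    return filled


# Windows 2-3 sit between two gains (3 copies), so they become gains too.
# The last unknown window sits between 2 and 3 copies, so it stays unknown.
print(fill_unknown([3, 3, -1, -1, 3, 2, -1, 2, -1, 3]))
```

Under this assumed rule, a region can show up as a gain in a merged segment even though its own windows were '-1' in the per-window output.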

I would not expect regions to be missing from the ratio file. They should simply get '-1' values. Regions may disappear only if you work with exome data. Is your data WGS or WES?

igordot commented 7 years ago

It's WES. Why would the regions disappear and why would _CNVs have regions that aren't in ratio.txt?

valeu commented 7 years ago

Because for WES there is no point in outputting all windows in the genome, regions with few or no reads in the control dataset are removed. I think that if you want to see all regions of the genome, you should set printNA=TRUE. See http://boevalab.com/FREEC/tutorial.html#CONFIG
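One way to check what actually ended up in a ratio file is to count how many rows are '-1' and how many carry a call. The sketch below assumes the ratio file is tab-separated with a header row that includes a `CopyNumber` column; the function name `summarize_ratio` and the example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def summarize_ratio(handle):
    """Count rows with CopyNumber == -1 versus rows with a copy number call.

    Assumes a tab-separated file with a header containing 'CopyNumber'.
    """
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["CopyNumber"].strip() == "-1":
            counts["unknown"] += 1
        else:
            counts["called"] += 1
    return counts


# Made-up example content; with a real file, pass open(path, newline="").
example = io.StringIO(
    "Chromosome\tStart\tRatio\tMedianRatio\tCopyNumber\n"
    "1\t1000\t1.02\t1.00\t2\n"
    "1\t2000\t-1\t-1\t-1\n"
    "1\t3000\t1.48\t1.50\t3\n"
)
print(summarize_ratio(example))
```

Comparing the total row count with the number of lines in the capture regions BED file then shows how many regions were dropped entirely rather than reported as '-1'.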

igordot commented 7 years ago

Thanks for clarifying.

So what should I do if I want to see copy number info for a specific region? Sometimes a region is only in _CNVs and sometimes it's only in ratio.txt. Is there a single file I can check?

valeu commented 6 years ago

Igor, as I understand it: the _CNVs file contains the start and end positions of CNAs, while ratio.txt contains values per bin or per exon. So to find the copy number of a given region, check whether the region is included in (or partially overlaps) any CNA from _CNVs. If it is not, the corresponding copy number is equal to the main ploidy.
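The lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the _CNVs file is taken to be tab-separated with chromosome, start, end and copy number in the first four columns, the chromosome names in the query must match the file's naming, and `load_cnvs`, `copy_number_for_region` and the `main_ploidy` default of 2 are hypothetical names and values, not part of FREEC.

```python
import csv
import io


def load_cnvs(handle):
    """Read CNA segments as (chrom, start, end, copy_number) tuples.

    Assumes columns 1-4 are chromosome, start, end, copy number.
    Rows that do not look like data (for example a header) are skipped.
    """
    segments = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 4 or not row[1].isdigit() or not row[2].isdigit():
            continue
        segments.append((row[0], int(row[1]), int(row[2]), int(row[3])))
    return segments


def copy_number_for_region(segments, chrom, start, end, main_ploidy=2):
    """Return (start, end, copy_number) for every CNA overlapping the region.

    If no CNA overlaps, the region is reported at the main ploidy.
    """
    hits = [
        (s, e, cn)
        for c, s, e, cn in segments
        if c == chrom and s <= end and e >= start
    ]
    return hits if hits else [(start, end, main_ploidy)]


# Made-up segments; with a real file, pass open(path, newline="").
cnvs = load_cnvs(io.StringIO("1\t100\t200\t3\tgain\n1\t500\t900\t1\tloss\n"))
print(copy_number_for_region(cnvs, "1", 150, 160))  # overlaps the gain
print(copy_number_for_region(cnvs, "1", 300, 400))  # no overlap: main ploidy
```

A region that partially overlaps a CNA, or spans more than one, gets every overlapping segment back, so the caller can decide how to summarize it.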