BoevaLab / FREEC

Control-FREEC: Copy number and genotype annotation in whole genome and whole exome sequencing data

questions about the output #140

Open ZYongQi opened 9 months ago

ZYongQi commented 9 months ago

Hi, this is ZY. I used FREEC to call CNVs in the genome successfully, but I still have two questions:

  1. The output looks like this:

    ID=gene-POFUT2        1  1255210   1274664   1264440   1302816   0   loss
    ID=gene-DYRK1A        1  7417053   7572985   7516776   7556136   8   gain
    ID=gene-TTC3          1  7693717   7749619   7699800   7826736   10  gain
    ID=gene-LOC117795648  1  7741425   7741527   7699800   7826736   10  gain
    ID=gene-LOC117801378  1  7751983   7752392   7699800   7826736   10  gain
    ID=gene-LOC100480655  1  7791807   7792944   7699800   7826736   10  gain
    ID=gene-LOC117801382  1  7795812   7796932   7699800   7826736   10  gain
    ID=gene-LOC117801055  1  7806381   7811440   7699800   7826736   10  gain
    ID=gene-LOC117801383  1  7820518   7824867   7699800   7826736   10  gain
    ID=gene-HLCS          1  7834905   8035784   7925136   7958592   0   loss
    ID=gene-LOC117801776  1  44252319  44863795  44619480  44641128  0   loss

It contains the predicted copy number. What does it mean when this value equals 0?

  2. A CNV is a region of the genome whose size ranges from approximately 1 kb to 3 Mb. How can I get gene copy numbers from CNVs?

Thank you for any advice. Best wishes to you!

valeu commented 9 months ago

Hello,

  1. 0 means that zero copies of DNA are predicted in this region.
  2. I guess you need to look at the value just before 'gain' and 'loss'. Also, visualize the normalized ratios in ratio.txt to make sure that the prediction is correct.

ZYongQi commented 9 months ago

Thank you for your reply. I'll visualize the normalized ratios in ratio.txt right away. Now please allow me to briefly introduce my "config.txt", since I've been confused about the CNVs file. This is part of my config file:

    ploidy = 2
    breakPointThreshold = 0.8
    maxThreads = 16
    minExpectedGC = 0.35
    maxExpectedGC = 0.55
    telocentromeric = 0
    coefficientOfVariation = 0.062
    degree = 3

  1. I set coefficientOfVariation rather than a fixed bin size, so that FREEC can choose an optimal window size for each sample. Will different window sizes influence the analysis if I try to combine the CNV output of different samples? Or would you suggest choosing a fixed window size, such as 100 bp? By the way, the value 0.062 comes from a similar study.

  2. I tried to map the CNVs to genes like this:

    GENE_ID               CHROMOSOME  GENE_START  GENE_STOP  CNV_START  CNV_STOP  CN  TYPE
    ID=gene-POFUT2        1           1255210     1274664    1264440    1302816   0   loss
    ID=gene-DYRK1A        1           7417053     7572985    7516776    7556136   8   gain
    ID=gene-TTC3          1           7693717     7749619    7699800    7826736   10  gain
    ID=gene-LOC117795648  1           7741425     7741527    7699800    7826736   10  gain
    ID=gene-LOC117801378  1           7751983     7752392    7699800    7826736   10  gain
    ID=gene-LOC100480655  1           7791807     7792944    7699800    7826736   10  gain
    ID=gene-LOC117801382  1           7795812     7796932    7699800    7826736   10  gain
    ID=gene-LOC117801055  1           7806381     7811440    7699800    7826736   10  gain
    ID=gene-LOC117801383  1           7820518     7824867    7699800    7826736   10  gain

I wonder about the relationship between CN and the gene location (start and stop). A CN of 10 means 10 copies of DNA are predicted in the region. Does it mean one CNV repeated 10 times, or 10 different CNVs? If I want to count the number of gains and losses, do I need to multiply by 10?

valeu commented 9 months ago

coefficientOfVariation = 0.062 will give you a reasonable window size that will not result in too much noise or too many false predictions. If the window size calculated by FREEC is close to 100, just use window=100, which will override coefficientOfVariation. Also, you can use a rule of thumb: 400 reads per window will result in low noise and good predictions.
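The 400-reads-per-window rule of thumb can be turned into a rough window-size estimate. The sketch below is only an illustration, not part of FREEC: it assumes reads are spread evenly over the genome, and the function name, genome size, and read count are hypothetical.

```python
def window_for_reads_per_window(genome_size_bp: int,
                                total_mapped_reads: int,
                                reads_per_window: int = 400) -> int:
    """Estimate a window size (bp) holding about `reads_per_window` reads.

    Assumes uniform coverage, which real data only approximates.
    """
    if genome_size_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("genome size and read count must be positive")
    mean_reads_per_bp = total_mapped_reads / genome_size_bp
    return max(1, round(reads_per_window / mean_reads_per_bp))


# Hypothetical sample: 2.4 Gb genome, 600 million mapped reads.
example_window = window_for_reads_per_window(2_400_000_000, 600_000_000)
print(example_window)
```

A value obtained this way could be compared with the window size FREEC derives from coefficientOfVariation before deciding whether to fix `window` in the config.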

valeu commented 9 months ago

Regarding the annotation of genes - I don't think that there is an official FREEC script to do so. How do you get this file with gene IDs?

ZYongQi commented 9 months ago

I made the annotation myself with a Perl script. I based this step on the positions of the predicted CNVs in the FREEC output file.

To be specific, I first took the position (start-end) of each gene from the .gff file from NCBI. Second, I looked for genes that overlap CNV regions using the following criterion: cnv_start <= gene_stop && cnv_stop >= gene_start. This gives a list of genes whose positions overlap CNVs. Finally, I merged the two files. Are there any problems with this approach?
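For reference, the overlap criterion can be sketched in Python as follows. This is not the Perl script used here; the class and function names are hypothetical. It assumes both files use 1-based inclusive coordinates and identical chromosome names, and it adds a chromosome check that the criterion as written leaves implicit. The `overlap_fraction` helper is an optional extra for seeing how much of a gene a CNV covers. The example rows are taken from the table earlier in this thread.

```python
from typing import List, NamedTuple, Tuple


class Gene(NamedTuple):
    gene_id: str
    chrom: str
    start: int
    stop: int


class Cnv(NamedTuple):
    chrom: str
    start: int
    stop: int
    copy_number: int
    kind: str  # "gain" or "loss"


def overlaps(gene: Gene, cnv: Cnv) -> bool:
    """Same chromosome and cnv_start <= gene_stop && cnv_stop >= gene_start."""
    return (gene.chrom == cnv.chrom
            and cnv.start <= gene.stop
            and cnv.stop >= gene.start)


def annotate(genes: List[Gene], cnvs: List[Cnv]) -> List[Tuple[Gene, Cnv]]:
    """Return every (gene, CNV) pair that overlaps."""
    return [(g, c) for c in cnvs for g in genes if overlaps(g, c)]


def overlap_fraction(gene: Gene, cnv: Cnv) -> float:
    """Fraction of the gene covered by the CNV (call on overlapping pairs)."""
    covered = min(gene.stop, cnv.stop) - max(gene.start, cnv.start) + 1
    return max(0, covered) / (gene.stop - gene.start + 1)


genes = [Gene("gene-POFUT2", "1", 1255210, 1274664),
         Gene("gene-DYRK1A", "1", 7417053, 7572985)]
cnvs = [Cnv("1", 1264440, 1302816, 0, "loss"),
        Cnv("1", 7516776, 7556136, 8, "gain")]

hits = annotate(genes, cnvs)
for gene, cnv in hits:
    print(gene.gene_id, cnv.copy_number, cnv.kind,
          f"{overlap_fraction(gene, cnv):.2f}")
```

In these two rows the CNV covers roughly half of POFUT2 and about a quarter of DYRK1A, so the reported CN describes only part of each gene.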

By the way, I have the ratio.txt file, but I wonder how the ratio value is calculated. Should I filter out ratio values that don't meet a certain threshold? And why is the copy number in ratio.txt always 2?

I would appreciate any advice. Best wishes!

valeu commented 9 months ago

The copy number in ratio.txt for the control sample should be 2 if you use a control. For the donor sample, it can be 2 almost everywhere if it is not a cancer sample. In any case, I suggest visualizing the output (ratio.txt) to make sure you can trust FREEC's predictions (using, for example, the R script included in the package). The ratios are normalized read count values: 1 means no change, and -1 means data not available.
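As a rough illustration of how to read these ratios, the sketch below converts a ratio into an approximate copy number. It assumes that the ratio scales linearly with copy number, so that ratio × ploidy approximates the copy number, and that any negative ratio marks missing data. The function name is hypothetical and this is not FREEC's own code.

```python
from typing import Optional


def ratio_to_copy_number(ratio: float, ploidy: int = 2) -> Optional[float]:
    """Approximate copy number for one window, or None if data is missing.

    Assumption: a ratio of 1 corresponds to `ploidy` copies.
    """
    if ratio < 0:  # -1 marks "data not available"
        return None
    return ratio * ploidy


for r in (1.0, 0.5, 0.0, -1.0):
    print(r, ratio_to_copy_number(r))
```

Windows that return None should be skipped when plotting or filtering, not treated as losses.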

ZYongQi commented 4 months ago

Hi, this is ZY. We summarized the quantity and distribution of CNVs and CNV regions, and I took your advice to visualize the ratio.txt file. But I still have doubts.

I used the R script FREEC_ratio2Absolute.R. One of its outputs shows:

    Chromosome   Start    End      Num_Probes  Segment_Mean
    NC_048218.1  1        1264440  1285        -0.0513244
    NC_048218.1  1264441  1302816  39          -3.715107
    NC_048218.1  1302817  3479424  2212        -0.05671026
    NC_048218.1  3479425  3504024  25          -4.576851
    NC_048218.1  3504025  3536496  33          0.01631089

What criteria should we use to filter the results: the number of probes or a specific Segment_Mean? By the way, why do some Segment_Mean values equal -Inf?
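One possible reading of the -Inf values, offered as an assumption and not as a statement about what FREEC_ratio2Absolute.R actually does: if Segment_Mean is the log2 of the normalized ratio, then a segment whose ratio is exactly 0 has log2(0), which R reports as -Inf. The Python sketch below mimics that behaviour; the function names are hypothetical.

```python
import math
from typing import Optional


def log2_ratio(ratio: float) -> Optional[float]:
    """log2 of a normalized ratio, mirroring R's log2(0) = -Inf."""
    if ratio < 0:      # e.g. -1 for "data not available"
        return None
    if ratio == 0:     # math.log2(0) raises in Python, so handle it here
        return float("-inf")
    return math.log2(ratio)


def ratio_from_segment_mean(segment_mean: float) -> float:
    """Invert the log2 transform (assumes Segment_Mean is a log2 ratio)."""
    return 2.0 ** segment_mean


print(log2_ratio(0.0))                      # -inf
print(ratio_from_segment_mean(-3.715107))   # a small ratio, close to 0
```

Under this assumption, the segment at 1264441-1302816 (Segment_Mean -3.715107) has a ratio near zero, which would agree with the copy number of 0 reported for the CNV at 1264440-1302816 in the gene table.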