Understanding .seg and .cna.seg output files

CuriusScientist commented 4 years ago

1) ID	chrom	start	end	num.mark	seg.median.logR	copy.number	call	subclone.status	logR_Copy_Number	Corrected_Copy_Number	Corrected_Call
x1	1	1000001	85500000	169	0.0796396516471526	3	GAIN	TRUE	3.11423640535062	3	GAIN
x1	1	85500001	91500000	12	0.322313941780044	4	AMP	FALSE	3.93033575436528	4	AMP
x1	1	91500001	120500000	58	0.074487787656138	3	GAIN	TRUE	3.08354976292766	3	GAIN
x1	1	146500001	178000000	63	0.48079976280186	5	HLAMP	FALSE	4.77866101549492	5	HLAMP
x1	1	178000001	243000000	130	0.293480238692837	4	AMP	FALSE	3.78576951498881	4	AMP
x1	1	243500001	246500000	6	-0.449437015307975	1	HETD	FALSE	0.903546917922327	1	HETD
x1	1	246500001	248500000	4	0.274288066563892	4	AMP	FALSE	3.69113317156701	4	AMP
x1	2	500001	243000000	485	-0.132844802060097	2	NEUT	FALSE	1.95357791771355	2	NEUT
x1	3	500001	197500000	394	0.134258643200168	3	GAIN	FALSE	3.03746830367397	3	GAIN
x1	4	500001	54000000	107	-0.115982582422186	2	NEUT	FALSE	2.01623034982981	2	NEUT

I am using a 500 kb window. In the .seg output file, each chromosome starts at position 500001 or later.

Can anyone explain the reason behind this?

2) chr	start	end	x1.copy.number	x1.event	x1.logR	x1.subclone.status	x1.Corrected_Copy_Number	x1.Corrected_Call	x1.logR_Copy_Number
1	1000001	1500000	3	GAIN	NA	1	3	GAIN	NA
1	1500001	2000000	3	GAIN	0.0915	1	3	GAIN	3.18529948326013
1	2000001	2500000	3	GAIN	NA	1	3	GAIN	NA
1	3000001	3500000	3	GAIN	0.0091	1	3	GAIN	2.7034467672483

Looking at the output from the .cna.seg file, I can see that some logR_Copy_Number values are NA. Can someone tell me why they are NA and how this value is derived?

3) Moreover, I found two instances in the .cna.seg file where a chromosome starts at position 1:

chr	start	end	x1.copy.number	x1.event	x1.logR	x1.subclone.status	x1.Corrected_Copy_Number	x1.Corrected_Call	x1.logR_Copy_Number
2	1	500000	2	NEUT	NA	0	2	NEUT	NA
4	1	500000	2	NEUT	NA	0	2	NEUT	NA

Why do these bins never appear in the .seg file?

4) Am I correct in assuming that logR_Copy_Number in the .seg file is the median of the values from the .cna.seg file?

gavinha commented 4 years ago

Hi @CuriusScientist

1) I am using a 500 kb window. In the .seg output file, each chromosome starts at position 500001 or later. Can anyone explain the reason behind this?

The first bin in each chromosome, regardless of size, tends to have outlier GC content or mappability score and is usually filtered out.
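To see how many leading bins were dropped on each chromosome, the first segment start can be converted to a bin count. This is a minimal sketch, assuming fixed 500 kb bins with 1-based starts; the helper name `leading_bins_skipped` is hypothetical and not part of ichorCNA. The start positions are copied from the .seg rows in the question.

```python
WINDOW = 500_000  # bin size used in this run (500 kb)

def leading_bins_skipped(first_segment_start, window=WINDOW):
    """Count whole bins that lie before the first segment (1-based starts)."""
    return (first_segment_start - 1) // window

# First segment start per chromosome, taken from the .seg excerpt above
first_starts = {"1": 1_000_001, "2": 500_001, "3": 500_001, "4": 500_001}

skipped = {chrom: leading_bins_skipped(start) for chrom, start in first_starts.items()}
print(skipped)  # {'1': 2, '2': 1, '3': 1, '4': 1}
```

In this excerpt, chromosomes 2 to 4 lose only the first bin, while chromosome 1 loses the first two.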

2) Looking at the output from the .cna.seg file, I can see that some logR_Copy_Number values are NA. Can someone tell me why they are NA and how this value is derived?

Same answer as above. NA for logR is usually due to GC content and mappability values being incompatible or outliers.
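One way to check which bins are affected is to list the rows whose logR is NA. The sketch below uses only the .cna.seg rows quoted in the question, pasted as whitespace-separated text for convenience (a real .cna.seg file is tab-separated and would be read from disk instead). The variable names are illustrative only.

```python
# Rows copied from the .cna.seg excerpts in the question
CNA_SEG = """\
chr start end x1.copy.number x1.event x1.logR x1.subclone.status x1.Corrected_Copy_Number x1.Corrected_Call x1.logR_Copy_Number
1 1000001 1500000 3 GAIN NA 1 3 GAIN NA
1 1500001 2000000 3 GAIN 0.0915 1 3 GAIN 3.18529948326013
1 2000001 2500000 3 GAIN NA 1 3 GAIN NA
1 3000001 3500000 3 GAIN 0.0091 1 3 GAIN 2.7034467672483
2 1 500000 2 NEUT NA 0 2 NEUT NA
4 1 500000 2 NEUT NA 0 2 NEUT NA
"""

lines = CNA_SEG.strip().splitlines()
header = lines[0].split()
bins = [dict(zip(header, line.split())) for line in lines[1:]]

# Bins whose logR was set to NA
na_bins = [(b["chr"], int(b["start"])) for b in bins if b["x1.logR"] == "NA"]

# Check that logR_Copy_Number is NA exactly when logR is NA
consistent = all(
    (b["x1.logR"] == "NA") == (b["x1.logR_Copy_Number"] == "NA") for b in bins
)

print(na_bins)     # [('1', 1000001), ('1', 2000001), ('2', 1), ('4', 1)]
print(consistent)  # True
```

In these rows, logR_Copy_Number is NA whenever logR is NA, which fits it being derived from logR.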

3) Moreover, I found two instances in the .cna.seg file where a chromosome starts at position 1 (chromosomes 2 and 4, bin 1 to 500000, both with NA for logR). Why do these bins never appear in the .seg file?

Those bins have NA for logR (for the same reasons as above), and subsequent analysis of these bins returns NA, so they are omitted from the segment file.

4) Am I correct in assuming that logR_Copy_Number in the .seg file is the median of the values from the .cna.seg file?

The logR_Copy_Number is the logR after correction for estimated tumor fraction and ploidy. The Corrected_Copy_Number is computed as round(logR_Copy_Number); this is the copy number result you should use.
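The rounding relationship can be checked against the .seg rows posted in the question. This is only a sanity check on those ten rows, not a reimplementation of ichorCNA; the value pairs are copied from the logR_Copy_Number and Corrected_Copy_Number columns.

```python
# (logR_Copy_Number, Corrected_Copy_Number) pairs from the .seg excerpt in the question
seg_rows = [
    (3.11423640535062, 3),
    (3.93033575436528, 4),
    (3.08354976292766, 3),
    (4.77866101549492, 5),
    (3.78576951498881, 4),
    (0.903546917922327, 1),
    (3.69113317156701, 4),
    (1.95357791771355, 2),
    (3.03746830367397, 3),
    (2.01623034982981, 2),
]

# Collect any row where rounding does not reproduce the corrected copy number
mismatches = [(value, cn) for value, cn in seg_rows if round(value) != cn]
print(mismatches)  # []
```

Every row in the excerpt satisfies Corrected_Copy_Number == round(logR_Copy_Number).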

Hope this helps, Gavin

CuriusScientist commented 4 years ago

@gavinha Many thanks for the explanation.

broadinstitute / ichorCNA

Understanding .seg and .cna.seg output files #75