broadinstitute / ichorCNA

Estimating tumor fraction in cell-free DNA from ultra-low-pass whole genome sequencing.
GNU General Public License v3.0

Understanding .seg and .cna.seg output files #75

Closed CuriusScientist closed 4 years ago

CuriusScientist commented 4 years ago
1)

ID chrom start end num.mark seg.median.logR copy.number call subclone.status logR_Copy_Number Corrected_Copy_Number Corrected_Call
x1 1 1000001 85500000 169 0.0796396516471526 3 GAIN TRUE 3.11423640535062 3 GAIN
x1 1 85500001 91500000 12 0.322313941780044 4 AMP FALSE 3.93033575436528 4 AMP
x1 1 91500001 120500000 58 0.074487787656138 3 GAIN TRUE 3.08354976292766 3 GAIN
x1 1 146500001 178000000 63 0.48079976280186 5 HLAMP FALSE 4.77866101549492 5 HLAMP
x1 1 178000001 243000000 130 0.293480238692837 4 AMP FALSE 3.78576951498881 4 AMP
x1 1 243500001 246500000 6 -0.449437015307975 1 HETD FALSE 0.903546917922327 1 HETD
x1 1 246500001 248500000 4 0.274288066563892 4 AMP FALSE 3.69113317156701 4 AMP
x1 2 500001 243000000 485 -0.132844802060097 2 NEUT FALSE 1.95357791771355 2 NEUT
x1 3 500001 197500000 394 0.134258643200168 3 GAIN FALSE 3.03746830367397 3 GAIN
x1 4 500001 54000000 107 -0.115982582422186 2 NEUT FALSE 2.01623034982981 2 NEUT

I am using a 500 kb window. In the .seg output file, each chromosome starts at position 500001 or later.

Can anyone explain the reason for this?

2)

chr start end x1.copy.number x1.event x1.logR x1.subclone.status x1.Corrected_Copy_Number x1.Corrected_Call x1.logR_Copy_Number
1 1000001 1500000 3 GAIN NA 1 3 GAIN NA
1 1500001 2000000 3 GAIN 0.0915 1 3 GAIN 3.18529948326013
1 2000001 2500000 3 GAIN NA 1 3 GAIN NA
1 3000001 3500000 3 GAIN 0.0091 1 3 GAIN 2.7034467672483

Looking at the .cna.seg output, I can see that some logR_Copy_Number values are "NA". Can someone tell me why they are NA and how this value is derived?

3) Moreover, I found two instances in the .cna.seg file where a chromosome starts at position 1:

chr start end x1.copy.number x1.event x1.logR x1.subclone.status x1.Corrected_Copy_Number x1.Corrected_Call x1.logR_Copy_Number
2 1 500000 2 NEUT NA 0 2 NEUT NA
4 1 500000 2 NEUT NA 0 2 NEUT NA

Why do these bins never appear in the .seg file?

4) Am I correct in assuming that logR_Copy_Number in the .seg file is the median of the values from the .cna.seg file?

gavinha commented 4 years ago

Hi @CuriusScientist

1)

ID chrom start end num.mark seg.median.logR copy.number call subclone.status logR_Copy_Number Corrected_Copy_Number Corrected_Call
x1 1 1000001 85500000 169 0.0796396516471526 3 GAIN TRUE 3.11423640535062 3 GAIN
x1 1 85500001 91500000 12 0.322313941780044 4 AMP FALSE 3.93033575436528 4 AMP
x1 1 91500001 120500000 58 0.074487787656138 3 GAIN TRUE 3.08354976292766 3 GAIN
x1 1 146500001 178000000 63 0.48079976280186 5 HLAMP FALSE 4.77866101549492 5 HLAMP
x1 1 178000001 243000000 130 0.293480238692837 4 AMP FALSE 3.78576951498881 4 AMP
x1 1 243500001 246500000 6 -0.449437015307975 1 HETD FALSE 0.903546917922327 1 HETD
x1 1 246500001 248500000 4 0.274288066563892 4 AMP FALSE 3.69113317156701 4 AMP
x1 2 500001 243000000 485 -0.132844802060097 2 NEUT FALSE 1.95357791771355 2 NEUT
x1 3 500001 197500000 394 0.134258643200168 3 GAIN FALSE 3.03746830367397 3 GAIN
x1 4 500001 54000000 107 -0.115982582422186 2 NEUT FALSE 2.01623034982981 2 NEUT

I am using a 500 kb window. In the .seg output file, each chromosome starts at position 500001 or later.

Can anyone explain the reason for this?

The first bin in each chromosome, regardless of size, tends to have outlier GC content or mappability score and is usually filtered out.

2)

chr start end x1.copy.number x1.event x1.logR x1.subclone.status x1.Corrected_Copy_Number x1.Corrected_Call x1.logR_Copy_Number
1 1000001 1500000 3 GAIN NA 1 3 GAIN NA
1 1500001 2000000 3 GAIN 0.0915 1 3 GAIN 3.18529948326013
1 2000001 2500000 3 GAIN NA 1 3 GAIN NA
1 3000001 3500000 3 GAIN 0.0091 1 3 GAIN 2.7034467672483

Looking at the .cna.seg output, I can see that some logR_Copy_Number values are "NA". Can someone tell me why they are NA and how this value is derived?

Same answer as above. A logR of NA usually means the bin's GC content or mappability value was an outlier or otherwise unusable, so the bin was filtered out. Since logR_Copy_Number is computed from logR, it is NA for those bins as well.

3) Moreover, I found two instances in the .cna.seg file where a chromosome starts at position 1:

chr start end x1.copy.number x1.event x1.logR x1.subclone.status x1.Corrected_Copy_Number x1.Corrected_Call x1.logR_Copy_Number
2 1 500000 2 NEUT NA 0 2 NEUT NA
4 1 500000 2 NEUT NA 0 2 NEUT NA

Why do these bins never appear in the .seg file?

Those bins have NA for logR (for the same reasons as above), and subsequent analysis of these bins returns NA, so they are left out of the segment file.
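To see which bins this affects in your own output, here is a minimal Python sketch that splits .cna.seg rows by whether logR is NA. It is only an illustration, not part of ichorCNA: the rows are the ones pasted earlier in this thread, and the helper names `parse_bins` and `split_by_logr` are hypothetical.

```python
# Rows copied from the .cna.seg excerpts in this thread (whitespace-separated).
CNA_SEG = """\
chr start end x1.copy.number x1.event x1.logR x1.subclone.status x1.Corrected_Copy_Number x1.Corrected_Call x1.logR_Copy_Number
1 1000001 1500000 3 GAIN NA 1 3 GAIN NA
1 1500001 2000000 3 GAIN 0.0915 1 3 GAIN 3.18529948326013
1 2000001 2500000 3 GAIN NA 1 3 GAIN NA
1 3000001 3500000 3 GAIN 0.0091 1 3 GAIN 2.7034467672483
2 1 500000 2 NEUT NA 0 2 NEUT NA
4 1 500000 2 NEUT NA 0 2 NEUT NA
"""


def parse_bins(text):
    """Turn the whitespace-separated table into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def split_by_logr(bins, sample="x1"):
    """Separate bins with a usable logR from bins whose logR is NA."""
    key = f"{sample}.logR"
    kept = [b for b in bins if b[key] != "NA"]
    dropped = [b for b in bins if b[key] == "NA"]
    return kept, dropped


bins = parse_bins(CNA_SEG)
kept, dropped = split_by_logr(bins)

# List the bins that carry no logR value.
for b in dropped:
    print(f"chr{b['chr']}:{b['start']}-{b['end']} has logR = NA")
```

The bins starting at position 1 on chromosomes 2 and 4 land in `dropped`, which is consistent with those chromosomes starting at 500001 in the .seg file.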

4) Am I correct in assuming that logR_Copy_Number in the .seg file is the median of the values from the .cna.seg file?

The logR_Copy_Number is the logR after correction for estimated tumor fraction and ploidy. The Corrected_Copy_Number is computed as round(logR_Copy_Number); this is the copy number result you should use.
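As a rough illustration of the two steps described above, the Python sketch below first checks the rounding relationship against the .seg rows pasted in this thread, then shows one common way a logR can be corrected for tumor fraction and ploidy. The correction formula is a generic two-component (tumor plus normal) mixture model and is an assumption here, not necessarily ichorCNA's exact implementation. The tumor fraction and ploidy values are hypothetical, since the thread does not report them.

```python
import math

# (logR_Copy_Number, Corrected_Copy_Number) pairs copied from the .seg rows in this thread.
SEG_ROWS = [
    (3.11423640535062, 3),
    (3.93033575436528, 4),
    (3.08354976292766, 3),
    (4.77866101549492, 5),
    (3.78576951498881, 4),
    (0.903546917922327, 1),
    (3.69113317156701, 4),
    (1.95357791771355, 2),
    (3.03746830367397, 3),
    (2.01623034982981, 2),
]

# Check whether rounding the continuous value gives the reported integer in each row.
rounding_matches = [round(value) == reported for value, reported in SEG_ROWS]
print(all(rounding_matches))


def copy_number_to_logr(copy_number, tumor_fraction, ploidy, normal_cn=2):
    """Forward model (assumed): expected logR of a tumor segment diluted by normal DNA."""
    segment_signal = tumor_fraction * copy_number + (1 - tumor_fraction) * normal_cn
    genome_average = tumor_fraction * ploidy + (1 - tumor_fraction) * normal_cn
    return math.log2(segment_signal / genome_average)


def logr_to_copy_number(log_r, tumor_fraction, ploidy, normal_cn=2):
    """Inverse of the forward model above: continuous tumor copy number from a logR."""
    genome_average = tumor_fraction * ploidy + (1 - tumor_fraction) * normal_cn
    segment_signal = (2 ** log_r) * genome_average
    return (segment_signal - (1 - tumor_fraction) * normal_cn) / tumor_fraction


# Hypothetical values for illustration only; use the estimates ichorCNA reports for your sample.
TUMOR_FRACTION = 0.3
PLOIDY = 2.5

# Round trip: a 3-copy segment is converted to a logR and then recovered from it.
log_r = copy_number_to_logr(3, TUMOR_FRACTION, PLOIDY)
recovered = logr_to_copy_number(log_r, TUMOR_FRACTION, PLOIDY)
print(round(recovered))
```

Under this model the same logR maps to a higher copy number when the tumor fraction is lower, which is why the uncorrected logR alone is not enough to assign a copy number.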

Hope this helps, Gavin

CuriusScientist commented 4 years ago

@gavinha Many thanks for the explanation.