WGLab / PennCNV

Copy number vaiation detection from SNP arrays
http://penncnv.openbioinformatics.org
Other
88 stars 53 forks source link

LOH / state4,cn=2 is never called #92

Closed Nicolai-vKuegelgen closed 1 year ago

Nicolai-vKuegelgen commented 1 year ago

I am currently trying to set up a (mostly) automated CNV calling pipeline including PennCNV. Installation (from source) worked without issue as does running the program itself.

However, when investigating the results for a large set of samples (>150), I never see even a single detected loss of heterozygostiy (LOH) state, which should have cn=2 (state4 in the hmm model afaik). I am working with data has been analysed before (manually with GenomeStudio), I mostly know what to expect (and gain/loss CNVs are matching these expectations).

This is the command I'm using: PennCNV-1.0.5/detect_cnv.pl -test -hmm PennCNV-1.0.5/lib/hhall.hmm -pfb compiled-samples.pfb array-data-tsv/*.tsv -out array-results/full_analysis_run.txt -log array-results/full_analysis_run.log -confidence

I've created my pfb file by running compile_pfb.pl on all samples. I imagine that this could maybe be an issue, as the samples are all from celllines, several of which of which might be derived from another. So overall heterogeneity between samples is likely lower than usual for a dataset of comparable size.

kaichop commented 1 year ago

To enable the state4, you need to modify the HMM file so that the expected LRR value is 0 for that state (right now it shows 100 so it will not be called). Most users are not interested in copy neutral LOH events so this is by default disabled.

On Mon, Sep 26, 2022 at 6:58 AM Nicolai von Kügelgen < @.***> wrote:

I am currently trying to set up a (mostly) automated CNV calling pipeline including PennCNV. Installation (from source) worked without issue as does running the program itself.

However, when investigating the results for a large set of samples (>150), I never see even a single detected loss of heterozygostiy (LOH) state, which should have cn=2 (state4 in the hmm model afaik). I am working with data has been analysed before (manually with GenomeStudio), I mostly know what to expect (and gain/loss CNVs are matching these expectations).

This is the command I'm using: PennCNV-1.0.5/detect_cnv.pl -test -hmm PennCNV-1.0.5/lib/hhall.hmm -pfb compiled-samples.pfb array-data-tsv/*.tsv -out array-results/full_analysis_run.txt -log array-results/full_analysis_run.log -confidence

I've created my pfb file by running compile_pfb.pl on all samples. I imagine that this could maybe be an issue, as the samples are all from celllines, several of which of which might be derived from another. So overall heterogeneity between samples is likely lower than usual for a dataset of comparable size.

— Reply to this email directly, view it on GitHub https://github.com/WGLab/PennCNV/issues/92, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OF52EHEHQO4WT4X3MTWAF6VFANCNFSM6AAAAAAQVVRFUI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

Nicolai-vKuegelgen commented 1 year ago

Thanks for that help. Sadly it did not work yet.

I've changed this line in the hmm file:

B1_mean:
-3.527211 -0.664184 0.000000 100.000000 0.395621 0.678345 

To this:

B1_mean:
-3.527211 -0.664184 0.000000 0.000000 0.395621 0.678345

Is there anything else I need to do? So far I still don't get any LOH calls.

kaichop commented 1 year ago

can you show the actual output (LOG and WARNING message for example) after running the command to help diagnose the issue.

Nicolai-vKuegelgen commented 1 year ago

This is what the log file shows for a single sample (there is little difference between samples, apart from exact numbers, some samples have a GC-waviness warning).

The 'WARNING: Unrecognizable LRR or BAF values are treated as zero:' line is repeated quite often when I'm using unfiltered input files, but even if I pre-filter the input I still don't get LOH calls.

NOTICE: All program notification/warning messages that appear in STDERR will be also written to log file array-results/full_analysis_run_loh.log
NOTICE: Reading marker coordinates and population frequency of B allele (PFB) from test.pfb ... Done with 677378 records (2888 records in chr 0,MT,XY were discarded)
NOTICE: Reading LRR and BAF values for from array-data-tsv/206674130142/206674130142_R11C01.tsv ...WARNING: Unrecognizable LRR or BAF values are treated as zero: LRR=-nan BAF=-nan
WARNING: Unrecognizable LRR or BAF values are treated as zero: LRR=-nan BAF=-nan
[...]
WARNING: Unrecognizable LRR or BAF values are treated as zero: LRR=-nan BAF=-nan
 Done with 677378 records in 24 chromosomes (52681 records are discarded due to lack of PFB information for the markers)
NOTICE: Data from chromosome X,Y will not be used in analysis
NOTICE: Median-adjusting LRR values for all autosome markers from array-data-tsv/206674130142/206674130142_R11C01.tsv by -0.0015
NOTICE: Median-adjusting BAF values for all autosome markers from array-data-tsv/206674130142/206674130142_R11C01.tsv by -0.0311
NOTICE: quality summary for array-data-tsv/206674130142/206674130142_R11C01.tsv: LRR_mean=0.0121 LRR_median=0.0000 LRR_SD=0.1472 BAF_mean=0.4991 BAF_median=0.5000 BAF_SD=0.0393 BAF_DRIFT=0.000288 WF=-0.0183 GCWF=-0.0108
kaichop commented 1 year ago

I do not see anything wrong in the LOG information, except the 52K records that are discarded. If you do not mind just emailing one example file (together with your PFB), I can do some diagnosis to see what is wrong. Also you may want to check whether your PFB file contains allele frequency spectrum from 0 to 1 (it is fine to have lower heterozygosity than typical sample collections), rather than being only 0 or 1.

Nicolai-vKuegelgen commented 1 year ago

The discarded records probes that don't have usable information - I will likely filter these out in the future before running analysis. The PFB file also looks proper from what I can tell (pfb values are continous between 0 and 1)

If you could look into this issue, that would be very much appreciated, I've checked and there should be no issue with sending you one of the files. I'll assume I can send the files to your chop.edu email?

Nicolai-vKuegelgen commented 1 year ago

Hey Kai,

I hope you got my email with the example files, if not please let me know so I can re-send it.

kaichop commented 1 year ago

I actually never received the example file. Please email a google drive link to @.*** directly. I checked chop.edu email and do not see it in inbox or spam, and I think microsoft just flagged and deleted the email without notifying me.

On Mon, Oct 17, 2022 at 4:14 AM Nicolai von Kügelgen < @.***> wrote:

Hey Kai,

I hope you got my email with the example files, if not please let me know so I can re-send it.

— Reply to this email directly, view it on GitHub https://github.com/WGLab/PennCNV/issues/92#issuecomment-1280462070, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OAHQTMKIC5J3AOOYWDWDUDG3ANCNFSM6AAAAAAQVVRFUI . You are receiving this because you commented.Message ID: @.***>

Nicolai-vKuegelgen commented 1 year ago

Since I already had an drive (like) link in my last email and the mail is censoered in comments here, I've now resend it to you gmail adress (from the github commit logs). Thanks!

kaichop commented 1 year ago

Yes I got it!

On Mon, Oct 24, 2022 at 11:27 AM Nicolai von Kügelgen < @.***> wrote:

Since I already had an drive (like) link in my last email and the mail is censoered in comments here, I've now resend it to you gmail adress (from the github commit logs). Thanks!

— Reply to this email directly, view it on GitHub https://github.com/WGLab/PennCNV/issues/92#issuecomment-1289209557, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNG3OHJZI6LUUEYXNTTL4TWE2TFFANCNFSM6AAAAAAQVVRFUI . You are receiving this because you commented.Message ID: @.***>

kaichop commented 1 year ago

I was able to generate LOH calls after editing 100 to 0 in the hhall.hmm file and after adding the "-loh" argument in command line. I forgot to mention the latter requirement.

./detect_cnv.pl -test -hmm lib/hhall.hmm -pfb /mounted/test.pfb /mounted/204362030005_R07C01.signal.txt -gcmodel test.gcmodel -loh

chr12:95690404-95944219 numsnp=113 length=253,816 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=rs2431014 endsnp=GSA-rs7486703 chr13:56535612-58592616 numsnp=212 length=2,057,005 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=GSA-rs189908706 endsnp=rs35573682 chr13:84478342-85583979 numsnp=176 length=1,105,638 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=rs9546709 endsnp=GSA-rs76945249 chr22:41209635-42122422 numsnp=295 length=912,788 state4,cn=2 /mounted/204362030005_R07C01.signal.txt startsnp=GSA-rs139438 endsnp=rs764481

I noted that there are lots of negative PFB values, which cannot be right. It should be made to positive. I am not sure how it happened but perhaps there are negative BAF values in your input files that were used to generate the PFB?

Also this particular sample is very "wavy" which you can see from the figure below. You can consider using gcmodel to adjust for waviness.

204362030005_R07C01 signal txt chr3 110110185

Nicolai-vKuegelgen commented 1 year ago

Thanks !

The missing -loh option must have been the reason why I didn't get any LOH calls.

I've also noticed that several of the samples I have are quity wavy, but I haven't gotten around to making the GCmodel files yet. The negative PFB values probably do come from negative BAF values (I didn't use genomestudio which apparently automatcially restricts BAF to [0,1] while the underlying algorythm doesn't).