ChristofferFlensburg / superFreq

Analysis pipeline for cancer sequencing data
MIT License

Somatic calls #1

Closed ramaniak closed 7 years ago

ramaniak commented 7 years ago

Hello, I have been playing with superFreq and have been running it with default parameters. For the initial testing I ran it on 2 cases, each with 3 samples (2 tumours and a paired normal). I am using a cohort of 7 normal samples for the pool of normals. All of the samples are exome data with a minimum depth of 200X. The software runs successfully without any errors (just a few warnings that I have not paid attention to).

I notice that in the resulting river plot, all of the clones are derived only from the CNV calls and not from any SNVs. I then see that the 'somatics' folder for each sample says: "No somatic mutations detected". I am not sure if I am doing something wrong with parameters or flags. Some of the somatic variants in these samples have been verified and I would have expected to see them in the results.

Would you be able to point me on how to rectify this?

On a related note, I was going through the river plot and the associated xls file (which I assume has the calls that were included in the river plot). How should I interpret cases where the same CNV appears twice? In the example below you will see two CNVs with the same start and end locations, where the only difference is the name.

chr  start    end       name           clone  clonality.A1  clonality.A2  clonality.A3
2    8998809  15884396  4Mbp AAB (3)   7      0.149859348   0.120431779   0.217749937
2    8998809  15884396  3Mbp 19AB (1)  7      0.149859348   0.120431779   0.217749937

thanks Arun

ChristofferFlensburg commented 7 years ago

Hey Ramaniak,

Thanks for reporting my first issue. :)

Missing somatic SNVs are often caused by cancer contamination in the matched normal or in the pool of reference normals. The easiest way to check is to look at the scatter plots between the cancer sample and the matched normal. A clean cancer-normal pair will have a line of red dots on the normal VAF = 0 axis, while cancer contamination will make the dots depart a bit from the axis. If the somatic SNVs are present in the reference normals they will be filtered; in that case look at the "flagged" scatter, where the filtered somatics show up as pale orange circles (filtered non-dbSNP). If you see any large CNA calls in the normal (well, except on the sex chromosomes), that's also a sign of an impure matched normal. If you think your normal has cancer contamination, simply switch off the normal flag in the meta data (set normal to "NO"). As long as the supposedly normal sample has low cancer content (below 50% or so), superFreq is pretty good at separating the cancer clone from the germline clone, and you'll easily be able to identify the cancer clone in the river.

If you don't see any somatic SNVs in the scatter (even filtered), but you are still convinced they exist, then they are missing in the input .vcfs.
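
One way to check that last point, assuming the verified variants are known by chromosome and position, is to scan the input VCF for those coordinates. The sketch below is a standalone Python helper, not part of superFreq; the function name and the example coordinates are hypothetical. It relies only on the standard VCF layout (tab-separated, CHROM in column 1, POS in column 2, header lines starting with #).

```python
import gzip


def find_missing_variants(vcf_path, expected):
    """Return the (chrom, pos) pairs from `expected` that never appear in the VCF.

    `expected` is an iterable of (chrom, pos) tuples, e.g. {("1", 12345)}.
    Chromosome names must use the same convention as the VCF ("2" vs "chr2").
    """
    remaining = set(expected)
    # Plain-text and gzip-compressed VCFs are both handled.
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # meta-information and column header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            try:
                pos = int(fields[1])
            except ValueError:
                continue
            remaining.discard((fields[0], pos))
            if not remaining:
                break  # every expected variant was found
    return remaining
```

An empty result means every expected position has a record in that VCF, so the variants are being lost later (for example by filtering) rather than missing from the input.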

The two CNA calls in the river output look buggy. The chr, start and end columns are identical, but the names assign them to chromosomes 3 and 1, with different sizes. That is not intended behaviour. I did have issues with the naming a while ago, so maybe you're not running the latest patch? If you're not on 0.9.17, maybe start by rerunning the clonal tracking (set forceRedo$forceRedoStories = T) on the new version. If that's not the issue I'd need to have a closer look at your output. Do you see these CNA calls in the CNA plots?

cheers, /Christoffer

ramaniak commented 7 years ago

Hello Christoffer, Thanks for the quick response and I am glad to be the author of your first issue :)

I did look through the scatter and see exactly what you mentioned...pale orange circles in the tumour vs normal scatter. I will try running the samples with the normal set to "NO". I believe there is tumour contamination in my panel of normals and that is the likely cause and I also see large CNAs in normals.

I checked the version I have and it is "superFreq_0.9.17". While the CNA calls don't appear in the river plot (one of the clones says 'and more...'), they are listed in the line plots that follow the river plot.

Will keep you posted on how the run with the normal set to "NO" proceeds.

thanks again

Cheers Arun


ChristofferFlensburg commented 7 years ago

So it seems like the somatics are filtered, which is probably caused by them being present in the reference normals. To fix that you need to remove the reference normal(s) with the somatic SNVs in them. superFreq looks for .bam files in normalDirectory/bam, so you can just add a .bak or whatever to the end of the offending reference normal .bam files so that superFreq doesn't find them. To rerun, also delete the R directory inside the normalDirectory (where the saved data from previous runs is stored), and rerun with forceRedo$forceRedoMatchFlag = T. If the (filtered) somatic SNVs are close to VAF = 0 in the matched normal, there is no need to relabel it as not normal in the metaData.
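
The renaming step above might be scripted as follows. This is a minimal sketch in Python, not something superFreq provides: the helper name and the sample names are hypothetical, and it assumes the layout described above (reference normal .bam files in normalDirectory/bam). Renaming any index file along with the .bam is an extra precaution that the comment above does not ask for. The sketch does not delete the R directory or set forceRedo; those steps still have to be done as described.

```python
from pathlib import Path


def hide_reference_normals(normal_directory, sample_names, suffix=".bak"):
    """Append `suffix` to <normal_directory>/bam/<name>.bam and any index beside it.

    Returns the list of renamed paths so the change is easy to undo.
    """
    bam_dir = Path(normal_directory) / "bam"
    renamed = []
    for name in sample_names:
        bam = bam_dir / f"{name}.bam"
        if not bam.exists():
            raise FileNotFoundError(f"no such reference normal: {bam}")
        # The .bam itself plus the two common index naming styles.
        candidates = (bam, bam_dir / f"{name}.bam.bai", bam_dir / f"{name}.bai")
        for path in candidates:
            if path.exists():
                target = path.with_name(path.name + suffix)
                path.rename(target)
                renamed.append(target)
    return renamed
```

For example, `hide_reference_normals("/path/to/normalDirectory", ["normal_A1"])` would leave `normal_A1.bam.bak` in the bam directory, where `normal_A1` stands in for whichever reference normal carries the somatic SNVs.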

Regarding the weird CNA calls, I'd need to have a closer look at the log and/or the output. :/ Maybe you can start by showing me the .log file in your R directory (or the individual's R subdirectory if you run in split mode). It contains the meta data of your run as well as some basic stats on the output, so if you don't feel comfortable sharing that here you can email it to me, or I can help you extract the relevant part.

cheers, /Christoffer

ramaniak commented 7 years ago

Hello,

Do I have to remove the reference normals? Your previous comment said that "As long as the supposedly normal sample has low cancer content (below 50% or so) superFreq is pretty good at picking the cancer clone from the germline clone, and you'll easily be able to identify the cancer clone in the river."

I am currently trying just by running with "NO" for normals. I am working with a set of cancers where it is quite difficult to obtain 100% normal tissue! Could you send me your email id to share the logs?

thanks Arun


ChristofferFlensburg commented 7 years ago

Relabeling your matched normal in the meta data will not address the issue of having the somatic SNVs in the pool of reference normals. They are two different problems. Somatic SNVs in the reference normals will make the SNVs get filtered (a non-dbSNP SNV present in the reference normals is most likely noise), and they will show up as dim orange circles, as you saw. To fix that, you need to identify and remove the reference normal(s) containing the somatic SNVs (most likely the normal from the same individual).

An impure matched normal (the one in the metadata and scatters) doesn't filter the variants, so they stay solid red dots, but they don't get marked as somatic if they are present at too high a frequency in the matched normal. You can see this from the dots departing from the axis in the scatter.

So they are two different problems with two different solutions. You said that your SNVs are filtered (orange circles), so you'll definitely have to remove the offending reference normal(s) from the pool. If your matched normal (the one in the meta data file) also has more than a few % of cancer content, then you probably want to relabel it with "NO" as well.

I've worked a lot with leukemia, so I perfectly understand your problem of not having good matched normals. It's the reason why I built the tool to support cases without a good matched normal to start with. :)

You can send to flensburg.c@wehi.edu.au.

ChristofferFlensburg commented 7 years ago

Just an update for the record. We tracked the issue to missing bam index files (.bai) for the reference normals, which didn't throw an error as intended, but instead decided to filter all the SNVs and go ahead with the CNAs (called from read depth only) alone in the clonal tracking. There is a stop() at analyse.R line 517 that should have triggered (the log entry from line 516 was present in the log file), so not immediately clear what went wrong. I aim to try to reproduce and fix the issue this or next week.

ChristofferFlensburg commented 7 years ago

I added a check at initialisation for the reference normal bam indexes, so it should exit early with a meaningful error message in this case. This was a few versions ago I think, I just forgot to close the issue. :)

Let me know if there are any further issues.
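
For anyone on an older version, the same condition can be checked by hand before a run. The following is a minimal Python sketch, not superFreq's own check (which lives in the R code): the helper name is hypothetical, and it assumes each index sits next to its .bam as either sample.bam.bai or sample.bai, the two common naming conventions.

```python
from pathlib import Path


def bams_missing_index(bam_dir):
    """Return the .bam files in `bam_dir` that have no .bai index beside them."""
    missing = []
    for bam in sorted(Path(bam_dir).glob("*.bam")):
        candidates = (
            bam.with_name(bam.name + ".bai"),  # sample.bam.bai
            bam.with_suffix(".bai"),           # sample.bai
        )
        if not any(candidate.exists() for candidate in candidates):
            missing.append(bam)
    return missing
```

Pointing it at the bam directory of the reference normals and getting an empty list back means every .bam there has an index next to it.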