PapenfussLab / gridss

GRIDSS: the Genomic Rearrangement IDentification Software Suite
Other
255 stars 71 forks source link

which R version should I install and what's the detailed process of somatic.R? #118

Closed WinterLi1993 closed 6 years ago

WinterLi1993 commented 6 years ago

Hi, I have tried R 3.4.4 version, but can not install the R dependencies of Gridss ? could you please provide the python version of somatic.R at the "example" folder ? or give me some details about how to filter the origin somatic vcf . thank you !

d-cameron commented 6 years ago

GRIDSS itself only requires base R. The example script uses by StructuralVariantAnnotation to filter the GRIDSS output.

I am unable to assist in installing R dependencies of StructuralVariantAnnotation if no information regarding the error(s) encountered as supplied. I have previously had trouble updating some of the upstream BioConductor packages but these were resolved with a clean install. Have you tried that?

install.package(c("stringr", "devtools"))
source("https://bioconductor.org/biocLite.R")
biocLite("VariantAnnotation")
install_github("PapenfussLab/StructuralVariantAnnotation")
d-cameron commented 6 years ago

or give me some details about how to filter the origin somatic vcf

Somatic calls will have the genotype QUAL field set to 0 in the normal. You should also apply a minimum QUAL threshold (threshold depends on your coverage) as just using normal QUAL = 0 will germline calls that just happen to have no coverage in the normal.

Somatic LoH calls will have the genotype QUAL field set to 0 in the tumour (somatic LoH variants occur on somatically deleted haplotypes).

WinterLi1993 commented 6 years ago

@d-cameron
for somatic calling : the two breakpoints should be : chr1: 154142762 ,chr6:117642093, but I check the gridss result ,I got these result : chr1 154142762 gridss15_2002o A ]chr6:117642092]A 52712.28 NO_ASSEMBLY chr6 117642092 gridss15_2002h C C[chr1:154142762[ 49799.37 NO_ASSEMBLY

the position is less than 1bp ,,,,the gridss result is wrong ?? if not ,how to convert gridss result into the format of my result ?

thank you!

WinterLi1993 commented 6 years ago

@d-cameron
for somatic calling situation : if use somatic.R , the two breakpoints will be filtered because the FILTER flag is "NO_ASSEMBLY", not [ . ,PASS] . but the two breakpoints is true positive in this sample ... we need to change the method to filter for gridss ?

WinterLi1993 commented 6 years ago

@d-cameron I checked your somatic.R souce code ,I find you to filter low quality calls by this code : vcf <- vcf[rowRanges(vcf)$FILTER %in% c(".", "PASS"),] why not to keep SINGLE_ASSEMBLY and SINGLE_SUPPORT ??

WinterLi1993 commented 6 years ago

@d-cameron when I try your command , errors report : Error: package or namespace load failed for ‘RCurl’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/usr/lib64/R/library/RCurl/libs/RCurl.so': libicui18n.so.58: cannot open shared object file: No such file or directory

d-cameron commented 6 years ago

Error: package or namespace load failed for ‘RCurl’

I recommend trying StackOverflow for such errors as I am not very familar with the RCurl package.

d-cameron commented 6 years ago

why not to keep SINGLE_ASSEMBLY and SINGLE_SUPPORT ??

Depends on how conservative you want your call set to be. SINGLE_ASSEMBLY is usually fine but I've found NO_ASSEMBLY to have a higher false positive rate.

d-cameron commented 6 years ago

for somatic calling :

the two breakpoints should be : chr1: 154142762 ,chr6:117642093, but I check the gridss result ,I got these result : chr1 154142762 gridss15_2002o A ]chr6:117642092]A 52712.28 NO_ASSEMBLY chr6 117642092 gridss15_2002h C C[chr1:154142762[ 49799.37 NO_ASSEMBLY

The GRIDSS result is correct. I strongly recommend reading Section 5.4 of the VCF file format specifications as that describes the VCF breakpoint notation used by GRIDSS.

the position is less than 1bp ,,,,the gridss result is wrong ??

This just means that GRIDSS is calling an exact position (unless of course, CIPOS is also specified).

d-cameron commented 6 years ago

but the two breakpoints is true positive in this sample ... chr1 154142762 gridss15_2002o A ]chr6:117642092]A 52712.28 NO_ASSEMBLY we need to change the method to filter for gridss ?

What sequencing depth do you have? A qual score of 50000 is extremely high so it's very likely that there's no assembly because your coverage is so deep and GRIDSS aborted the assembly attempt to due excessive coverage.

In your case, I would definitely change the filtering criteria to include NO_ASSEMBLY calls.

WinterLi1993 commented 6 years ago

"but the two breakpoints is true positive in this sample ... chr1 154142762 gridss15_2002o A ]chr6:117642092]A 52712.28 NO_ASSEMBLY <we need to change the method to filter for gridss ?" What sequencing depth do you have? A qual score of 50000 is extremely high so it's very likely that there's no assembly because your coverage is so deep and GRIDSS aborted the assembly attempt to due excessive coverage. In your case, I would definitely change the filtering criteria to include NO_ASSEMBLY calls.> @d-cameron
tumor sample average depth is 2782 ,and normal sample is 2134 . the somatic.R of gridss will filter this true variant , so what should I do to keep this true variant ? I should change the filter process ?? could you please give some details to filter ? thank you !

WinterLi1993 commented 6 years ago

“for somatic calling : the two breakpoints should be : chr1: 154142762 ,chr6:117642093, but I check the gridss result ,I got these result : chr1 154142762 gridss15_2002o A ]chr6:117642092]A 52712.28 NO_ASSEMBLY chr6 117642092 gridss15_2002h C C[chr1:154142762[ 49799.37 NO_ASSEMBLY The GRIDSS result is correct. I strongly recommend reading Section 5.4 of the VCF file format specifications as that describes the VCF breakpoint notation used by GRIDSS. the position is less than 1bp ,,,,the gridss result is wrong ?? ”

《This just means that GRIDSS is calling an exact position (unless of course, CIPOS is also specified).》 @d-cameron
chr1 2489026 gridss0_538o ]chr1:2489085]G CIPOS=0,1;CIRPOS=-1,0 ; chr1 2489085 gridss0_538h G[chr1:2489026[ CIPOS=-1,0;CIRPOS=0,1;

CIPOS has two number : 0 and 1 , what does the 0 means ? what does the 1 means ?

thank you very much !

WinterLi1993 commented 6 years ago

"or give me some details about how to filter the origin somatic vcf" 《Somatic calls will have the genotype QUAL field set to 0 in the normal. You should also apply a minimum QUAL threshold (threshold depends on your coverage) as just using normal QUAL = 0 will germline calls that just happen to have no coverage in the normal. Somatic LoH calls will have the genotype QUAL field set to 0 in the tumour (somatic LoH variants occur on somatically deleted haplotypes). 》 @d-cameron could you please provide me some details the relationship between Coverage and Qual field set in normal ?for example , if coverage is 1000, Qual should be 0 .

d-cameron commented 6 years ago

could you please provide me some details the relationship between Coverage and Qual field set in normal ?for example , if coverage is 1000, Qual should be 0 .

The GRIDSS manuscript (http://genome.cshlp.org/content/early/2017/11/02/gr.222109.117.abstract open access) provides details on how QUAL is calculated. There is a linear relationship between increased coverage and increased QUAL.

d-cameron commented 6 years ago

CIPOS has two number : 0 and 1 , what does the 0 means ? what does the 1 means ?

This is covered in Section 3 of the VCF specifications. In this case, a CIPOS interval of [0,1] indicates a 1 base pair micro-homology.

d-cameron commented 6 years ago

I should change the filter process ?? could you please give some details to filter ?

You have very deep coverage and should definitely adapt the filter criteria to better match your data. I can't give you exact details of exactly what filtering you should use as that will depends on your experimental design and the scientific question you are attempting to answer.

I recommend plotting tumour vs normal QUAL (log vs log) to see how well separated your somatic calls are. At such high depth, tumour contamination in your normal may be visible. Setting an allowable contamination level (e.g. 3%) on somatic calling my be prudent.

Low VAF calls are more likely to be FPs. That said, given you've sequenced to 3000x, I expect that you're probably interested in tumour heterogeneity and subclonality.

The GRIDSS assembly process downsamples high coverage regions then abandons assembly when sufficiently complex. Since this complexity check is performed prior to error correction, it's likely that many (most?) of your true positives will have NO_ASSEMBLY because at that depth, you're likely to have enough sequencing for the assembly to consider the region a problematic one. In your case, I recommend ignoring the assembly-related VCF FILTERs.

Have a look at the GRIDSS VCF header. GRIDSS outputs a lot of information about each of the variant calls. Some of these fields may be very relevant to your particular analysis.

Good luck :)

d-cameron commented 6 years ago

My mistake. It was the same file as the VAF function, I just didn't include it. Here is an adapted version of the somatic filtering I am using for a WGS pan-cancer project I am involved in:

gridss.short_deldup_size_threshold = 1000
gridss.allowable_normal_contamination=0.03
gridss.use_read_threshold=3
gridss.max_homology_length = 50
#' For each GRanges breakend, indicates whether the variant
#' should be filtered
#' @param somatic_filters apply somatic filters.
#' Assumes the normal and tumour samples are the first and second respectively
gridss_filter = function(gr, vcf, somatic_filters=TRUE) {
    vcf = vcf[names(gr)]
    i = info(vcf)
    g = geno(vcf)
    isShort = is_short_deldup(gr)
    ihomlen = rowSums(abs(as.matrix(info(vcf)$IHOMPOS)))
    ihomlen[is.na(ihomlen)] = 0
    filtered = str_detect(gr$FILTER, "NO_ASSEMBLY") |
        str_detect(gr$FILTER, "LOW_QUAL") | # exactly the same as QUAL >= 500
        ihomlen > gridss.max_homology_length |
        (!isShort & i$RP == 0) |
        (isShort & i$SR == 0)
    if (somatic_filters) {
        # see https://github.com/PapenfussLab/gridss/issues/114 for why we can't default to using QUAL
        support = g$SR
        rpsupport = g$RP
        rpsupport[as.logical(is_short_deldup(gr))] = 0
        support = support + rpsupport
        has_sufficient_read_support = rowSums(support) >= gridss.use_read_threshold
        filtered = filtered |
            (has_sufficient_read_support & support[,1] > 0.03 * support[,2]) |
            (!has_sufficient_read_support & g$QUAL[,1] > 0.03 * g$QUAL[,2])
            #(isShort & g$SR[,1] != 0)
    }
    return(as.logical(filtered))
}
is_short_deldup = function(gr) {
    strand(gr) != strand(partner(gr)) &
        seqnames(gr) == seqnames(partner(gr)) &
        abs(start(gr)-start(partner(gr))) < gridss.short_deldup_size_threshold
}

The gr parameter is called by gr = breakpointRanges(vcf) (or a any subset thereof) and the result is a vector of the GRanges breakend filters.

d-cameron commented 6 years ago

I also recommend running gridss with the CONFIGURATION_FILE=gridss.properties command line option with the following contents of gridss.properties:

# The default max of 10000 is way too low for 3000x sequencing - any decent somatic copy number gains will push the coverage above 10000
maxCoverage = 100000
WinterLi1993 commented 6 years ago

@d-cameron you means just like this ??:
java -ea -Xmx31g \ -Dsamjdk.create_index=true \ -Dsamjdk.use_async_io_read_samtools=true \ -Dsamjdk.use_async_io_write_samtools=true \ -Dsamjdk.use_async_io_write_tribble=true \ -Dsamjdk.compression_level=1 \ -cp $GRIDSS_JAR gridss.CallVariants \ CONFIGURATION_FILE=gridss.properties \ maxCoverage = 100000 \ .... it is right ?

WinterLi1993 commented 6 years ago

@d-cameron for filtering function. my project is target panel sequencing. sample type is cfDNA(tumor) , or tissue(tumor) , bloodDNA(normal) , tissue(normal). your somatic filtering function is suitable ? or should I adjust some parameters ? @d-cameron

thanks !

d-cameron commented 6 years ago

it is right ?

CONFIGURATION_FILE=./gridss.properties is the command-line argument.

gridss.properties is a file that needs to be created. The default gridss.properties configuration can be found at https://github.com/PapenfussLab/gridss/blob/master/src/main/resources/gridss.properties . Only options that differ from the default need be specified in the custom gridss.properties file.

WinterLi1993 commented 6 years ago

@d-cameron

I check your filtering code , str_detect(gr$FILTER, "NO_ASSEMBLY") you will filter the record marked by "NO_ASSEMBLY". but in fact , for high coverage sequencing ,this record should be not filtered as I said and posted before . So something need to be changed ?

d-cameron commented 6 years ago

So something need to be changed ?

Yes, many things will be to be changed. I will not be changing the example to suit your analysis since your analysis is not typical. You will need to write your own custom analysis scripts. The example somatic.R is a good starting point, but you will need to write your own filtering code that is suitable to your data and application.

WinterLi1993 commented 6 years ago

@d-cameron is_short_deldup = function(gr) { strand(gr) != strand(partner(gr)) & seqnames(gr) == seqnames(partner(gr)) & abs(start(gr)-start(partner(gr))) < gridss.short_deldup_size_threshold }

Hi, related R package is not installed successfully. So I wanna know the detailed meaning of the code above . how to get the value of "strand(gr) " and "strand(partner(gr))" , which flag ?

thank you in advance ! .

d-cameron commented 6 years ago

how to get the value of "strand(gr) " and "strand(partner(gr))" , which flag ?

This must be parsed from the ALT field using VCF breakend notation. See Section 5.4 of https://samtools.github.io/hts-specs/VCFv4.3.pdf

WinterLi1993 commented 6 years ago

@d-cameron I have checked VCF format . could you please show me an example , or the source code of 'strand' ? I wanna know the strand function how to work .

WinterLi1993 commented 6 years ago

@d-cameron I guess in your result :
deletions are always +-, dups are -+ and inversions are ++ or --.

d-cameron commented 6 years ago

Yes, that interpretation of + and - I have used in the GRanges strand field in my StructuralVariantAnnotation annotation package.

WinterLi1993 commented 6 years ago

@d-cameron I remember that I have told you your StructuralVariantAnnotation annotation package can not install successfully in my system . So I wanna know how to judge strand(gr) and strand(partner(gr)). I wanna reimplement your strand function .