Illumina / GTCtoVCF

Script to convert GTC/BPM files to VCF
Apache License 2.0
41 stars 30 forks source link

Merging B allele frequency and log R ratio for multiple BPMRecords #19

Closed brianhill11 closed 5 years ago

brianhill11 commented 6 years ago

Hi there,

I'm working on extending your code to allow for the extraction of the B allele frequency and log R ratio as additional FORMAT fields in the VCF file (potential solution for #14 ). I was wondering what your recommended method would be for merging multiple measurements (i.e. B allele frequency and log R ratio) at a particular location?

For context, here is sample code:

def generate_sample_format_info(self, bpm_records, vcf_record, sample_name):
        """
        Get the sample b allele frequency
        Args:
        Returns:
            float: B allele frequency
        """
        # TODO: we need to handle the case where there are multiple BPMs
        idx = bpm_records[0].index_num
        return self._b_allele_freq[idx]

The above code naively returns the B allele frequency of the first BPMRecord, but how should they be merged ideally?

Also, I would be happy to open a pull request if you would like to include this functionality once we've figured out how best to merge these records. Adding this additional information to the VCF is specified using an additional command line parameter. Please let me know if you would be interested.

Thanks, Brian

KelleyRyanM commented 6 years ago

Hi Brian, I'd certainly be interested in working with you on a pull request. The issue of how to merge records is a little bit tricky. In practice, the "B" allele of the different records need not reference the same nucleotide, so that case needs to be considered. Could you give a little bit more detail on your use case and how you would like to use this information downstream? I wonder if we could report a "reference allele" frequency instead of a "b allele" frequency, which may be more clearly defined. -Ryan

On Mon, Jul 30, 2018 at 11:44 AM Brian Hill notifications@github.com wrote:

Hi there,

I'm working on extending your code to allow for the extraction of the B allele frequency and log R ratio as additional FORMAT fields in the VCF file. I was wondering what your recommended method would be for merging multiple measurements (i.e. B allele frequency and log R ratio) at a particular location?

For context, here is sample code:

def generate_sample_format_info(self, bpm_records, vcf_record, sample_name): """ Get the sample b allele frequency Args: Returns: float: B allele frequency """

TODO: we need to handle the case where there are multiple BPMs

    idx = bpm_records[0].index_num
    return self._b_allele_freq[idx]

The above code naively returns the B allele frequency of the first BPMRecord, but how should they be merged ideally?

Also, I would be happy to open a pull request if you would like to include this functionality once we've figured out how best to merge these records. Adding this additional information to the VCF is specified using an additional command line parameter. Please let me know if you would be interested.

Thanks, Brian

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/19, or mute the thread https://github.com/notifications/unsubscribe-auth/ADUo8FB7RRgHHpcFXH-UrFWLVlJvEGYcks5uL1P2gaJpZM4Vm9cG .

brianhill11 commented 6 years ago

Hi Ryan,

Thanks for the quick response. We would like to do CNV calling in the future, so our goal was to capture this information in the VCF file.

When merging multiple records for creating a GT call, it seems you do some filtering of inconsistent genotypes, and unless there is only one consistent genotype, you return a "no-call". Are these sites with inconsistent genotypes the same sites as you described above?

Thanks, Brian

KelleyRyanM commented 6 years ago

Hi Brian, These would be a different scenario. The "B allele" frequency reported from the GTC is just that, the frequency of the B allele. The manifest associates the A and B alleles with specific nucleotides; however, that association does not need to be the same between different manifest entries. For example, you could have var1 chr1 1000 [C/G] <- B allele frequency refers to frequency of G var2 chr1 1000 [G/C] <- B allele frequency refers to frequency of C For that reason, it may be helpful to first convert the B allele frequency to the frequency of the reference allele before trying to combine information across multiple assays. -Ryan

On Tue, Jul 31, 2018 at 12:46 PM Brian Hill notifications@github.com wrote:

Hi Ryan,

Thanks for the quick response. We would like to do CNV calling in the future, so our goal was to capture this information in the VCF file.

When merging multiple records for creating a GT call, it seems you do some filtering of inconsistent genotypes, and unless there is only one consistent genotype, you return a "no-call". Are these sites with inconsistent genotypes the same sites as you described above?

Thanks, Brian

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/19#issuecomment-409344986, or mute the thread https://github.com/notifications/unsubscribe-auth/ADUo8EGDqcgEXbb1xarj2ILNLU6Q18-4ks5uMLQygaJpZM4Vm9cG .

brianhill11 commented 6 years ago

Hi Ryan,

Thanks for the example, that makes complete sense. I can take a stab at coding this up. If you have any tips/suggestions, they would be very much appreciated.

Thanks, Brian

brianhill11 commented 6 years ago

Hi Ryan,

I've come across several situations where there are multiple BPM Records, and I was hoping you could help me figure out the best way to merge these records.

Situation 1: same allele, multiple (similar) measurements Example: chr: 6 pos: 145868996 ref: C snp: [T/C] BAF: 0.550747513771 chr: 6 pos: 145868996 ref: C snp: [T/C] BAF: 0.468335241079 This seems to be the simplest example I can find. Should we just average values when we have multiple records? Or what is your suggest method for merging?

Situation 2: same allele, multiple (different) measurements Example: chr: 6 pos: 142630697 ref: C snp: [G/C] BAF: 0.0 chr: 6 pos: 142630697 ref: C snp: [T/C] BAF: 0.991057872772 In this case, I believe we're talking about the same B allele, but the frequencies seem to be very different. Am I reading this wrong? Or is this an erroneous set of measurements?

Situation 3: same A allele, different B allele Example: chr: 6 pos: 35467877 ref: A snp: [A/T] BAF: 0.0 chr: 6 pos: 35467877 ref: A snp: [A/G] BAF: 0.00173915899359 What should we do when there are measurements for two (or more) different B alleles?

Situation 4: different A allele, different B allele Example: chr: 6 pos: 49580247 ref: C snp: [T/C] BAF: 0.999555706978 chr: 6 pos: 49580247 ref: C snp: [A/G] BAF: 1.0 In Situation 3, the A allele matches the reference allele, but in Situation 4, neither A allele matches the reference allele, and the B alleles don't match each other. What should be done in this case?

Situation 5: ? Example: chr: 3 pos: 186575392 ref: G snp: [T/C] BAF: 0.99374049902 chr: 3 pos: 186575392 ref: G snp: [C/G] BAF: 0.521197974682 I'm not quite sure what's going on here, could you help explain?

My guess is that there are other corner-cases, but this is just the subset I've seen so far. If you can think of any other situations, that would be helpful as well. Thanks for your time and help with this, and please let me know if you have any questions for me.

Thanks, Brian

KelleyRyanM commented 6 years ago

In general, I'd probably recommend a median averaging to guard against outliers.

For the situations you've noted, can you first confirm whether the SNPs noted are all on the same strand? I think it would be important to first normalize to the plus strand before trying to compare the B allele frequencies. For example, my guess (?) is that in your situation #4, these SNPs are given on opposite strands. I expect there are some additional complexities in the cases you've noted, but would first be good to clarify the strand issue. -Ryan

On Tue, Aug 7, 2018 at 12:57 PM Brian Hill notifications@github.com wrote:

Hi Ryan,

I've come across several situations where there are multiple BPM Records, and I was hoping you could help me figure out the best way to merge these records.

Situation 1: same allele, multiple (similar) measurements Example: chr: 6 pos: 145868996 ref: C snp: [T/C] BAF: 0.550747513771 chr: 6 pos: 145868996 ref: C snp: [T/C] BAF: 0.468335241079 This seems to be the simplest example I can find. Should we just average values when we have multiple records? Or what is your suggest method for merging?

Situation 2: same allele, multiple (different) measurements Example: chr: 6 pos: 142630697 ref: C snp: [G/C] BAF: 0.0 chr: 6 pos: 142630697 ref: C snp: [T/C] BAF: 0.991057872772 In this case, I believe we're talking about the same B allele, but the frequencies seem to be very different. Am I reading this wrong? Or is this an erroneous set of measurements?

Situation 3: same A allele, different B allele Example: chr: 6 pos: 35467877 ref: A snp: [A/T] BAF: 0.0 chr: 6 pos: 35467877 ref: A snp: [A/G] BAF: 0.00173915899359 What should we do when there are measurements for two (or more) different B alleles?

Situation 4: different A allele, different B allele Example: chr: 6 pos: 49580247 ref: C snp: [T/C] BAF: 0.999555706978 chr: 6 pos: 49580247 ref: C snp: [A/G] BAF: 1.0 In Situation 3, the A allele matches the reference allele, but in Situation 4, neither A allele matches the reference allele, and the B alleles don't match each other. What should be done in this case?

Situation 5: ? Example: chr: 3 pos: 186575392 ref: G snp: [T/C] BAF: 0.99374049902 chr: 3 pos: 186575392 ref: G snp: [C/G] BAF: 0.521197974682 I'm not quite sure what's going on here, could you help explain?

My guess is that there are other corner-cases, but this is just the subset I've seen so far. If you can think of any other situations, that would be helpful as well. Thanks for your time and help with this, and please let me know if you have any questions for me.

Thanks, Brian

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/19#issuecomment-411181948, or mute the thread https://github.com/notifications/unsubscribe-auth/ADUo8CVBkldih8ZaBz_4_9B1K1Ntz2pDks5uOfE9gaJpZM4Vm9cG .

brianhill11 commented 6 years ago

Hi Ryan,

Thanks again for the help. I've added strand to the output now, and I think your guess was correct. I think normalizing to plus strand should fix Situation 2 and 4.

Situation 1 chr: 6 pos: 145868996 ref: C snp: [T/C] strand: 1 BAF: 0.550747513771 chr: 6 pos: 145868996 ref: C snp: [T/C] strand: 1 BAF: 0.468335241079

Situation 2 chr: 6 pos: 142630697 ref: C snp: [G/C] strand: 2 BAF: 0.0 chr: 6 pos: 142630697 ref: C snp: [T/C] strand: 1 BAF: 0.991057872772

Should normalize to: chr: 6 pos: 142630697 ref: C snp: [G/C -> C/G] strand: 2->1 BAF: 0.0 ?

Situation 3 chr: 6 pos: 35467877 ref: A snp: [A/T] strand: 1 BAF: 0.0 chr: 6 pos: 35467877 ref: A snp: [A/G] strand: 1 BAF: 0.00173915899359

Situation 4 chr: 6 pos: 49580247 ref: C snp: [T/C] strand: 1 BAF: 0.999555706978 chr: 6 pos: 49580247 ref: C snp: [A/G] strand: 2 BAF: 1.0

Should normalize to: chr: 6 pos: 49580247 ref: C snp: [A/G -> T/C] strand: 2->1 BAF: 1.0 ?

Situation 5 chr: 3 pos: 186575392 ref: G snp: [T/C] strand: 2 BAF: 0.99374049902 chr: 3 pos: 186575392 ref: G snp: [C/G] strand: 2 BAF: 0.521197974682

Any ideas about Situation 3 and 5?

Thanks, Brian

KelleyRyanM commented 6 years ago

In situations 3 and 5, the locus is tri-allelic. While I could imagine some potential solutions to this case, it would add a fair amount of complexity. In most array designs, these situations are rare enough that it's likely to not be worth the trouble. At least as a first pass, I would return a no-value (".") for these. -Ryan

On Wed, Aug 8, 2018 at 11:11 AM Brian Hill notifications@github.com wrote:

Hi Ryan,

Thanks again for the help. I've added strand to the output now, and I think your guess was correct. I think normalizing to plus strand should fix Situation 2 and 4.

Situation 1 chr: 6 pos: 145868996 ref: C snp: [T/C] strand: 1 BAF: 0.550747513771 chr: 6 pos: 145868996 ref: C snp: [T/C] strand: 1 BAF: 0.468335241079

Situation 2 chr: 6 pos: 142630697 ref: C snp: [G/C] strand: 2 BAF: 0.0 chr: 6 pos: 142630697 ref: C snp: [T/C] strand: 1 BAF: 0.991057872772

Should normalize to: chr: 6 pos: 142630697 ref: C snp: [G/C -> C/G] strand: 2->1 BAF: 0.0 ?

Situation 3 chr: 6 pos: 35467877 ref: A snp: [A/T] strand: 1 BAF: 0.0 chr: 6 pos: 35467877 ref: A snp: [A/G] strand: 1 BAF: 0.00173915899359

Situation 4 chr: 6 pos: 49580247 ref: C snp: [T/C] strand: 1 BAF: 0.999555706978 chr: 6 pos: 49580247 ref: C snp: [A/G] strand: 2 BAF: 1.0

Should normalize to: chr: 6 pos: 49580247 ref: C snp: [A/G -> T/C] strand: 2->1 BAF: 1.0 ?

Situation 5 chr: 3 pos: 186575392 ref: G snp: [T/C] strand: 2 BAF: 0.99374049902 chr: 3 pos: 186575392 ref: G snp: [C/G] strand: 2 BAF: 0.521197974682

Any ideas about Situation 3 and 5?

Thanks, Brian

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/19#issuecomment-411500716, or mute the thread https://github.com/notifications/unsubscribe-auth/ADUo8CvPvNW6geOtz5q7u9MeiVPUlFxSks5uOynNgaJpZM4Vm9cG .

brianhill11 commented 6 years ago

Hi Ryan,

I have a first round of implementation for the B allele frequency here

Should a similar approach be taken for the log R ratio?

Thanks, Brian

KelleyRyanM commented 6 years ago

That looks reasonable to me. I have some minor comments, but that could wait for the pull request. The log R ratio should be similar, but you shouldn't need to worry about translation to the reference allele.

On Fri, Aug 10, 2018 at 5:58 PM Brian Hill notifications@github.com wrote:

Hi Ryan,

I have a first round of implementation for the B allele frequency here https://github.com/OmicsDataAutomation/GTCtoVCF/blob/additional_fields/BAlleleFreqFormat.py

Should a similar approach be taken for the log R ratio?

Thanks, Brian

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Illumina/GTCtoVCF/issues/19#issuecomment-412239737, or mute the thread https://github.com/notifications/unsubscribe-auth/ADUo8BdIJcG2yPqqODk4dqWYYnNNCU8Uks5uPixTgaJpZM4Vm9cG .

brianhill11 commented 6 years ago

Hi Ryan,

I added in code for the log R ratio, and opened up a pull request #20 . Please let me know if you have questions/concerns, and I'll work out any changes you would like.

Thanks, Brian

jjzieve commented 5 years ago

Log R Ratio and B Allele Frequence code is merged in 1.2.0, thanks @brianhill11 for the PR and @KelleyRyanM for advice