Clinical-Genomics / genotype

Simple genotype comparison of VCF files
http://taboo.readthedocs.org/en/latest/
MIT License
8 stars 2 forks source link

Not matching CG sequencing no-call (unknown) variants with MAF results #69

Open SofiaOtero opened 1 year ago

SofiaOtero commented 1 year ago

Description

During the investigation of MAF results we noticed that Genotype is only showing the MAF variant calls and not the sequence calls from our bcf file. Examples of this is described in sheet. In sheet there are examples of samples from out bcl files with many no-calls, which do not have many no-calls from MAF.

Solution

Update the sequence - sequence matching in genotype so we can see the actual amount of no-calls (Unknown) that are present in our bcf files.

Screenshot 2023-09-15 at 12 50 55

In the actual plates it is also documented that the sample has the amount of matches based on the calls from MAF and not the actual matches to our sequences. This would cause that we would have a lot of samples that do not pass the threshold of max 15 no-calls that we have right now. So it should be considered if we should make an unknown column for MAF and for our calls and if we will pass the sample based on solely the MAF results, because as for now were are passing all samples that have no-calls < 15 from MAF.

Screenshot 2023-09-15 at 12 47 53

This can be closed when

Genotype has been updated and we decide if samples pass based on only MAF results as currently.

SofiaOtero commented 1 year ago

@beatrizsavinhas and I have been looking into the snps that are not present in our bcf file but is present in the MAF results for 3 samples and here it shows that the variants that are not present in our file are REF/REF calls from MAF.

Apparently we do not report homozygous reference calls REF/REF in our file. Could be nice to check how this file is created and why we do not add the variant calls for REF/REF.

Vince-janv commented 1 year ago

@beatrizsavinhas and I have been looking into the snps that are not present in our bcf file but is present in the MAF results for 3 samples and here it shows that the variants that are not present in our file are REF/REF calls from MAF.

Apparently we do not report homozygous reference calls REF/REF in our file. Could be nice to check how this file is created and why we do not add the variant calls for REF/REF.

Thanks for summarising all this info @SofiaOtero and @beatrizsavinhas! Variant callers (which create the VCFs) only reports locations where the genome deviates from the reference. So no REF/REF positions are found in the VCFs. This basically means that genotype trusts the variant callers and if nothing is reported in the VCF, genotype records it as REF/REF. One can argue whether this is ideal or not.

In the description could you describe when this becomes an issue, how often, and how much time it takes you to verify stuff manually or fix it manually? It will make it easier for us to weigh this issue against others 😊

SofiaOtero commented 1 year ago

That is correct that is what genotype assumes now.

It is not something that has a huge impact on production, but it is not correct to just state it is a match when we are not completely sure about it. We can extract the REF/REF calls from the bam files. Could be a solution to make a genotype_vcf file where we extract all 51 snps that we send to MAF, then we can compare this file with the MAF calls, and see if it is actually a match or if the REF/REF in our sequence did actually not get any reads.

Vince-janv commented 1 year ago

TODO

Vince-janv commented 1 year ago

@SofiaOtero @beatrizsavinhas. I've looked at the code and it should be possible to upload a VCF file containing REF/REF. If you can produce one we can try and upload it. However I do not know the changes needed in the pipelines to produce such a file.

SofiaOtero commented 1 year ago

Is it in genotype or genotype-api that it is created?

Vince-janv commented 1 year ago

In genotype. Here are the relevant lines: https://github.com/Clinical-Genomics/genotype/blob/72c405b6dbaa198d9b724135dcac6307bf47ca87/genotype/load/vcf.py#L28-L43

SofiaOtero commented 1 year ago

Seems like the REF/REF are actually saved if I run the test on:

def load_vcf(vcf_file: str, snps: List[SNP]) -> List[Analysis]:
    """Load genotypes from a BCF/VCF.gz file."""
    vcf = VCF(vcf_file)
    # generate Analysis records for each included sample
    source = os.path.abspath(vcf_file)
    analyses = {
        sample_id: Analysis(type="sequence", source=source, sample_id=sample_id)
        for sample_id in vcf.samples
    }
    for snp in snps:
        variant = fetch_snp(vcf, snp)
        if variant is None:
            # assume REF/REF
            for analysis in itervalues(analyses):
                genotype = Genotype(rsnumber=snp.id, allele_1=snp.ref, allele_2=snp.ref)
                analysis.genotypes.append(genotype)
        else:
            raw_genotypes = variant_genotypes(vcf.samples, variant)
            for raw_gt in raw_genotypes:
                genotype = Genotype(
                    rsnumber=snp.id, allele_1=raw_gt.allele_1, allele_2=raw_gt.allele_2
                )
                analysis = analyses[raw_gt.sample]
                analysis.genotypes.append(genotype)

    for analysis in itervalues(analyses):
        print(f"Sample ID: {analysis.sample_id}")
        for genotype in analysis.genotypes:
            print(
                f"RS Number: {genotype.rsnumber}, Allele 1: {genotype.allele_1}, Allele 2: {genotype.allele_2}"
            )

    return itervalues(analyses)

Output:

tests/load/test_load_vcf.py::test_load_vcf PASSED                        [100%]Sample ID: sample
RS Number: rs1065772, Allele 1: C, Allele 2: T
RS Number: rs3752714, Allele 1: A, Allele 2: A
RS Number: rs26840, Allele 1: C, Allele 2: T
RS Number: rs1037256, Allele 1: G, Allele 2: A
RS Number: rs11010, Allele 1: T, Allele 2: T

snps.grch37.txt file:

rs1065772   C   1   44072018
rs3752714   G   7   1542697
rs26840 C   16  2285357
rs1037256   G   17  71197748
rs11010 T   X   102842041

The last snp rs11010 is REF/REF

So it seems like there is no need for changes if we are just able to produce the VCF containing REF/REF?

beatrizsavinhas commented 11 months ago

Added instructions in the documentation on how to generate a vcf containing REF/REF https://github.com/Clinical-Genomics/atlas/pull/1892.