Open SofiaOtero opened 1 year ago
@beatrizsavinhas and I have been looking into the snps that are not present in our bcf file but is present in the MAF results for 3 samples and here it shows that the variants that are not present in our file are REF/REF calls from MAF.
Apparently we do not report homozygous reference calls REF/REF in our file. Could be nice to check how this file is created and why we do not add the variant calls for REF/REF.
@beatrizsavinhas and I have been looking into the snps that are not present in our bcf file but is present in the MAF results for 3 samples and here it shows that the variants that are not present in our file are REF/REF calls from MAF.
Apparently we do not report homozygous reference calls REF/REF in our file. Could be nice to check how this file is created and why we do not add the variant calls for REF/REF.
Thanks for summarising all this info @SofiaOtero and @beatrizsavinhas! Variant callers (which create the VCFs) only reports locations where the genome deviates from the reference. So no REF/REF positions are found in the VCFs. This basically means that genotype trusts the variant callers and if nothing is reported in the VCF, genotype records it as REF/REF. One can argue whether this is ideal or not.
In the description could you describe when this becomes an issue, how often, and how much time it takes you to verify stuff manually or fix it manually? It will make it easier for us to weigh this issue against others 😊
That is correct that is what genotype assumes now.
It is not something that has a huge impact on production, but it is not correct to just state it is a match when we are not completely sure about it. We can extract the REF/REF calls from the bam files. Could be a solution to make a genotype_vcf file where we extract all 51 snps that we send to MAF, then we can compare this file with the MAF calls, and see if it is actually a match or if the REF/REF in our sequence did actually not get any reads.
@SofiaOtero @beatrizsavinhas. I've looked at the code and it should be possible to upload a VCF file containing REF/REF. If you can produce one we can try and upload it. However I do not know the changes needed in the pipelines to produce such a file.
Is it in genotype or genotype-api that it is created?
In genotype. Here are the relevant lines: https://github.com/Clinical-Genomics/genotype/blob/72c405b6dbaa198d9b724135dcac6307bf47ca87/genotype/load/vcf.py#L28-L43
Seems like the REF/REF are actually saved if I run the test on:
def load_vcf(vcf_file: str, snps: List[SNP]) -> List[Analysis]:
"""Load genotypes from a BCF/VCF.gz file."""
vcf = VCF(vcf_file)
# generate Analysis records for each included sample
source = os.path.abspath(vcf_file)
analyses = {
sample_id: Analysis(type="sequence", source=source, sample_id=sample_id)
for sample_id in vcf.samples
}
for snp in snps:
variant = fetch_snp(vcf, snp)
if variant is None:
# assume REF/REF
for analysis in itervalues(analyses):
genotype = Genotype(rsnumber=snp.id, allele_1=snp.ref, allele_2=snp.ref)
analysis.genotypes.append(genotype)
else:
raw_genotypes = variant_genotypes(vcf.samples, variant)
for raw_gt in raw_genotypes:
genotype = Genotype(
rsnumber=snp.id, allele_1=raw_gt.allele_1, allele_2=raw_gt.allele_2
)
analysis = analyses[raw_gt.sample]
analysis.genotypes.append(genotype)
for analysis in itervalues(analyses):
print(f"Sample ID: {analysis.sample_id}")
for genotype in analysis.genotypes:
print(
f"RS Number: {genotype.rsnumber}, Allele 1: {genotype.allele_1}, Allele 2: {genotype.allele_2}"
)
return itervalues(analyses)
Output:
tests/load/test_load_vcf.py::test_load_vcf PASSED [100%]Sample ID: sample
RS Number: rs1065772, Allele 1: C, Allele 2: T
RS Number: rs3752714, Allele 1: A, Allele 2: A
RS Number: rs26840, Allele 1: C, Allele 2: T
RS Number: rs1037256, Allele 1: G, Allele 2: A
RS Number: rs11010, Allele 1: T, Allele 2: T
snps.grch37.txt file:
rs1065772 C 1 44072018
rs3752714 G 7 1542697
rs26840 C 16 2285357
rs1037256 G 17 71197748
rs11010 T X 102842041
The last snp rs11010 is REF/REF
So it seems like there is no need for changes if we are just able to produce the VCF containing REF/REF?
Added instructions in the documentation on how to generate a vcf containing REF/REF https://github.com/Clinical-Genomics/atlas/pull/1892.
Description
During the investigation of MAF results we noticed that Genotype is only showing the MAF variant calls and not the sequence calls from our bcf file. Examples of this is described in sheet. In sheet there are examples of samples from out bcl files with many no-calls, which do not have many no-calls from MAF.
Solution
Update the sequence - sequence matching in genotype so we can see the actual amount of no-calls (Unknown) that are present in our bcf files.
In the actual plates it is also documented that the sample has the amount of matches based on the calls from MAF and not the actual matches to our sequences. This would cause that we would have a lot of samples that do not pass the threshold of max 15 no-calls that we have right now. So it should be considered if we should make an unknown column for MAF and for our calls and if we will pass the sample based on solely the MAF results, because as for now were are passing all samples that have no-calls < 15 from MAF.
This can be closed when
Genotype has been updated and we decide if samples pass based on only MAF results as currently.