freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
131 stars 22 forks source link

Affy2vcf snp-posterior file #58

Open IPriscilla opened 8 months ago

IPriscilla commented 8 months ago

Hi Giulio,

The array Finngen2 Axiom custom array from ThermoFisher was used for genotyping in our project. We would like to convert the .CEL files to a .vcf file to run the MoChA pipeline. We found your answer on https://www.biostars.org/p/9554233/ about this type of array and implemented this apt-genotype-axiom tool to convert our .CEL files to .txt files as input for the affy2vcf tool with this code:

path_to_output_folder="[PATH_TO_PROJECT]/CHP_files"
cel_list_file="[PATH_TO_PROJECT]/CEL_FILES_PASSEDQC.txt"
ref_files="[PATH_TO_PROJECT]/ref_files/Axiom_FinnGen2_Analysis.r1"

apt-genotype-axiom \
    --analysis-files-path $ref_files/ \
    --arg-file $ref_files/Axiom_FinnGen2_96orMore_Step2.r1.apt-genotype-axiom.AxiomGT1.apt2.xml \
    --do-rare-het-adjustment true \
    --out-dir $path_to_output_folder \
    --cel-files $cel_list_file \
    --log-file $path_to_output_folder/genotyping.log \
    --genotyping-node:snp-posteriors-output true

However, using this option --genotyping-node:snp-posteriors-output true, a lot of SNPs are lost in the conversion in the AxiomGT1.calls.txt (n=217,259 SNPs with this --genotyping-node:snp-posteriors-output true option and n=643,408 SNPs without this option). This is the parameter for this file in the .xml file: --multi-genotyping-node:multi-posteriors-output true.

Without this --genotyping-node:snp-posteriors-output true option, no AxiomGT1.snp-posteriors.txt file was created but only the AxiomGT1.snp-posteriors.multi.txt file. Is it possible to run the affy2vcf tool with only the AxiomGT1.snp-posteriors.multi.txt file? Or do I need the AxiomGT1.snp-posteriors.txt as well?

This is how we run the affy2vcf tool:

csv_manifest_file="[PATH_TO_PROJECT]/Axiom_FinnGen2.na36.r1.a1.annot.b37.csv"
ref="[PATH_TO_PROJECT]/human_g1k_v37.fasta"
brondir='[PATH_TO_PROJECT]/CHP_files'
targetdir='[PATH_TO_PROJECT]/1.CHP_to_VCF_conversion'
pfx='converted'

bcftools +affy2vcf \
  --csv $csv_manifest_file \
  --fasta-ref $ref \
  --calls $brondir/AxiomGT1.calls.txt \
  --confidences $brondir/AxiomGT1.confidences.txt \
  --summary $brondir/AxiomGT1.summary.txt \
  --snp $brondir/AxiomGT1.snp-posteriors.multi.txt | \
bcftools sort -Ou -T ./bcftools. | \
bcftools norm --no-version -Ob -c x -f $ref | \
tee $pfx.bcf | \
bcftools index --force --output $pfx.bcf.csi

If we use the 'AxiomGT1.snp-posteriors.multi.txt' file, we get this error:

Could not read VCF/BCF headers from -
Cleaning
Failed to read from standard input: unknown file type
index: "-" is in a format that cannot be usefully indexed

Our questions are:

freeseek commented 8 months ago

Unfortunately Affymetrix does not do a good job of providing Axiom array data for developers to test their code. I am not sure what the problem with your genotype losses is due to. The option --genotyping-node:snp-posteriors-output true should really not change anything other than providing an additional output. This might be an Affymetrix bug or maybe your command died while running. The error you show for BCFtools/affy2vcf is likely related to the first command having an issue. Maybe try rerunning without piping into BCFtools/sort to see what problem you get. I also don't understand why you have a .snp-posteriors.multi.txt rather than a .snp-posteriors.txt file. Maybe the format is slightly different and that is what is causing the problem. Would it be possible for me to have access to some of these file to test on my end? I would be happy to update the code to make it work as it is most likely a minor format issue if it is an issue with BCFtools/affy2vcf at all.

IPriscilla commented 8 months ago

Thanks for your fast reply! Without piping into BCFtools/sort I got the same error. First, I will try to run the script again and give it more memory and time. I did not get an error because of time or memory limits but maybe it will fix the problem. If that is not the case, I will discuss with my collegues about sharing some (or parts) of our files.

freeseek commented 8 months ago

I highly doubt that without piping into BCFtools/sort you still get the error:

Could not read VCF/BCF headers from -

As this is an error message from BCFtools/sort

IPriscilla commented 8 months ago

Ah yes I did something wrong. This is the error from the affy2vcf tool if I use the AxiomGT1.snp-posteriors.multi.txt file as input:

Reading SNP posteriors file /groups/umcg-lifelines/tmp01/projects/ov19_0503/MoChA_pipeline_UGLI2/0.CEL_to_CHP_conversion/CHP_files_genotyping/AxiomGT1.snp-posteriors.multi.txt
Malformed header line in SNP model file /groups/umcg-lifelines/tmp01/projects/ov19_0503/MoChA_pipeline_UGLI2/0.CEL_to_CHP_conversion/CHP_files_genotyping/AxiomGT1.snp-posteriors.multi.txt:
probeset_id     copynumber      nAlleles        cluster mean    nObsMean        covariance      nObsVar

If I open the AxiomGT1.snp-posteriors.multi.txt file this is the header:

probeset_id copynumber  nAlleles    cluster mean    nObsMean    covariance  nObsVar
id_1    2   3   AA  13.138457,11.458281,11.744538   0.200000    0.030000,0.000000,0.000000,0.030000,0.000000,0.030000   1.000000

and this is the header of the file AxiomGT1.snp-posteriors.txt:

id  BB  AB  AA  CV
id_1    -2.861181,0.037860,8476.672760,8476.672760,9.261750,0.068677,-0.020879  0.024676,0.037860,13887.279746,13887.279746,10.130948,0.068677,-0.003455    2.733218,0.037860,5945.047494,5945.047494,9.343756,0.068677,0.005921    -0.1,-0.1,-0.1,-0.05,-0.1,-0.05,0,0,0,0,0,0

I think these files contain different information?

freeseek commented 8 months ago

This does not seem to make much sense. Are you sure your header isn't something like this for AxiomGT1.snp-posteriors.txt:

id  BB  AB  AA  CV
id_1    -2.861181,0.037860,8476.672760,8476.672760,9.261750,0.068677,-0.020879  0.024676,0.037860,13887.279746,13887.279746,10.130948,0.068677,-0.003455    2.733218,0.037860,5945.047494,5945.047494,9.343756,0.068677,0.005921    -0.1,-0.1,-0.1,-0.05,-0.1,-0.05,0,0,0,0,0,0

It just does not seem like the header and what comes after that have the same number of columns, which would definitely be a bug. What version of Affymetrix Power Tools are you using?

Also, if you have both the AxiomGT1.snp-posteriors.txt and the AxiomGT1.snp-posteriors.multi.txt files, why are you using the latter with BCFtools/affy2vcf?

IPriscilla commented 8 months ago

Yes indeed, I checked it and something went wrong with copying the head of this file. I changed it in my previous post.

The problem is when I use the option --genotyping-node:snp-posteriors-output true, a lot of SNPs disappear in conversion. When I leave this option out, I only have the AxiomGT1.snp-posteriors.multi.txt file, which cannot be used as input for the affy2vcf tool (as described in the first post).

I'm running the apt-genotype-axiom tool now again with more memory and time but it will take some days before I have the result. Hopefully, altough I did not get a memory/time limit error, this will fix the problem of losing so much SNPs in the conversion. I will keep you updated.

freeseek commented 8 months ago

Can you also share the correct first two lines of the AxiomGT1.snp-posteriors.multi.txt file? As it stands edited at this moment it seems as if the files AxiomGT1.snp-posteriors.txt and AxiomGT1.snp-posteriors.multi.txt files have the same data on the second line. I want to understand whether it would be possible to parse the AxiomGT1.snp-posteriors.multi.txt file

IPriscilla commented 8 months ago

The previous comment is updated with the header from the AxiomGT1.snp-posteriors.multi.txt file.

freeseek commented 8 months ago

It is still extremely puzzling as the header has seven columns and the first line of the table has eight columns. And it is quite unclear what column the AA entry should correspond to