freeseek / gtc2vcf

Tools to convert Illumina IDAT/BPM/EGT/GTC and Affymetrix CEL/CHP files to VCF
MIT License
140 stars 24 forks source link

empty vcf final file #57

Closed sgomes-catarina closed 1 year ago

sgomes-catarina commented 1 year ago

Hello,

I'm running the gtc2vcf plugin on 541 gtc files as below:

# Set variables
bpm_manifest_file="/mnt/p2000/home/illumina_gda/raw/nov_2022_LTO/diag/GDA-8v1-0_D1.bpm"
egt_cluster_file="/mnt/p2000/home/illumina_gda/raw/nov_2022_LTO/diag/GDA-8v1-0_D1_ClusterFile.egt"
csv_manifest_file="/mnt/p2000/home/illumina_gda/raw/nov_2022_LTO/diag/GDA-8v1-0_D1.csv"

path_to_gtc_folder="/mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc"
path_to_output_folder="/mnt/p2000/data4/projects/illumina_gda/intermediate/02_gtc_to_vcf/"
ref="/mnt/p2000/data4/projects/illumina_gda/ref/hg37/human_g1k_v37.fasta"
out_prefix="ROC_GDA-8v1-0_D1_multisample_gtc2vcf"
LANG="en_US.UTF-8"

cd $path_to_output_folder

# Call bcftools gtc to vcf
bcftools +gtc2vcf \
  --no-version -Ou \
  --bpm $bpm_manifest_file \
  --csv $csv_manifest_file \
  --egt $egt_cluster_file \
  --gtcs $path_to_gtc_folder \
  --fasta-ref $ref \
  --extra $out_prefix.tsv | \
  bcftools sort -Ou -T ./bcftools. | \
  bcftools norm --no-version -Ob -c x -f $ref | \
  tee $out_prefix.bcf | \
  bcftools index --force $out_prefix.bcf.csi

The final lines of my log file are:

Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R02C01.gtc
Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R03C01.gtc
Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R04C01.gtc
Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R05C01.gtc
Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R06C01.gtc
Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R07C01.gtc
Reading GTC file /mnt/p2000/data4/projects/illumina_gda/intermediate/01_gencall_idat_to_gtc/206425690038_R08C01.gtc
Writing VCF file
Lines   total/missing-reference/skipped:    1904599/3/107
Merging 54 temporary files

The final BCF file is empty, but 54 individual bcf files were created in a subdirectory "bcftools.irknRH". Is the job terminating before expected due to lack of memory? I'm running it using 24GB in one node and 20 tasks per node.

Thank you in advance for your support.

freeseek commented 1 year ago

The subdirectory bcftools.irknRH is created by bcftools sort to store temporary sorted files. The message Merging 54 temporary files means that bcftools sort completed the sort and was in the processing of merging the temporary output but it must have then crashed before completing for some reasons. Could it be that you don't have space left on the filesystem to write the $out_prefix.bcf file? Try to remove the last three lines and use bcftools sort -Ob -o $out_prefix.bcf -T ./bcftools. to see what happens. Also, please don't use GRCh37. You can realign the GDA-8v1-0_D1.csv manifest file as explained here and then use the output in gtc2vcf with the option --sam-flank

sgomes-catarina commented 11 months ago

Thank you for your reply @freeseek ! I do have free disk space to write the $out_prefix.bcf file. I'm trying to reprocess it using more RAM memory to see what happens. I've realigned the csv manifest file but when running gtc2vcf should I use the GRCh37 or GRCh38 as reference genome?

Thank you again for your support!

freeseek commented 11 months ago

It is useless to add more RAM. The BCFtools/gtc2vcf plugin requires ~3GB for the reference to be cached in memory, and some more memory to load into memory the BPM/CSV/EGT files (also notice BCFtools/gtc2vcf is only multi-threaded for compression/decompression, not for the GTC to VCF conversion itself). BCFtools/sort, which had progressed to the final step when it output:

Merging 54 temporary files

Requires by default a maximum of 768MB. Something else is causing the last step to fail. Maybe BCFtools/norm is failing. You might need to try by iteratively removing the last steps to see what is going on. What happens if you try the same step with one or two GTC files only? Does it complete?

If you are converting to VCF using only the --csv option you need to use the reference of the CSV manifest file. If you are using both the --csv and the --sam-flank options, then you need to use the reference used for the aligned SAM/BAM file. If you are wondering what you should realign your manifest file against, then my advice is to always prefer GRCh38 over GRCh37. There should be no reasons left to use GRCh37 as of today