Closed seetarajpara closed 1 month ago
Hey @seetarajpara,
Could you share the exact commands you are using in the loop and the final merge, step by step?
Thanks.
@joansaurina yes, here's what I tried:
RUN_DIR='/Users/Seeta/Desktop/BCSB_GDA_array/VCFs/biallelic_vcfs/multisample_vcf'
for f in ${RUN_DIR}/*.vcf
do
bcftools view -m2 -M2 -v snps $f > biallelic_vcfs/$(basename $f | cut -d. -f1).vcf
done
for f in biallelic_vcfs/*.vcf
bgzip $f | tabix -f -p vcf
done
# Combine all VCF files into one
ls *.vcf.gz > vcf_list.txt
# Use bcftools merge with the file list
bcftools merge -Oz -l vcf_list.txt -o multisample_vcf/batch3_biallelic.vcf.gz
I should add, I also confirmed my merged VCF has only SNPs:
(nadmenv) (base) Seeta@MacBook-Pro-USC multisample_vcf % bcftools view -Oz -m 2 -M 2 --types=snps batch3_biallelic_noDups.vcf.gz | bcftools stats - | awk '/^SN/' > test.vcf.gz
SN 0 number of samples: 159
SN 0 number of records: 1127994
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 1127994
SN 0 number of MNPs: 0
SN 0 number of indels: 0
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
You can try entering python
in a terminal where you have your file and on the neural-admixture conda environment, and then do the following:
import allel
import numpy as np
f_tr = allel.read_vcf('filename') #substitute filename with yours.
calldata = f_tr["calldata/GT"]
A = np.sum(calldata, axis=2).T/2
print('A max:', int(A.max()))
print('A min:', int(A.min()))
If the maximum value is higher than 1, you probably still have multiallelic sites in your data. If the minimum value is lower than 0, your files might be corrupted.
Let us know if this was helpful.
You're right, it looks as though the file is corrupt. I'm not sure at what point this happened, because I tested an individual sample file as well (below)
>>> f_tr = allel.read_vcf('batch3_biallelic_noDups.vcf.gz')
>>> calldata = f_tr["calldata/GT"]
>>> A = np.sum(calldata, axis=2).T/2
>>> print('A max:', int(A.max()))
A max: 2
>>> print('A min:', int(A.min()))
A min: -1
one of the individual samples:
>>> f_tr = allel.read_vcf('159_Salhia_GDA_BCSB0419.vcf.gz')
>>> calldata = f_tr["calldata/GT"]
>>> A = np.sum(calldata, axis=2).T/2
>>> print('A max:', int(A.max()))
A max: 2
>>> print('A min:', int(A.min()))
A min: 0
I will try to merge these individual files and test the tool out again. Thanks for this tip! Will circle back shortly.
ok so I merged 3 of these individual samples, this is where I see the -1 value for the A.min:
>>> f_tr = allel.read_vcf('test_157_158_159.vcf.gz')
>>> calldata = f_tr["calldata/GT"]
>>> A = np.sum(calldata, axis=2).T/2
>>> print('A max:', int(A.max()))
A max: 2
>>> print('A min:', int(A.min()))
A min: -1
Do you have any suggestions for merging VCFs of multiple samples? I think this is where my problem is occurring.
If the software you were using to manipulate the files was indeed the problem, we suggest using PLINK, which can be installed from:
https://www.cog-genomics.org/plink/
After having installed this software you should run the following commands:
# Step 1: Convert each VCF file to PLINK format
# Loop through each VCF file in the directory and convert it to PLINK binary format
for f in ${RUN_DIR}/*.vcf
do
# Use PLINK to convert VCF to PLINK format
./plink --vcf $f --make-bed --out biallelic_vcfs/$(basename $f | cut -d. -f1)
done
# Step 2: Create a list of PLINK files to merge
# List all the .bed files and save the base names (without .bed extension) in plink_file_list.txt
ls biallelic_vcfs/*.bed | sed 's/\.bed$//' > plink_file_list.txt
# Step 3: Merge the PLINK files
# Use PLINK to merge all the PLINK files listed in plink_file_list.txt
./plink --merge-list plink_file_list.txt --make-bed --out multisample_vcf/merged
# Step 4: Filter the merged PLINK file to keep only biallelic SNPs
# Use PLINK to filter the merged file and retain only biallelic SNPs
./plink --bfile multisample_vcf/merged --biallelic-only --make-bed --out multisample_vcf/biallelic_merged
# Step 5: Convert the filtered PLINK file back to VCF format
# If a VCF file is needed, convert the final filtered PLINK file back to VCF format
./plink --bfile multisample_vcf/biallelic_merged --recode vcf --out multisample_vcf/biallelic_merged
Then go back to:
You can try entering
python
in a terminal where you have your file and on the neural-admixture conda environment, and then do the following:import allel import numpy as np f_tr = allel.read_vcf('filename') #substitute filename with yours. calldata = f_tr["calldata/GT"] A = np.sum(calldata, axis=2).T/2 print('A max:', int(A.max())) print('A min:', int(A.min()))
If the maximum value is higher than 1, you probably still have multiallelic sites in your data. If the minimum value is lower than 0, your files might be corrupted.
Let us know if this was helpful.
Thank you SO much, the conversion worked and I was able to apply the neural-admixture train
function to the resulting merged bed file. I will proceed and use this for future samples. I really appreciate your help!
Hello,
I'm trying to use this tool on a multisample VCF I have with ~159 samples, I actually looped through each to only have biallelic SNPs prior to merging the VCFs into one.
bcftools view -m2 -M2 -v snps $f > biallelic_vcfs/$(basename $f | cut -d. -f1).vcf
After this, I merged the VCFs and ran the same command just to be absolutely sure I didn't have multiallelic sites in the merged VCF. I still keep getting this error:
Please let me know what I might be missing here. I appreciate any advice! Thank you so much!