AI-sandbox / neural-admixture

Rapid population clustering with autoencoders

After selecting only biallelic SNPs in a multi-sample VCF, still getting the error "AssertionError: Only biallelic SNPs are supported. Please make sure multiallelic sites have been removed." Please advise #24

Closed. seetarajpara closed this issue 1 month ago.

seetarajpara commented 1 month ago

Hello,

I'm trying to use this tool on a multi-sample VCF with ~159 samples. Before merging the per-sample VCFs into one, I looped through each of them to keep only biallelic SNPs:

bcftools view -m2 -M2 -v snps $f > biallelic_vcfs/$(basename $f | cut -d. -f1).vcf

After this, I merged the VCFs and ran the same command just to be absolutely sure I didn't have multiallelic sites in the merged VCF. I still keep getting this error:

(nadmenv) (base) Seeta@MacBook-Pro-USC multisample_vcf % neural-admixture train --k 7 --data_path batch3_biallelic_v2.vcf.gz --save_dir /Users/Seeta/Desktop/BCSB_GDA_array/admixture_output/ --name test --max_epochs 5 --save_every 5 --seed 42 --initialization pckmeans
INFO:neural_admixture.entry:Neural ADMIXTURE - Version 1.4.1
INFO:neural_admixture.entry:[CHANGELOG] Mean imputation for missing data was added in version 1.4.0. To reproduce old behaviour, please use `--imputation zero` when invoking the software.
INFO:neural_admixture.entry:[CHANGELOG] Default P initialization was changed to 'pckmeans' in version 1.3.0.
INFO:neural_admixture.entry:[CHANGELOG] Warmup training for initialization of Q was added in version 1.3.0 to improve training stability (only for `pckmeans`).
INFO:neural_admixture.entry:[CHANGELOG] Convergence check changed so it is performed after 15 epochs in version 1.3.0 to improve training stability.
INFO:neural_admixture.entry:[CHANGELOG] Default learning rate was changed to 1e-5 instead of 1e-4 in version 1.3.0 to improve training stability.
/Users/Seeta/venv/nadmenv/lib/python3.9/site-packages/urllib3/__init__.py:35: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
INFO:neural_admixture.src.utils:Reading data...
INFO:neural_admixture.src.snp_reader:Input format is VCF.
Traceback (most recent call last):
  File "/Users/Seeta/venv/nadmenv/bin/neural-admixture", line 8, in <module>
    sys.exit(main())
  File "/Users/Seeta/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/entry.py", line 19, in main
    sys.exit(train.main(arg_list[2:]))
  File "/Users/Seeta/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/train.py", line 139, in main
    trX, trY, valX, valY = utils.read_data(tr_file, val_file, tr_pops_f, val_pops_f, imputation=args.imputation)
  File "/Users/Seeta/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/utils.py", line 126, in read_data
    tr_snps = snp_reader.read_data(tr_file, imputation)
  File "/Users/Seeta/venv/nadmenv/lib/python3.9/site-packages/neural_admixture/src/snp_reader.py", line 124, in read_data
    assert int(G.min().compute()) == 0 and int(G.max().compute()) == 1, 'Only biallelic SNPs are supported. Please make sure multiallelic sites have been removed.'
AssertionError: Only biallelic SNPs are supported. Please make sure multiallelic sites have been removed.

Please let me know what I might be missing here. I appreciate any advice! Thank you so much!

joansaurina commented 1 month ago

Hey @seetarajpara,

Could you share the exact commands you are using in the loop and the final merge, step by step?

Thanks.

seetarajpara commented 1 month ago

@joansaurina yes, here's what I tried:

RUN_DIR='/Users/Seeta/Desktop/BCSB_GDA_array/VCFs/biallelic_vcfs/multisample_vcf'
for f in ${RUN_DIR}/*.vcf
do
bcftools view -m2 -M2 -v snps $f > biallelic_vcfs/$(basename $f | cut -d. -f1).vcf
done

for f in biallelic_vcfs/*.vcf
do
bgzip $f && tabix -f -p vcf $f.gz
done

# Create a list of the compressed VCF files to merge
ls *.vcf.gz > vcf_list.txt

# Use bcftools merge with the file list
bcftools merge -Oz -l vcf_list.txt -o multisample_vcf/batch3_biallelic.vcf.gz

seetarajpara commented 1 month ago

I should add that I also confirmed my merged VCF contains only biallelic SNPs:

(nadmenv) (base) Seeta@MacBook-Pro-USC multisample_vcf % bcftools view -Oz -m 2 -M 2 --types=snps batch3_biallelic_noDups.vcf.gz | bcftools stats - | awk '/^SN/' > test.vcf.gz
SN      0       number of samples:      159
SN      0       number of records:      1127994
SN      0       number of no-ALTs:      0
SN      0       number of SNPs: 1127994
SN      0       number of MNPs: 0
SN      0       number of indels:       0
SN      0       number of others:       0
SN      0       number of multiallelic sites:   0
SN      0       number of multiallelic SNP sites:       0
joansaurina commented 1 month ago

You can try opening Python in a terminal, in the directory containing your file and with the neural-admixture conda environment activated, and then run the following:

import allel
import numpy as np
f_tr = allel.read_vcf('filename') #substitute filename with yours.
calldata = f_tr["calldata/GT"]
A = np.sum(calldata, axis=2).T/2
print('A max:', int(A.max()))
print('A min:', int(A.min()))

If the maximum value is higher than 1, you probably still have multiallelic sites in your data. If the minimum value is lower than 0, your files might be corrupted.

Let us know if this was helpful.
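
For intuition, here is a minimal sketch with synthetic genotypes (not taken from the issue data) of why this check works: scikit-allel encodes each genotype as allele indices per haplotype (0 = REF, 1 = first ALT, 2 and above = further ALT alleles, -1 = missing), so summing over the ploidy axis and dividing by 2 stays within [0, 1] only for complete biallelic calls.

import numpy as np

# calldata/GT has shape (variants, samples, ploidy); entries are allele indices
gt = np.array([
    [[0, 1], [1, 1]],    # biallelic site     -> dosages 0.5 and 1.0
    [[0, 2], [2, 2]],    # multiallelic site  -> dosages 1.0 and 2.0 (max > 1)
    [[-1, -1], [0, 0]],  # missing genotype   -> dosage -1.0 (min < 0)
])

A = np.sum(gt, axis=2).T / 2   # same transform as in the snippet above
print('A max:', int(A.max()))  # 2  -> multiallelic alleles present
print('A min:', int(A.min()))  # -1 -> missing (or corrupted) calls present

Note also that bcftools merge fills records that are absent from some input files with missing genotypes (./.), which scikit-allel reads as -1; that would explain a merged VCF failing the minimum check even when every per-sample VCF passes it.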

seetarajpara commented 1 month ago

You're right, it looks as though the file is corrupt. I'm not sure at what point this happened, because I tested an individual sample file as well (below):

>>> f_tr = allel.read_vcf('batch3_biallelic_noDups.vcf.gz')
>>> calldata = f_tr["calldata/GT"]
>>> A = np.sum(calldata, axis=2).T/2
>>> print('A max:', int(A.max()))
A max: 2
>>> print('A min:', int(A.min()))
A min: -1

One of the individual samples:

>>> f_tr = allel.read_vcf('159_Salhia_GDA_BCSB0419.vcf.gz')
>>> calldata = f_tr["calldata/GT"]
>>> A = np.sum(calldata, axis=2).T/2
>>> print('A max:', int(A.max()))
A max: 2
>>> print('A min:', int(A.min()))
A min: 0

I will try to merge these individual files and test the tool out again. Thanks for this tip! Will circle back shortly.

seetarajpara commented 1 month ago

OK, so I merged 3 of these individual samples; this is where I see the -1 value for A.min:

>>> f_tr = allel.read_vcf('test_157_158_159.vcf.gz')
>>> calldata = f_tr["calldata/GT"]
>>> A = np.sum(calldata, axis=2).T/2
>>> print('A max:', int(A.max()))
A max: 2
>>> print('A min:', int(A.min()))
A min: -1

Do you have any suggestions for merging VCFs of multiple samples? I think this is where my problem is occurring.

joansaurina commented 1 month ago

If the software you were using to manipulate the files was indeed the problem, we suggest using PLINK, which can be installed from:

https://www.cog-genomics.org/plink/

After installing PLINK, run the following commands:

# Step 1: Convert each VCF file to PLINK format
# Loop through each VCF file in the directory and convert it to PLINK binary format
for f in ${RUN_DIR}/*.vcf
do
  # Use PLINK to convert VCF to PLINK format
  ./plink --vcf $f --make-bed --out biallelic_vcfs/$(basename $f | cut -d. -f1)
done

# Step 2: Create a list of PLINK files to merge
# List all the .bed files and save the base names (without .bed extension) in plink_file_list.txt
ls biallelic_vcfs/*.bed | sed 's/\.bed$//' > plink_file_list.txt

# Step 3: Merge the PLINK files
# Use PLINK to merge all the PLINK files listed in plink_file_list.txt
./plink --merge-list plink_file_list.txt --make-bed --out multisample_vcf/merged

# Step 4: Filter the merged PLINK file to keep only biallelic SNPs
# Use PLINK to filter the merged file and retain only biallelic SNPs
./plink --bfile multisample_vcf/merged --biallelic-only --make-bed --out multisample_vcf/biallelic_merged

# Step 5: Convert the filtered PLINK file back to VCF format
# If a VCF file is needed, convert the final filtered PLINK file back to VCF format
./plink --bfile multisample_vcf/biallelic_merged --recode vcf --out multisample_vcf/biallelic_merged
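
For convenience, the same min/max sanity check can be wrapped in a small helper and pointed at the Step 5 output (a sketch assuming scikit-allel is installed; check_biallelic is just an illustrative name, and the path matches the --out prefix used above):

import allel
import numpy as np

def check_biallelic(path):
    # Expect max == 1 and min == 0 for a clean biallelic VCF with no missing calls
    gt = allel.read_vcf(path)["calldata/GT"]
    A = np.sum(gt, axis=2).T / 2
    print(path, '-> A max:', int(A.max()), '| A min:', int(A.min()))

check_biallelic('multisample_vcf/biallelic_merged.vcf')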

Then go back to the Python check from my earlier comment and run it on the resulting VCF: if the maximum of A is higher than 1 you probably still have multiallelic sites, and if the minimum is lower than 0 the file still has missing or corrupted genotypes.

Let us know if this was helpful.

seetarajpara commented 1 month ago

Thank you SO much! The conversion worked, and I was able to run neural-admixture train on the resulting merged .bed file. I will proceed and use this approach for future samples. I really appreciate your help!