frankvogt / vcf2gwas

Python API for comprehensive GWAS analysis using GEMMA
GNU General Public License v3.0
84 stars 29 forks source link

plink limitation on invalid --chr-set #10

Closed drhoads closed 2 years ago

drhoads commented 2 years ago

Intriguing software release, and very timely. Iinstalled in a WSL2-Ubuntu environment. Ran on a vcf for 1M+ filtered bi-allelic SNPs for 40 chickens from whole genome resequencing, mapped onto the bGalgal1 latest assembly (Bowtie2-samtools-NGSEP pipeline). I had replaced accession numbers in the VCF with 2 character IDs.the program crashed on the conversion to PLEINK and Editing .fam files. During the run after vcf_parse and Contig names not defined in the header warnings got message: Error: Invalid --chr-set parameter '147', and then direction to try 'plink --help ' for more help. This is the issue I ran into years ago when trying to use plink on some SNPchip data that plink didn't like my "chromosome set'.
vcf2gwas.Male.sample.log.txt vcf2gwas_Male.sample.log.txt

drhoads commented 2 years ago

Same error if the vcf contains the accessions rather than Chr designations. I am guessing it doesn't like an odd number for the number of autosomes or just that 147 is way too many.

frankvogt commented 2 years ago

It seems that plink is confusing your character IDs as chromosomes, could you send me a snippet from your VCF file? Thanks!

drhoads commented 2 years ago

I have attached the first 1000 and last 1000 lines from a typical vcf. Had to zip them because the gihub interface doesn't support attaching a .vcf. Easier to see the change in contig identifiers in the tail1000.vcf where the last contig is MT. vcf_files.zip

frankvogt commented 2 years ago

So, plink is indeed limited to 95 chromosomes, but since you have a lot of smaller contigs, plink should be able to handle them if you rename the contigs in the following scheme: contig1, contig2, etc. Most important thing for plink is to have a consistent naming scheme (more details in the plink docs) That is probably the issue since your VCF contains "chromosomes" labeled with numbers as well as letters. Once that is done everything should work normally in vcf2gwas (and also in plink by using the --allow-extra-chr flag).

drhoads commented 2 years ago

Hmmm, I had used the 2 letter designations for the unplaced contigs because I have been using SNPtest to analyze the data and SNPtest only allows 2 characters. I will try your suggestion to see if switching to contigXX will be acceptable to plink

drhoads commented 2 years ago

Unfortunately after converting the .vcf to contain chromosome designations and the unplaced contigs to contig1 thru contig 99, I got the same error at 'converting to PLING BED': Error: Invalid --chr-set parameter '147', although the output says 'Successfuly converted to PLINK BED (Duration: 0.9 seconds)' Nothing of significance in the .log.txt file vcf2gwas.malephenotype.log.txt vcf2gwas_malephenotype.log.txt s

frankvogt commented 2 years ago

Could you update vcf2gwas with conda update vcf2gwas -c bioconda -c conda-forge -c fvogt257 and rerun the analysis? I changed the plink options for contig numbers > 95, so hopefully the problem is solved.

drhoads commented 2 years ago

That did change things but it changed it to a different error. Still occurs at Converting to PLINK BED.. but it is now: Error: Invald chromsome code '27' on line 3852642 of .vcf file. (This is disallowed for humans. Check iif the problem is with your data, or if you forgot to define a different chromosoome set the e.g. --chr-set.)

frankvogt commented 2 years ago

Could you upload the VCF file somewhere and provide me with the link where I can download it? That would be really helpful for testing and solving this issue. Thanks!

drhoads commented 2 years ago

I will try to do that on Friday. I am on the road until then and don't have access to the workspace.

drhoads commented 2 years ago

I will try to do that on Friday. I am on the road until then and don't have access to the workspace.

drhoads commented 2 years ago

I can put the files in a box.com folder and share that folder with you if you send me an email address to authorize in box. You can send privately to drhoads@uark.edu

frankvogt commented 2 years ago

So as plink doesn't seem to be able to handle the mix of that many chromosomes and contigs, there are two solutions:

Before you run the analysis, update vcf2gwas again (conda update vcf2gwas -c bioconda -c conda-forge -c fvogt257) since I tweaked the plink settings once more.

drhoads commented 2 years ago

Got bcftools installed and tried to run the annotate option but getting error that Contig 'NC_052532.1' is not defined in the header. (Quick workaround: index the file with tabix.) Unfortunately that is not clear to me.

drhoads commented 2 years ago

I went back to my tried and true replacement using sed so I won't need the bcftools to change the annotation. I will update vcf2gwas and then let you know.

drhoads commented 2 years ago

After I used sed to replace Accession with contigXXX I am running vcf2gwas and when it got to Compressing VCF file I get a lot of [W::vcf_parse] Contig 'contigXXX' is not defined in the header. {Quick workaround: index the file with tabix.) . I don't remember that from before, but it did get through the compression and entered the Indexing VCF file step. May be working now.

drhoads commented 2 years ago

Yep, vcf2gwas run finished. Thanks so much for working through this with me.

frankvogt commented 2 years ago

Great to hear, I'm glad I could help! (Btw, the command to index the VCF file via bcftools is bcftools index -f [filename] which should be executed before the other bcftools command)