RAHenriksen / NGSNGS

NGSNGS: Next generation simulator for next generation sequencing data
46 stars 4 forks source link

NGSNGS fails to recognize any genotyp from vcf #31

Closed Npaffen closed 1 year ago

Npaffen commented 1 year ago

When I try to simulate a paired-end bam file using an input vcf to provide mutations for the reference fasta the tool seems to be unable to recognize any kind genotype. I tried both phasing parameter for the genotype representation (phased '|' and unphased '/'). No matter what I try the tool just prompts for every position:

      -> Genotype is missing for pos: 41818328 will skip
        -> Genotype is missing for pos: 41818387 will skip
        -> Genotype is missing for pos: 41818426 will skip
        -> Genotype is missing for pos: 41818444 will skip
        -> Genotype is missing for pos: 41818445 will skip
        -> Genotype is missing for pos: 41818458 will skip
        -> Genotype is missing for pos: 41818534 will skip
        -> Genotype is missing for pos: 41818543 will skip
        -> Genotype is missing for pos: 41818573 will skip
        -> Genotype is missing for pos: 41818591 will skip
        -> Genotype is missing for pos: 41818593 will skip
        -> Genotype is missing for pos: 41818613 will skip
        -> Genotype is missing for pos: 41818645 will skip
        -> Genotype is missing for pos: 41818702 will skip
        -> Genotype is missing for pos: 41818968 will skip
        -> Genotype is missing for pos: 41818976 will skip
        -> Genotype is missing for pos: 41818979 will skip
        -> Genotype is missing for pos: 41819017 will skip
        -> Genotype is missing for pos: 41819022 will skip

Furthermore when it reached the end of the last bp position it just starts again to iterate from the beginning leading to an infinite loop.

I used the following command : ./ngsngs -i ~/calling/references/GRCh37/hg19_split/chr22.fa -c 1 -seq PE -f bam -o ~/read_sampler/embryo_1 -l 150 -q1 Test_Examples/AccFreqL150R1.txt -q2 Test_Examples/AccFreqL150R2.txt -vcf ~/read_sampler/embryo_1_header_geno_chr22.vcf.gz All files to replicate this issue can be downloaded here : https://we.tl/t-ahBiPjXrxx

ANGSD commented 1 year ago

Dear @Npaffen , thanks for trying our software and contacting us about the issue. It is much appreciated that you have taken the time to contact us so we can improve our program. It is also very valuable that you have uploaded the needed files so we can reproduce the issue.

First, a small comment. You should ensure that the refidname in your fasta matches the referenceids in the vcf/bcf. In the files supplied the chromosome name was chr22 and in the vcf it was 22. This is a minor thing and was not what was causing the error and might have been caused when you prepared the example dataset.

The quick solution to the actual issue is that you should add -id 0. This will tell the program that you are interested in using the GT tag for the first individual. This is of course, something that the program should check when it runs such that it can give the user a proper error message. The program was made for the general situation where the user might have a big vcffile containing genotype calls for many samples, and then they could use the -id option to tell precisely which sample to use as the basis for generating reads.

I also assume that most users would have a vcf just as yours with GT for a single sample, and if that is the case then the program should set the -id 0 internally.

I will keep this issue open so we dont forget to update the program.

Again thanks for reporting this issue.

Best regards

RAHenriksen commented 1 year ago

Dear Npaffen,

This issue have now been resolved. Please let use know if you encounter more issues regarding VCF simulations