brentp / cyvcf2

cython + htslib == fast VCF and BCF processing
MIT License
369 stars 72 forks source link

##INFO fields being dropped #302

Open FlexLuthORF opened 6 months ago

FlexLuthORF commented 6 months ago
 cat 2202415007_hemi.vcf | head

/home/zmvanw01/projects/12-sample-comparison/hifiasm-path/geno_analysis/per_samp/2202415007/change_to_hemi/2202415007_hemi.vcf

=mpileup -B -a QS -Ou -f /home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta --threads 11 /home/zmvanw01/projects/12-sample-comparison/hifiasm-path/2202415007/2202415007.editRG.bam

##reference=file:///home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta

##contig=<ID=chr10,length=133797422>

##contig=<ID=chr11,length=135086622>

##contig=<ID=chr12,length=133275309>

##contig=<ID=chr13,length=114364328>

##contig=<ID=chr14,length=105480000>

##contig=<ID=chr15,length=101991189>

##contig=<ID=chr16,length=90338345>

(bcftools) [zmvanw01@bigdata change_to_hemi]$ cat ../2202415007.vcf | head

##fileformat=VCFv4.2

##FILTER=<ID=PASS,Description="All filters passed">

##bcftoolsVersion=1.19+htslib-1.19.1

##bcftoolsCommand=mpileup -B -a QS -Ou -f /home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta --threads 11 /home/zmvanw01/projects/12-sample-comparison/hifiasm-path/2202415007/2202415007.editRG.bam

##reference=file:///home/zmvanw01/git_repos/immune_receptor_genomics/current/reference.fasta

I am editing the below vcf file with the following script:

def change_genotypes(input_vcf, bedfile, output_path):
    vcf_reader = VCF(input_vcf, strict_gt=True)

    # Create a new VCF Writer using the input VCF as a template
    vcf_writer = Writer(output_path, vcf_reader)

    coords = read_bedfile(bedfile)

    for record in vcf_reader:
        for index, sample in enumerate(vcf_reader.samples):
            change_snp = False
            for chrom, start, end in coords:
                if record.CHROM != chrom:
                    continue
                if record.POS < start:
                    continue
                if record.POS > end:
                    continue
                change_snp = True
                break

            if change_snp:
                current_genotype = record.genotypes[index]
                if current_genotype is None or -1 in current_genotype:
                    record.genotypes[index] = [-1, -1, False]  # Set genotype to unknown
                elif 1 in current_genotype:
                    record.genotypes[index] = [-1, 1, False]  # Set genotype to ./1
                else:
                    record.genotypes[index] = [-1, 0, False]  # Set genotype to ./0

                # Reassign the genotypes field
                record.genotypes = record.genotypes

        vcf_writer.write_record(record)

    vcf_writer.close()
    vcf_reader.close()

I expect it to copy all ##info fields from the template to the new vcf, but it seems to truncate it? I am missing:

fileformat=VCFv4.2

FILTER=

bcftoolsVersion=1.19+htslib-1.19.1

bcftoolsCommand

from the newly created file and it is causing downstream issues. It also seems to be placing the path to the new file right at the top with now ## before it.

brentp commented 6 months ago

hi @FlexLuthORF can you attach a small VCF (can be just the header and a single variant) that shows the issue?

FlexLuthORF commented 6 months ago

2201410002_head70.vcf.txt 2201410002_hemi_head70.vcf.txt

The first vcf is the one I am editing with cyvcf2 and the second one is the output that has strangely missing and cleaved ## fields.

brentp commented 6 months ago

If I run this:

python issue302.py 2201410002_head70.vcf.txt > o.vcf
bcftools view o.vcf > /dev/null

I get no error. Here are the contents of issue302.py. If you can show me how to recreate, then we can debug, but I suspect something else is going on:

from cyvcf2 import VCF, Writer

def change_genotypes(input_vcf, bedfile, output_path):
    vcf_reader = VCF(input_vcf, strict_gt=True)

    # Create a new VCF Writer using the input VCF as a template
    vcf_writer = Writer(output_path, vcf_reader)

    for record in vcf_reader:
        for index, sample in enumerate(vcf_reader.samples):

            if True:
                current_genotype = record.genotypes[index]
                if current_genotype is None or -1 in current_genotype:
                    record.genotypes[index] = [-1, -1, False]  # Set genotype to unknown
                elif 1 in current_genotype:
                    record.genotypes[index] = [-1, 1, False]  # Set genotype to ./1
                else:
                    record.genotypes[index] = [-1, 0, False]  # Set genotype to ./0

                # Reassign the genotypes field
                record.genotypes = record.genotypes

        vcf_writer.write_record(record)

    vcf_writer.close()
    vcf_reader.close()

if __name__ == "__main__":
    import sys
    change_genotypes(sys.argv[1], None, "-")
FlexLuthORF commented 6 months ago

Can you show the output vcf file? I get no traceback error either. It's that it messes up the ## fields and if ##fileformat is missing then I cant bcftools index the .vcf.gz

It does change the genotypes and all that. Its just dropping ## fields and appending the new file location at the top with no ## before it.

`cat 2201410002_hemi.bed

chr2 90225183 90249908 IGKV1-NL1`

thats the bed file it is using in case that helps. Only one line

brentp commented 6 months ago

This happens for you using my code above? The o.vcf is well formed with the correct headers.

FlexLuthORF commented 6 months ago

Interesting. You are correct, your code has all the ## fields. I cant figure out why the bit of me checking if the entry is within coords on a bed file is causing the issue. Thank you for your help. If I figure out the issue I will post it here... Though for now, since it was only the top few ## I am just using some bash commands to strip and re-add the info fields so I can continue my analysis.

Perhaps how I am writing the vcf afterwards?

    output_vcf="${sample_outd}/change_to_hemi/${sample}_hemi.vcf"

    python "${changeg}" "${of}.vcf" "$sample_sv_results" \
        "${SV_regions_entire}" "${sample}" > "${output_vcf}"

I am calling the python script within a batch script and redircting the returned vcf into a output file. Yeah I think this is the problem and its an issue on my side.