cggh / scikit-allel

A Python package for exploring and analysing genetic variation data
MIT License

ALT fields with . + seq in vcf_to_dataframe are annotated as nan. #349

Open gmgs-999 opened 3 years ago

gmgs-999 commented 3 years ago

Hi, I'm Gabriel. I'm working on my thesis with structural variants (SVs) and VCF files, and I'm writing a script that converts SVs in BND (breakend) notation into a short annotation. One type of insertion is a single-breakend INS, which is annotated like this: REF: A; ALT: .AGTA(etc.).

Example: ALELO_EJEMPLO

The ALT of the variant with ID gridss273b_534b starts with "." (SnpEff annotated this variant as INS).

In Python, I load the file with:

vcf_1kg = allel.vcf_to_dataframe(ruta_vcf, fields=['*'], exclude_fields=['FILTER_NO_RP', 'FILTER_SINGLE_ASSEMBLY', 'FILTER_ASSEMBLY_TOO_FEW_READ', 'FILTER_NO_ASSEMBLY', 'FILTER_NO_SR', 'FILTER_INSUFFICIENT_SUPPORT', 'FILTER_ASSEMBLY_BIAS', 'FILTER_ASSEMBLY_ONLY', 'FILTER_ASSEMBLY_TOO_SHORT', 'FILTER_SMALL_EVENT', 'FILTER_REF', 'FILTER_SINGLE_SUPPORT', 'FILTER_LOW_QUAL', 'FILTER_SnpSift'], alt_number=1)

(Note: ruta_vcf is the path to the VCF. The STRAND, INS_LEN and INS_SEQ fields were added to the DataFrame later in the script.)

ALELO_EJEMPLO_python

The ALT field is annotated as nan. I understand that a leading "." is treated as a missing value, and that this is why these fields become nan. Can this be solved? I'm using allel v1.3.2 and Python 3.5.3. Thank you.
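One possible workaround, as a minimal sketch only: read the ALT column straight from the VCF text, so that a leading "." is kept verbatim instead of being interpreted as a missing value. The helper names `iter_raw_alts` and `single_breakend_sequence` are hypothetical (they are not part of scikit-allel), the example records are made up apart from the ID quoted above, and the sketch assumes an uncompressed, tab-delimited VCF with a single ALT allele per record.

```python
import io


def iter_raw_alts(lines):
    """Yield (chrom, pos, variant_id, ref, alt) per record, keeping ALT verbatim."""
    for line in lines:
        # Skip blank lines, meta lines (##...) and the column header (#CHROM...).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record
        chrom, pos, variant_id, ref, alt = fields[:5]
        yield chrom, int(pos), variant_id, ref, alt


def single_breakend_sequence(alt):
    """Return the bases of a single-breakend ALT ('.AGTA' or 'AGTA.'), else None."""
    if len(alt) > 1 and alt.startswith(".") and not alt.endswith("."):
        return alt[1:]
    if len(alt) > 1 and alt.endswith(".") and not alt.startswith("."):
        return alt[:-1]
    return None  # ordinary allele, or a bare "." (truly missing)


# Hypothetical input: only the ID comes from the report, everything else is invented.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\tgridss273b_534b\tA\t.AGTA\t.\tPASS\tSVTYPE=BND\n"
    "chr1\t200\tsnv1\tC\tT\t.\tPASS\t.\n"
)
records = list(iter_raw_alts(example))
for chrom, pos, variant_id, ref, alt in records:
    print(variant_id, alt, single_breakend_sequence(alt))
```

With a real file, the same generator could be fed `open(ruta_vcf)`. The raw ALT strings could then be written back into the DataFrame's ALT column, assuming the DataFrame rows are in the same order as the records in the file.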