apriha / snps

tools for reading, writing, merging, and remapping SNPs
BSD 3-Clause "New" or "Revised" License

ValueError: invalid literal for int() with base 10: '48169282.0': Error while type casting for column 'pos' #177

Open IMingGarson opened 3 months ago

IMingGarson commented 3 months ago

Hi, I ran into this error when reading a VCF file:

ValueError: invalid literal for int() with base 10: '48169282.0': Error while type casting for column 'pos'

Version: snps 2.8.1
OS: macOS Sonoma 14.3.1 with Apple M1 chip

Here is part of my DNA raw data downloaded from AncestryDNA:

# This data file generated by 23andMe at: Sat Jun 12 11:23:12 2024
#
# This file contains raw genotype data, including data that is not used in 23andMe reports.
#
# Each line corresponds to a single SNP.  For each SNP, we provide its identifier (an rsid
# or an internal id), its location on the reference human genome, and the genotype call
# oriented with respect to the plus strand on the reference human genome.  We are using reference
# human assembly build 37 (also known as Annotation Release 104).
#
# More information on these variants can be found at http://www.ncbi.nlm.nih.gov/SNP/
#
# rsid  chromosome  position    genotype
rs1 17  21102678    AG
rs2 7   62361768    AA
rs3 18  42763504    CT
...

And here is part of the VCF file produced by to_vcf():

##fileformat=VCFv4.2
##fileDate=20240613
...
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
1   48169282.0  rs112   C   A,T .   .   .   GT  2/1
...

As you can see, "POS" was written as a float (48169282.0) rather than an integer, even though the reader declares the column as np.uint32 in https://github.com/apriha/snps/blob/master/src/snps/io/reader.py:

TWO_ALLELE_DTYPES = {
    "rsid": object,
    "chrom": object,
    "pos": np.uint32,
    "allele1": object,
    "allele2": object,
}
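
For reference, the first line of the error can be reproduced with plain Python, independent of snps or pandas. This is only a minimal sketch of why a strict integer parse rejects the value; it does not show where the ".0" is introduced in to_vcf().

```python
pos_text = "48169282.0"  # POS value copied from the VCF excerpt above

try:
    int(pos_text)  # a strict integer parse rejects the trailing ".0"
except ValueError as err:
    message = str(err)
    print(message)

# Parsing as a float first succeeds, and the value is still a whole number.
as_float = float(pos_text)
print(as_float.is_integer(), int(as_float))
```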

If I change "48169282.0" to "48169282", the error goes away, but I don't know whether a float is the expected data type here. Any insight is appreciated.
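
In the meantime, the manual edit can be automated as a stopgap. The sketch below is a workaround, not a fix: `normalize_vcf_pos` is a hypothetical helper written for this issue (it is not part of snps), and it assumes the VCF data lines are tab-separated with POS in the second column.

```python
import io


def normalize_vcf_pos(lines):
    """Yield VCF lines with whole-number float POS values rewritten as integers.

    Hypothetical helper for illustration; it is not part of snps.
    """
    for line in lines:
        # Meta/header lines start with '#'; pass them and blank lines through.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        body = line.rstrip("\n")
        fields = body.split("\t")
        pos = float(fields[1])  # POS is the second tab-separated column
        if not pos.is_integer():
            raise ValueError(f"non-integral POS: {fields[1]!r}")
        fields[1] = str(int(pos))
        # Re-attach whatever newline the original line ended with.
        yield "\t".join(fields) + line[len(body):]


sample = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE\n"
    "1\t48169282.0\trs112\tC\tA,T\t.\t.\t.\tGT\t2/1\n"
)
cleaned = list(normalize_vcf_pos(sample))
print(cleaned[-1], end="")
```

The same generator can be applied to an open file handle and the result written to a new file before reading it back. It raises on any POS that is not a whole number instead of silently truncating it.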

Thank you.