apriha / snps

tools for reading, writing, merging, and remapping SNPs
BSD 3-Clause "New" or "Revised" License

Output 23andMe from snps cannot be read by snps #54

Closed PhilPalmer closed 4 years ago

PhilPalmer commented 4 years ago

Suppose you have a 23andMe-like file produced by snps, e.g.:

# Generated by snps v0.1.1.post243+ge018ea4, https://pypi.org/project/snps/
# Generated at 2020-01-20 22:55:53 UTC
# Source(s): 23andMe
# Assembly: GRCh37
# Phased: False
# SNPs: 638463
# Chromosomes: 1-22, X, Y, MT
rsid    chromosome  position    genotype
rs548049170 1   69869   TT
rs13328684  1   74792   --
rs9283150   1   565508  AA
i713426 1   726912  --
rs116587930 1   727841  GG
rs3131972   1   752721  AG
rs12184325  1   754105  CC
rs12567639  1   756268  AA

If you then try to read this file back into snps, you get the error "Too many columns specified: expected 4 and found 1":

  Traceback (most recent call last):
    File "/opt/conda/envs/common-latest-geno/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
      return self._engine.get_loc(key)
    File "pandas/_libs/index.pyx", line 107, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/index.pyx", line 131, in pandas._libs.index.IndexEngine.get_loc
    File "pandas/_libs/hashtable_class_helper.pxi", line 1607, in pandas._libs.hashtable.PyObjectHashTable.get_item
    File "pandas/_libs/hashtable_class_helper.pxi", line 1614, in pandas._libs.hashtable.PyObjectHashTable.get_item
  KeyError: 'genotype'

This was discovered by @willgdjones who noted that the issue occurs because:

It has ‘snps’ in the header, so the snps package thinks that it is from lineage dna, which uses comma-separated rather than tab-separated values.
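
To illustrate the effect of picking the wrong separator, here is a minimal sketch using plain pandas. It does not reproduce the snps parser itself; the `SAMPLE` text and the variable names are made up for illustration. Parsing tab-separated rows with a comma separator leaves a single column, so a later lookup of 'genotype' cannot succeed:

```python
import io

import pandas as pd

# A few rows in the tab-separated layout shown in this issue (illustrative data).
SAMPLE = (
    "# Source(s): 23andMe\n"
    "# Chromosomes: 1-22, X, Y, MT\n"
    "rsid\tchromosome\tposition\tgenotype\n"
    "rs548049170\t1\t69869\tTT\n"
    "rs3131972\t1\t752721\tAG\n"
)

# Wrong separator: the data rows contain no commas, so each row is one field
# and the whole header line becomes a single column name.
wrong = pd.read_csv(io.StringIO(SAMPLE), comment="#", sep=",")

# Correct separator: the four expected columns are recovered.
right = pd.read_csv(io.StringIO(SAMPLE), comment="#", sep="\t")

print(wrong.shape[1], "genotype" in wrong.columns)
print(list(right.columns))
```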

How can this issue be fixed? As long as it's not too difficult or time-consuming, I'd be happy to work on a PR to fix it.

apriha commented 4 years ago

Hi Phil, thanks for the question. To fix the issue, we could update the read_snps_csv parser to catch pandas.errors.ParserError and then parse the file with tab as the separator. Alternatively, the header could be inspected to determine the separator.
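
Both options could look roughly like the sketch below. It is only an illustration in plain pandas, not the actual read_snps_csv code: the function names `read_with_fallback` and `detect_separator`, the `EXPECTED_COLUMNS` list, and the sample strings are assumptions. Whether pandas raises `pandas.errors.ParserError` for a wrong separator depends on how `read_csv` is called, so the fallback here also checks the resulting column names explicitly:

```python
import io

import pandas as pd

# Hypothetical column layout, taken from the file shown in this issue.
EXPECTED_COLUMNS = ["rsid", "chromosome", "position", "genotype"]


def detect_separator(text):
    """Return a tab if the first non-comment line contains one, else a comma."""
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            return "\t" if "\t" in line else ","
    return ","


def read_with_fallback(text):
    """Parse as comma-separated first, then retry with tabs if that fails."""
    try:
        df = pd.read_csv(io.StringIO(text), comment="#", sep=",", dtype=str)
    except pd.errors.ParserError:
        df = None
    # A wrong separator may not raise at all, so also verify the columns.
    if df is None or list(df.columns) != EXPECTED_COLUMNS:
        df = pd.read_csv(io.StringIO(text), comment="#", sep="\t", dtype=str)
    return df


TAB_TEXT = (
    "# Chromosomes: 1-22, X, Y, MT\n"
    "rsid\tchromosome\tposition\tgenotype\n"
    "rs548049170\t1\t69869\tTT\n"
)
CSV_TEXT = (
    "# Chromosomes: 1-22, X, Y, MT\n"
    "rsid,chromosome,position,genotype\n"
    "rs548049170,1,69869,TT\n"
)

print(list(read_with_fallback(TAB_TEXT).columns))
print(repr(detect_separator(TAB_TEXT)), repr(detect_separator(CSV_TEXT)))
```

The header-inspection variant avoids parsing the file twice, while the fallback variant keeps the existing comma path unchanged for files that already parse correctly.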

Relatedly, tab does seem like a better default separator when generating these files, since the comments include commas...
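
As a rough sketch of that idea (plain pandas, with made-up data and variable names, not the snps writer itself), a file can be written with comment lines that contain commas and tab-separated rows, then read back unchanged:

```python
import io

import pandas as pd

# Illustrative rows; all values are kept as strings so they round-trip exactly.
df = pd.DataFrame(
    {
        "rsid": ["rs548049170", "rs13328684"],
        "chromosome": ["1", "1"],
        "position": ["69869", "74792"],
        "genotype": ["TT", "--"],
    }
)

buffer = io.StringIO()
buffer.write("# Source(s): 23andMe\n")
buffer.write("# Chromosomes: 1-22, X, Y, MT\n")  # comment line containing commas
df.to_csv(buffer, sep="\t", index=False)  # tab-separated data rows

text = buffer.getvalue()

# Reading back with the same separator; full-line comments are skipped.
back = pd.read_csv(io.StringIO(text), comment="#", sep="\t", dtype=str)
print(back)
```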