Cloufield / gwaslab

A Python package for handling and visualizing GWAS summary statistics. https://cloufield.github.io/gwaslab/
GNU General Public License v3.0
118 stars 22 forks source link

Assigning rsID error #51

Open lphotoimpact opened 10 months ago

lphotoimpact commented 10 months ago

I followed the website tutorial and downloaded the following files: gl.download_ref("1kg_dbsnp151_hg19_auto") GCF_000001405.25.gz GCF_000001405.25.gz.tbi

I followed the instructions step by step, but I kept encountering errors in the end.

# download ref SNPID-rsID table first
gl.download_ref("1kg_dbsnp151_hg19_auto")

# if you want to annotate as much as possible. Please download the very large dbSNP vcf file.

mysumstats = gl.Sumstats("t2d_bbj.txt.gz",
             snpid="SNP",
             chrom="CHR",
             pos="POS",
             ea="ALT",
             nea="REF",
             neaf="Frq",
             beta="BETA",
             se="SE",
             p="P",
             direction="Dir",
             n="N",
             nrows=10000)

# run basic_check first
mysumstats.basic_check() 

# if you SNPID is like 1:725932_G_A , you can use fix_id to fix the separator.
mysumstats.fix_id(fixsep=True)
# rsID annotation
mysumstats.assign_rsid( n_cores = 2,
                        ref_rsid_tsv = gl.get_path("1kg_dbsnp151_hg19_auto"),
                        ref_rsid_vcf ="/home/green/GCF_000001405.25.gz",
                        chr_dict = gl.get_number_to_NC(build="19"))
Tue Sep  5 11:37:41 2023 Start to annotate rsID based on chromosome and position information...
Tue Sep  5 11:37:41 2023  -Current Dataframe shape : 10000  x  13
Tue Sep  5 11:37:41 2023  -SNPID-rsID text file: /home/green/.gwaslab/1kg_dbsnp151_hg19_auto.txt.gz
Tue Sep  5 11:37:41 2023  -10000 rsID could be possibly fixed...
Tue Sep  5 11:37:41 2023  -Setting block size:  5000000
Tue Sep  5 11:37:41 2023  -Loading block: 0   
---------------------------------------------------------------------------
EOFError                                  Traceback (most recent call last)
Cell In[38], line 2
      1 # rsID annotation
----> 2 mysumstats.assign_rsid( n_cores = 2,
      3                         ref_rsid_tsv = gl.get_path("1kg_dbsnp151_hg19_auto"),
      4                         ref_rsid_vcf ="/home/green/GCF_000001405.25.gz",
      5                         chr_dict = gl.get_number_to_NC(build="19"))

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/gwaslab/Sumstats.py:366, in Sumstats.assign_rsid(self, ref_rsid_tsv, ref_rsid_vcf, **args)
    361 def assign_rsid(self,
    362                 ref_rsid_tsv=None,
    363                 ref_rsid_vcf=None,
    364                 **args):
    365     if ref_rsid_tsv is not None:
--> 366         self.data = parallelizeassignrsid(self.data,path=ref_rsid_tsv,ref_mode="tsv",log=self.log,**args)
    367         self.meta["gwaslab"]["references"]["ref_rsid_tsv"] = ref_rsid_tsv
    368     if ref_rsid_vcf is not None:

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/gwaslab/retrievedata.py:380, in parallelizeassignrsid(sumstats, path, ref_mode, snpid, rsid, chr, pos, ref, alt, status, n_cores, chunksize, ref_snpid, ref_rsid, overwrite, verbose, log, chr_dict)
    378 if verbose:  log.write(" -Setting block size: ",chunksize)
    379 if verbose:  log.write(" -Loading block: ",end="")     
--> 380 for i,dic in enumerate(dic_chuncks):
    381     gc.collect()
    382     log.write(i," ",end=" ",show_time=False)  

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/io/parsers/readers.py:1186, in TextFileReader.__next__(self)
   1184 def __next__(self):
   1185     try:
-> 1186         return self.get_chunk()
   1187     except StopIteration:
   1188         self.close()

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/io/parsers/readers.py:1283, in TextFileReader.get_chunk(self, size)
   1281         raise StopIteration
   1282     size = min(size, self.nrows - self._currow)
-> 1283 return self.read(nrows=size)

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/io/parsers/readers.py:1253, in TextFileReader.read(self, nrows)
   1251 nrows = validate_integer("nrows", nrows)
   1252 try:
-> 1253     index, columns, col_dict = self._engine.read(nrows)
   1254 except Exception:
   1255     self.close()

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py:225, in CParserWrapper.read(self, nrows)
    223 try:
    224     if self.low_memory:
--> 225         chunks = self._reader.read_low_memory(nrows)
    226         # destructive to chunks
    227         data = _concatenate_chunks(chunks)

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:817, in pandas._libs.parsers.TextReader.read_low_memory()

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:861, in pandas._libs.parsers.TextReader._read_rows()

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:847, in pandas._libs.parsers.TextReader._tokenize_rows()

File ~/anaconda3/envs/python38/lib/python3.8/site-packages/pandas/_libs/parsers.pyx:1952, in pandas._libs.parsers.raise_parser_error()

File ~/anaconda3/envs/python38/lib/python3.8/_compression.py:68, in DecompressReader.readinto(self, b)
     66 def readinto(self, b):
     67     with memoryview(b) as view, view.cast("B") as byte_view:
---> 68         data = self.read(len(byte_view))
     69         byte_view[:len(data)] = data
     70     return len(data)

File ~/anaconda3/envs/python38/lib/python3.8/gzip.py:498, in _GzipReader.read(self, size)
    496         break
    497     if buf == b"":
--> 498         raise EOFError("Compressed file ended before the "
    499                        "end-of-stream marker was reached")
    501 self._add_read_data( uncompress )
    502 self._pos += len(uncompress)

EOFError: Compressed file ended before the end-of-stream marker was reached
lphotoimpact commented 10 months ago

Upon further investigation, When I tried to execute gl.download_ref("1kg_dbsnp151_hg19_auto") the file 1kg_dbsnp151_hg19_auto already existed It reported an MD5 checksum failure But, it didn't attempt to download the file again, and the command ended.

I manually checked the 1kg_dbsnp151_hg19_auto file and found it to be too small. I had to resort to manually executing gl.check_available_ref() Download it from the Dropbox path and overwrite the existing file.

Cloufield commented 10 months ago

Hi, Thanks a lot for your feedback. Indeed, I need to add an option for redownloading overwriting the existing files when md5sum check fails. I will implement this in the next version.