flo-compbio / pyaffy

pyAffy: Processing raw data from Affymetrix expression microarrays in Python.
GNU General Public License v3.0

Seems like CEL files must be gzipped beforehand? Why is that? #12

Open alexlenail opened 7 years ago

alexlenail commented 7 years ago
[2017-08-17 15:28:53] INFO: Parsing CDF file.
[2017-08-17 15:28:56] INFO: CDF file parsing time: 3.70 s
[2017-08-17 15:28:56] INFO: CDF array design name: b'HTA-2_0.r1.gene'
[2017-08-17 15:28:56] INFO: CDF rows / columns: 2572 x 2680
[2017-08-17 15:28:56] INFO: Parsing CEL files...
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
<ipython-input-17-afc2e0487185> in <module>()
      3 cdf_file = "/Users/alex/Desktop/KVH/data/CELfiles/HTA-2_0.r1.gene.cdf"
      4 
----> 5 genes, samples, X = rma(cdf_file, sample_cel_files)

~/Desktop/KVH/venv/lib/python3.6/site-packages/pyaffy/process.py in rma(cdf_file, sample_cel_files, pm_probes_only, bg_correct, quantile_normalize, medianpolish)
    139         logger.debug('Parsing CEL file for sample "%s": %s', sample, cel_file)
    140         samples.append(sample)
--> 141         y = parse_cel(cel_file)
    142         Y[:,j] = y[pm_sel]
    143     sub_logger.setLevel(logging.NOTSET)

pyaffy/celparser.pyx in pyaffy.celparser.parse_cel()

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/gzip.py in read(self, size)
    274             import errno
    275             raise OSError(errno.EBADF, "read() on write-only GzipFile object")
--> 276         return self._buffer.read(size)
    277 
    278     def read1(self, size=-1):

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/gzip.py in read(self, size)
    461                 # jump to the next member, if there is one.
    462                 self._init_read()
--> 463                 if not self._read_gzip_header():
    464                     self._size = self._pos
    465                     return b""

/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/gzip.py in _read_gzip_header(self)
    409 
    410         if magic != b'\037\213':
--> 411             raise OSError('Not a gzipped file (%r)' % magic)
    412 
    413         (method, flag,

OSError: Not a gzipped file (b'\x00\x00')

peastman commented 6 years ago

I'm running into the same problem. I believe it's caused by these lines from try_open_gzip() in celparser.pyx:

    try:
        fh = gzip.open(path)
        fh.read(1)
    except IOError:
        pass

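The handler checks for IOError, while the gzip reader raises an OSError. Even when the exception is caught, the pass statement leaves fh pointing at the unusable gzip handle, so a later read fails with the same error.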
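
A minimal sketch of this failure mode in plain Python, using a throwaway temporary file as a stand-in for an uncompressed CEL file (the file contents and variable names are made up for illustration). One caveat: on Python 3.3 and later, IOError is an alias of OSError, so in plain Python the except clause does run; whether the compiled Cython module behaves identically is an assumption here. Either way, the handle is still non-None after the handler, and the next read raises again.

```python
import gzip
import os
import tempfile

# Write a small file that is NOT gzip-compressed (stand-in for a raw CEL file).
with tempfile.NamedTemporaryFile(suffix=".CEL", delete=False) as tmp:
    tmp.write(b"\x40\x00\x00\x00" + b"\x00" * 16)
    path = tmp.name

fh = None
caught = False
try:
    fh = gzip.open(path)  # opening succeeds; the gzip header is only checked on read
    fh.read(1)
except IOError:           # alias of OSError on Python 3.3+
    caught = True         # the quoted handler only does `pass` here

# fh still refers to the gzip handle, so reading from it fails a second time.
second_error = None
try:
    fh.read(1)
except OSError as exc:
    second_error = str(exc)

fh.close()
os.remove(path)
print(caught, fh is not None, second_error)
```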

peastman commented 6 years ago

Here is a working version of try_open_gzip():

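import gzip

def try_open_gzip(path):
    """Return a gzip file handle for path, or None if path is not gzipped."""
    fh = None
    try:
        fh = gzip.open(path)
        fh.read(1)  # raises OSError if the gzip magic bytes are missing
    except (IOError, OSError):
        if fh is not None:
            fh.close()  # do not leak the unusable handle
        fh = None
    else:
        fh.seek(0)  # rewind so the caller reads from the start

    return fh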

There are two changes: the except clause also catches OSError, and the error handler resets fh to None so the caller never receives an unusable gzip handle.
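
As an alternative sketch (not what celparser.pyx does), the file type can be detected up front by checking the two gzip magic bytes that gzip.py itself compares against, instead of relying on an exception. The helper name open_maybe_gzip and the sample file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # same bytes as b'\037\213' in gzip.py

def open_maybe_gzip(path):
    """Hypothetical helper: open path for binary reading, gzipped or not."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")

# Demonstration with made-up data written once raw and once gzipped.
payload = b"\x40\x00\x00\x00example"
with tempfile.TemporaryDirectory() as tmpdir:
    raw_path = os.path.join(tmpdir, "sample.CEL")
    gz_path = raw_path + ".gz"
    with open(raw_path, "wb") as f:
        f.write(payload)
    with gzip.open(gz_path, "wb") as f:
        f.write(payload)

    with open_maybe_gzip(raw_path) as fh:
        raw_data = fh.read()
    with open_maybe_gzip(gz_path) as fh:
        gz_data = fh.read()

print(raw_data == gz_data)
```

Both handles yield the same decompressed bytes, so downstream parsing code would not need to know which kind of file it was given.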