Problem loading counts matrices - HPCs tutorial

sejjbia commented 5 years ago

Hi, I am a Python novice and I am interested in using SPRING. With some adaptations I successfully managed to run the pbmc4k tutorial on Python 3.7.3. However, when I try the HPCs tutorial, upon loading the counts matrices at this stage...

for s in samplename:
    print('____', s)

    if os.path.isfile(input_path + s + '.raw_counts.unfiltered.npz'):
        print('Loading from npz file')
        D[s]['E'] = scipy.sparse.load_npz(input_path + s + '.raw_counts.unfiltered.npz')
    else:
        print('Loading from text file')
        E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
        D[s]['E'] = E
        D[s]['cell_bcs'] = cell_bcs
        scipy.sparse.save_npz(input_path + s + '.raw_counts.unfiltered.npz', D[s]['E'], compressed = True)
    print(D[s]['E'].shape)

....I get the following error:

UnicodeDecodeError Traceback (most recent call last)

     10 else:
     11     print('Loading from text file')
---> 12     E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
     13     D[s]['E'] = E
     14     D[s]['cell_bcs'] = cell_bcs

~/SPRING_dev-spring-of-rebirth/data_prep/spring_helper.py in load_text(file_data, delim, load_cell_bcs)
    149     start_column = -1
    150     start_row = -1
--> 151     for row_ix, dat in enumerate(file_data):
    152         dat = dat.strip('\n').split(delim)
    153         if start_row == -1:

~/anaconda3/lib/python3.7/gzip.py in readline(self, size)
    372     def readline(self, size=-1):
    373         self._check_not_closed()
--> 374         return self._buffer.readline(size)
    375
    376

~/anaconda3/lib/python3.7/_compression.py in readinto(self, b)
     66     def readinto(self, b):
     67         with memoryview(b) as view, view.cast("B") as byte_view:
---> 68             data = self.read(len(byte_view))
     69             byte_view[:len(data)] = data
     70         return len(data)

~/anaconda3/lib/python3.7/gzip.py in read(self, size)
    461             # jump to the next member, if there is one.
    462             self._init_read()
--> 463             if not self._read_gzip_header():
    464                 self._size = self._pos
    465                 return b""

~/anaconda3/lib/python3.7/gzip.py in _read_gzip_header(self)
    404
    405     def _read_gzip_header(self):
--> 406         magic = self._fp.read(2)
    407         if magic == b'':
    408             return False

~/anaconda3/lib/python3.7/gzip.py in read(self, size)
     89             self._read = None
     90             return self._buffer[read:] + \
---> 91                    self.file.read(size-self._length+read)
     92
     93     def prepend(self, prepend=b''):

~/anaconda3/lib/python3.7/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I tried several workarounds on the **file_opener** function of **spring_helper.py**, with no luck (I got the latest version, the one with "Added cell_BC export"), as my hypothesis is that this is somehow related to the GzipFile missing the utf-8 encoding. I might be completely wrong and/or missing something totally obvious... Could you help me?

calebweinreb commented 5 years ago

Hi,

Thanks for your message! Can you send the file that you are having trouble opening?

On Fri, Jun 7, 2019 at 1:08 PM sejjbia notifications@github.com wrote:

[...]

sejjbia commented 5 years ago

Thank you for your reply. The files are the ones you provided as samples for this analysis (see below):

P9A.counts.tsv.gz P11A.counts.tsv.gz P11B.counts.tsv.gz P12A.counts.tsv.gz

Note: the files are not corrupt, because I can open them with the following...

if fname.endswith('.gz'):
    os.system('gunzip -c ' + fname + ' > tmp')
    f = open('tmp')

...but I really need to use your code for the barcode extraction and for the whole downstream processing. Let me know.

sejjbia commented 5 years ago

Additional note: using the corresponding four *unfiltered.npz files you provided, it works up until

D[s]['cell_bcs'].shape, D[s]['total_counts'].shape

where I get the following error:

KeyError Traceback (most recent call last)

----> 1 D[s]['cell_bcs'].shape

KeyError: 'cell_bcs'

Again, I need to make it work starting from the *counts.tsv.gz files, also for future projects.

swolock commented 5 years ago

Hi @sejjbia,

I think that Python 3 requires you to read the file in binary mode from the very beginning, which means you need to modify the file_opener() function. See here for a version that works for me (this link also includes most of the same helper functions and more, all Python 3 compatible).
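For readers following along: the byte 0x8b in the original error is the second byte of the gzip magic number (0x1f 0x8b), which suggests the compressed file was being decoded as UTF-8 text before it was decompressed. The following is only a minimal sketch of one way to avoid that, not the linked version; `open_text` is a hypothetical helper name, and the demo file is a throwaway stand-in for the tutorial data.

```python
import gzip
import os
import tempfile


def open_text(filename, encoding='utf-8'):
    """Hypothetical helper: return a text-mode file object, decompressing .gz files."""
    if filename.endswith('.gz'):
        # 'rt' tells gzip to decompress first and decode to str afterwards
        return gzip.open(filename, mode='rt', encoding=encoding)
    return open(filename, mode='r', encoding=encoding)


# Write a tiny throwaway gzip file to demonstrate (not the tutorial data)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.counts.tsv.gz')
with gzip.open(demo_path, 'wt', encoding='utf-8') as f:
    f.write('barcode\tGeneA\tGeneB\n')
    f.write('cell1\t0\t3\n')

# The raw file starts with the gzip magic bytes, which are not valid UTF-8 text
with open(demo_path, 'rb') as f:
    magic = f.read(2)

# Reading through open_text yields ordinary str lines
with open_text(demo_path) as f:
    demo_lines = [line.rstrip('\n').split('\t') for line in f]
```

Because the lines come back as `str`, code that calls `.strip('\n').split('\t')` on each line should not need a separate decoding step with this approach.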

sejjbia commented 5 years ago

Thank you swolock, but no luck yet. I tried the new function you gave me, both by embedding it in my spring_helper.py and by using the whole spring_helper.py version from your link.

This is the function I am using now:

def file_opener(filename):
    '''Open file and return a file object, automatically decompressing zip and gzip

    Arguments
    filename : str
        Name of input file

    Returns
    outData : file object
        (Decompressed) file data
    '''
    if filename.endswith('.gz'):
        fileData = open(filename, 'rb')
        import gzip
        outData = gzip.GzipFile(fileobj = fileData, mode = 'rb')
    elif filename.endswith('.zip'):
        fileData = open(filename, 'rb')
        import zipfile
        zipData = zipfile.ZipFile(fileData, 'r')
        fnClean = filename.strip('/').split('/')[-1][:-4]
        outData = zipData.open(fnClean)
    else:
        outData = open(filename, 'r')
    return outData

Your workaround is similar to one I tried before. However, now I get the following:

_____ P9A
Loading from text file

TypeError Traceback (most recent call last)

     10 else:
     11     print('Loading from text file')
---> 12     E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
     13     D[s]['E'] = E
     14     D[s]['cell_bcs'] = cell_bcs

~/SPRING_dev-spring-of-rebirth/data_prep/spring_helper.py in load_text(file_data, delim, load_cell_bcs)
    164     start_row = -1
    165     for row_ix, dat in enumerate(file_data):
--> 166         dat = dat.strip('\n').split(delim)
    167         if start_row == -1:
    168             current_col = 0

TypeError: a bytes-like object is required, not 'str'

I guess it still doesn't read it in binary mode...

swolock commented 5 years ago

Actually, I think it is now reading in binary mode (dat is a bytes-like object), but you're using a string (delim) to split it.

You need to decode the input data before treating it like a string:

for row_ix, dat in enumerate(file_data):
    if type(dat) == bytes:
        dat = dat.decode('utf-8')
    dat = dat.strip('\n').split(delim)

Or see this example.
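As a self-contained illustration of the decoding step above, here is a hedged sketch that normalizes each line to `str` before splitting, so the same loop works whether the file object yields bytes (binary mode) or str (text mode). `iter_split_rows` is a hypothetical name, not a function from spring_helper.py, and the in-memory streams stand in for real files.

```python
import io


def iter_split_rows(file_data, delim='\t', encoding='utf-8'):
    """Hypothetical helper: yield each line of file_data as a list of str fields."""
    for line in file_data:
        if isinstance(line, bytes):
            # Binary-mode file objects yield bytes; decode before using str methods
            line = line.decode(encoding)
        yield line.rstrip('\r\n').split(delim)


# The same two rows, once as a binary stream and once as a text stream
rows_from_bytes = list(iter_split_rows(io.BytesIO(b'cell1\t0\t3\ncell2\t1\t0\n')))
rows_from_text = list(iter_split_rows(io.StringIO('cell1\t0\t3\ncell2\t1\t0\n')))
```

Using `isinstance(line, bytes)` rather than `type(line) == bytes` is a small stylistic choice; both behave the same for plain bytes objects.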

sejjbia commented 5 years ago

Thank you swolock. It seemed it was going through with your solution, BUT it took several minutes to load the first file, only to end up with this (BTW, I got the very same error when, in one of my attempts, I tried to load pre-decompressed .tsv files):

_____ P9A
Loading from text file

ValueError Traceback (most recent call last)

     10 else:
     11     print('Loading from text file')
---> 12     E,cell_bcs = load_text(file_opener(input_path + s + '.counts.tsv.gz'), delim = '\t', load_cell_bcs=True)
     13     D[s]['E'] = E
     14     D[s]['cell_bcs'] = cell_bcs

ValueError: too many values to unpack (expected 2)

swolock commented 5 years ago

I'm not quite sure why you're getting this particular error, but it's likely there are other changes you need to make to get this Python 3-compatible. For example:

rowdat = np.array(map(float, dat[current_col:]))

becomes:

rowdat = np.array(list(map(float, dat[current_col:])))
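The reason for this change is that in Python 3 `map` returns a lazy iterator rather than a list, so wrapping it directly in `np.array` does not produce a numeric array. A small sketch using only the standard library (the values in `dat` are made up for illustration):

```python
# One parsed row: a barcode followed by counts (illustrative values only)
dat = ['cell1', '0', '3', '12']
current_col = 1

# In Python 3, map() returns a one-shot iterator, not a list
lazy = map(float, dat[current_col:])
is_list = isinstance(lazy, list)

# Materializing it gives the list that Python 2's map() used to return
rowdat = list(map(float, dat[current_col:]))

# The iterator can only be consumed once
first_pass = list(lazy)
second_pass = list(lazy)
```

Passing `rowdat` (or `list(map(...))`) to `np.array` then behaves as the Python 2 code did.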

Unless you're excited about going through this exercise, you're probably better off just using my function load_annotated_text().

Use it like so:

E, cell_bcs, gene_names = hf.load_annotated_text(
    hf.file_opener(input_path + s + '.counts.tsv.gz'),
    delim='\t', 
    read_row_labels=True, 
    read_column_labels=True)

Another thing: in your previous comment, I noticed that you're using spring-of-rebirth. Although we will eventually merge this PR, I think it is still buggy.

sejjbia commented 5 years ago

Thank you swolock, I embedded your last solution in the module and it worked perfectly well!!

for s in samplename:
    print('____', s)

    if os.path.isfile(input_path + s + '.raw_counts.unfiltered.npz'):
        print('Loading from npz file')
        D[s]['E'] = scipy.sparse.load_npz(input_path + s + '.raw_counts.unfiltered.npz')
    else:
        print('Loading from text file')
        E, cell_bcs, gene_names = load_annotated_text(file_opener(input_path + s + '.counts.tsv.gz'), delim='\t', read_row_labels=True, read_column_labels=True)
        D[s]['E'] = E
        D[s]['cell_bcs'] = cell_bcs
        scipy.sparse.save_npz(input_path + s + '.raw_counts.unfiltered.npz', D[s]['E'], compressed = True)
    print(D[s]['E'].shape)

Just a note: once the .npz files are created, I still get the following error at this stage:

D[s]['cell_bcs'].shape, D[s]['total_counts'].shape

KeyError Traceback (most recent call last)

----> 1 D[s]['cell_bcs'].shape

KeyError: 'cell_bcs'

I solved it by removing the generated .npz files from the raw_counts folder and starting over from the .tsv.gz files.
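A likely explanation, judging from the loading loop in this thread: the npz branch only sets D[s]['E'], while D[s]['cell_bcs'] is set only in the text-file branch, so any run that finds a cached .npz never loads the barcodes. One possible way around deleting the cache each time is to store the barcodes in a sidecar file next to it. The sketch below assumes that approach; the `.cell_bcs.txt` file name and both helper functions are hypothetical and not part of SPRING.

```python
import os
import tempfile


def save_barcodes(path, cell_bcs):
    """Hypothetical helper: write one barcode per line."""
    with open(path, 'w') as f:
        f.write('\n'.join(cell_bcs) + '\n')


def load_barcodes(path):
    """Hypothetical helper: read barcodes back, skipping blank lines."""
    with open(path) as f:
        return [line.rstrip('\n') for line in f if line.strip()]


demo_dir = tempfile.mkdtemp()
bc_path = os.path.join(demo_dir, 'P9A.cell_bcs.txt')  # assumed sidecar file name
D = {'P9A': {}}

if os.path.isfile(bc_path):
    # Cached run: restore barcodes alongside the cached matrix
    D['P9A']['cell_bcs'] = load_barcodes(bc_path)
else:
    # First run: stand-in values for the barcodes returned by the text loader
    cell_bcs = ['bc1', 'bc2', 'bc3']
    D['P9A']['cell_bcs'] = cell_bcs
    save_barcodes(bc_path, cell_bcs)

restored = load_barcodes(bc_path)
```

In the real loop, the save would sit next to `scipy.sparse.save_npz` and the load next to `scipy.sparse.load_npz`, so both branches fill in the same keys.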

AllonKleinLab / SPRING_dev

Problem loading counts matrices - HPCs tutorial #13
