Rostlab / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
http://embed.protein.properties
MIT License

Error while trying to embed data #21

Closed rothita closed 3 years ago

rothita commented 3 years ago

Hi, I'm running SeqVec-master/seqvec_embedder.py on some protein data that I have. My DB is split into chunks, and while most chunks were embedded successfully, some jobs failed with the following error:

Traceback (most recent call last):
  File "/home/seqvec/SeqVec//lib/SeqVec-master/seqvec_embedder.py", line 258, in <module>
    main()
  File "/home/seqvec/SeqVec//lib/SeqVec-master/seqvec_embedder.py", line 254, in main
    cpu_flag, max_chars, per_prot, verbose )
  File "/home/seqvec/SeqVec//lib/SeqVec-master/seqvec_embedder.py", line 168, in get_embeddings
    np.savez( emb_path, *emb_dict)
  File "<__array_function__ internals>", line 6, in savez
  File "/home/seqvec/SeqVec/env/lib/python3.6/site-packages/numpy/lib/npyio.py", line 616, in savez
    _savez(file, args, kwds, False)
  File "/home/seqvec/SeqVec/env/lib/python3.6/site-packages/numpy/lib/npyio.py", line 720, in _savez
    with zipf.open(fname, 'w', force_zip64=True) as fid:
  File "/home/software/anaconda3/lib/python3.6/zipfile.py", line 1355, in open
    return self._open_to_write(zinfo, force_zip64=force_zip64)
  File "/home/software/anaconda3/lib/python3.6/zipfile.py", line 1468, in _open_to_write
    self.fp.write(zinfo.FileHeader(zip64))
  File "/home/software/anaconda3/lib/python3.6/zipfile.py", line 427, in FileHeader
    len(filename), len(extra))
struct.error: ushort format requires 0 <= number <= (0x7fff * 2 + 1)

I don't think it's a memory issue, since I tried splitting those chunks into smaller ones and got the same error. Do you have any idea what is causing the error and how to solve it? I didn't manage to find helpful solutions online.

thanks! Itai Roth

rothita commented 3 years ago

Found out what the problem was: very long headers. I solved it by adding a condition in seqvec_embedder.py (line 125):

    for batch_idx, (sample_id, seq) in enumerate(batch): # for each seq in the batch
        if len(sample_id) >= (0x7fff * 2 + 1):
            sample_id = sample_id[0:(0x7fff)]
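For anyone hitting the same error: plain truncation can map two long headers that share a prefix to the same key. Below is a minimal sketch of an alternative, not part of SeqVec itself. The helper name `safe_key` and both constants are hypothetical. It assumes the limit applies to the UTF-8 encoded name (ZIP stores the name length in an unsigned 16-bit field) and accounts for the `.npy` suffix that `np.savez` appends to each key.

```python
import hashlib

# Assumption: ZIP member names are limited to 0xFFFF bytes (0x7fff * 2 + 1),
# which matches the bound reported in the struct.error above.
ZIP_NAME_LIMIT = 0xFFFF
# np.savez stores each array under "<key>.npy" inside the archive.
NPY_SUFFIX = ".npy"


def safe_key(sample_id, limit=ZIP_NAME_LIMIT - len(NPY_SUFFIX)):
    """Return sample_id unchanged if it fits, else a truncated, hashed version.

    Hypothetical helper: the hash suffix keeps long headers that share a
    prefix from collapsing into the same key after truncation.
    """
    raw = sample_id.encode("utf-8")
    if len(raw) <= limit:
        return sample_id
    digest = hashlib.sha1(raw).hexdigest()[:12]
    keep = limit - len(digest) - 1  # leave room for "_" plus the digest
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    head = raw[:keep].decode("utf-8", errors="ignore")
    return f"{head}_{digest}"
```

Applied to each `sample_id` before it is used as a dictionary key, this should keep every archive member name under the limit. Short headers pass through untouched, so existing output for normal inputs would not change.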