What about using the 'tarfile' library?
<http://docs.python.org/lib/module-tarfile.html>
Then you can just use the extract() or extractall() methods to unpack the wanted files.
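As a minimal sketch of that suggestion (all file names below are hypothetical, chosen only to make the example self-contained):

```python
import os
import tarfile
import tempfile

# Write a tiny FASTA file, pack it into a .tar.gz, then extract it again.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'seq.fa')
with open(src, 'w') as f:
    f.write('>chr1\nACGT\n')

archive = os.path.join(workdir, 'demo.tar.gz')
with tarfile.open(archive, 'w:gz') as tar:
    tar.add(src, arcname='seq.fa')

outdir = os.path.join(workdir, 'out')
with tarfile.open(archive, 'r:gz') as tar:
    tar.extractall(path=outdir)          # unpack every member
    # or pull out a single member: tar.extract('seq.fa', path=outdir)

extracted = open(os.path.join(outdir, 'seq.fa')).read()
```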
Original comment by bad...@gmail.com
on 21 May 2008 at 4:15
Hi Namshin,
I'm not sure I understand exactly what you mean. Is the problem
1. how to get the contents out of a tar archive file?
or is the problem
2. how to extract valid FASTA sequence from mis-formatted files AFTER they have been successfully extracted from a tar archive?
Since downloader.py does use the Python tarfile module to extract tar archives, I assumed the problem must be #2, but after reading your comment above I'm not so sure. Those "extra characters" look like what you'd see in a tar archive header...
If the problem is #1, it should be easy to fix -- we have the tools for extracting a tar archive! Currently downloader.py should automatically untar any file that ends in .tar, .tgz, .tar.gz, or .tar.bz2. If you have a case where a tar archive is not being untar'ed properly, please give us both:
- the URL for the download file that fails to untar properly
- a stacktrace showing the error message, if any
Also, downloader.py does not use the zlib module, so I don't understand what you mean by "If we want to read that files in python zlib module, there is a problem". Please explain.
Thanks!
Chris
Original comment by cjlee...@gmail.com
on 21 May 2008 at 11:50
Hi Chris,
It is #2. You can log in to biodb.bioinformatics.ucla.edu and check the /Users/deepreds/projects/test directory. Those are the output files generated by my downloader script in /Users/deepreds/projects/src. As you can see, some of the .zip files were not deleted. And if you look at the first line of mm8 and mm9, you can see what is going on in those output files from downloader.py.
Yours,
Namshin Kim
Original comment by deepr...@gmail.com
on 22 May 2008 at 12:08
Hi Namshin,
did you use the singleFile=True option, which instructs the downloader to extract all the data to a single file, as would be required for a FASTA database?
e.g.
s = SourceURL('ftp://hgdownload.cse.ucsc.edu/goldenPath/anoGam1/bigZips/chromFa.zip',
              filename='anoGam1.zip', singleFile=True)
-- Chris
Original comment by cjlee...@gmail.com
on 22 May 2008 at 3:16
Actually, this whole issue was due to tarfile.read() bombing because it was given the wrong mode ('r|gz' instead of 'r:gz'). I didn't realize tarfile.open() had two different sets of modes, listed in two different tables in the documentation! Specifically, tarfile.read() was crashing like this:
>>> filepath = downloader.uncompress_file('chromFa.tar.gz', singleFile=True)
untarring chromFa.tar.gz...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/leec/projects/pygr/build/lib.macosx-10.5-i386-2.5/pygr/downloader.py", line 94, in uncompress_file
    return do_untar(filepath,mode='r|gz',newpath=filepath[:-7],**kwargs)
  File "/Users/leec/projects/pygr/build/lib.macosx-10.5-i386-2.5/pygr/downloader.py", line 69, in do_untar
    copy_to_file(f,ifile)
  File "/Users/leec/projects/pygr/build/lib.macosx-10.5-i386-2.5/pygr/downloader.py", line 11, in copy_to_file
    s = f.read(blocksize)
  File "/sw/lib/python2.5/tarfile.py", line 748, in read
    buf += self.fileobj.read(size - len(buf))
  File "/sw/lib/python2.5/tarfile.py", line 666, in read
    return self.readnormal(size)
  File "/sw/lib/python2.5/tarfile.py", line 673, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "/sw/lib/python2.5/tarfile.py", line 487, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
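The distinction between the two mode families can be demonstrated directly. A small sketch (the archive contents here are made up for illustration): 'r:gz' (colon) opens the archive with random access, while 'r|gz' (pipe) treats it as a forward-only stream, which is why a backwards seek raises StreamError.

```python
import io
import tarfile

# Build a small .tar.gz in memory with two members.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w:gz') as tar:
    for name in ('a.txt', 'b.txt'):
        data = name.encode()
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# 'r:gz' (colon) gives random access: members can be read in any order.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r:gz') as tar:
    second = tar.extractfile('b.txt').read()
    first = tar.extractfile('a.txt').read()   # backwards seek is fine here

# 'r|gz' (pipe) is a forward-only stream: members must be read in archive
# order; jumping back to an earlier member raises tarfile.StreamError.
buf.seek(0)
names = []
with tarfile.open(fileobj=buf, mode='r|gz') as tar:
    for member in tar:
        names.append(tar.extractfile(member).read())
```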
I'm not sure why Namshin missed this error message. In general, when pygr.Data fails to load a resource, try loading it via pygr.Data.getResource(name, debug=True), which will raise all exceptions rather than hiding KeyError and IOError (those are treated as signals that a given resource database cannot provide the resource, so pygr.Data just goes on to try the next resource database). However, tarfile.StreamError is not a subclass of KeyError or IOError, so it should have been raised no matter what.
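The exception policy Chris describes can be sketched in generic form; note that get_resource and the database list below are simplified stand-ins for illustration, not pygr's actual implementation:

```python
def get_resource(name, databases, debug=False):
    """Try each resource database in turn.

    KeyError/IOError normally just mean 'this database cannot provide
    the resource', so we move on to the next one -- unless debug=True,
    which re-raises them. Any other exception type always propagates.
    """
    for db in databases:
        try:
            return db[name]
        except (KeyError, IOError):
            if debug:
                raise          # surface the failure instead of hiding it
    raise KeyError('no database could provide %r' % name)

# Toy usage: an empty dict (fails with KeyError) then one that succeeds.
found = get_resource('mm9', [{}, {'mm9': 'genome-data'}])
```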
You can also test it outside of pygr.Data like this:
from pygr import downloader
import pickle
src = downloader.SourceURL('http://biodb.bioinformatics.ucla.edu/GENOMES/apiMel3/chromFa.tar.gz',
                           'apiMel3.tgz', singleFile=True)
s = pickle.dumps(src)
filepath = pickle.loads(s) # this triggers the download and uncompress
from pygr import seqdb
db = seqdb.BlastDB(filepath)
s = db['Group1']
print len(s) # 25854376
print str(s[:10]) # 'agcctaaccc'
I pushed the fix to the public git repository.
Original comment by cjlee...@gmail.com
on 22 May 2008 at 3:47
I may have missed the error message because the download status messages are TOO LONG. It prints one line per 0.1% of progress. Oops...
Original comment by deepr...@gmail.com
on 22 May 2008 at 5:01
Original comment by mare...@gmail.com
on 21 Feb 2009 at 1:28
Hi Namshin,
please verify the fix to this bug that you reported, and then change its status to Closed. We are now requiring that each fix be verified by someone other than the developer who made the fix.
Thanks!
Chris
Original comment by cjlee...@gmail.com
on 4 Mar 2009 at 8:49
Original comment by mare...@gmail.com
on 13 Mar 2009 at 12:52
Original comment by deepr...@gmail.com
on 22 Mar 2009 at 9:32
Original issue reported on code.google.com by deepr...@gmail.com
on 14 May 2008 at 12:03