Open MattBrauer opened 1 year ago
Hi @MattBrauer, can you help us replicate your scenario? You mention using pandas' read_csv method, but I get an error when using bgzip as the compression value:
ValueError: Unrecognized compression type: bgzip
Valid compression types are ['infer', None, 'bz2', 'gzip', 'xz', 'zip', 'zstd']
Hello. The bgzip format is compatible with gzip, so pandas' read_csv can be used successfully with gzip compression. I'd like it to be possible to do the same with awswrangler.
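For anyone trying to replicate this without a real genomics file, here is a minimal sketch of why a bgzip file is readable as gzip. It hand-builds a few BGZF blocks (each one an ordinary gzip member carrying a "BC" extra subfield that records the block size) and decompresses the concatenation with Python's standard gzip module. The helper name bgzf_block and the sample rows are made up for illustration, and this is not how the real bgzip tool is implemented.

```python
import gzip
import struct
import zlib


def bgzf_block(payload: bytes) -> bytes:
    """Build one BGZF block (hypothetical helper, for illustration only)."""
    # BGZF stores raw deflate data, hence the negative window size.
    compressor = zlib.compressobj(6, zlib.DEFLATED, -15)
    deflated = compressor.compress(payload) + compressor.flush()

    # 12-byte gzip header + 6-byte extra field + data + 8-byte trailer.
    block_size = 12 + 6 + len(deflated) + 8

    # Magic bytes, CM=8 (deflate), FLG=4 (FEXTRA), MTIME=0, XFL=0,
    # OS=255 (unknown), XLEN=6.
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, 4, 0, 0, 0xFF, 6)
    # Extra subfield "BC" of length 2, holding the total block size minus 1.
    extra = struct.pack("<BBHH", ord("B"), ord("C"), 2, block_size - 1)
    # Standard gzip trailer: CRC32 and length of the uncompressed payload.
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + extra + deflated + trailer


# Two data blocks followed by the empty block used as an end-of-file marker.
bgzf_bytes = (
    bgzf_block(b"chrom\tpos\n")
    + bgzf_block(b"1\t100\n")
    + bgzf_block(b"")
)

# The standard gzip module reads every member in turn and skips the
# extra field, so the blocks come back as one continuous stream.
decoded = gzip.decompress(bgzf_bytes)
print(decoded.decode())
```

Because each block is a complete gzip member, a reader that only handles the first member of a multi-member stream would return just the first block. That is one possible explanation for truncated reads, but it is an assumption here, not something confirmed in this thread.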
OK, so today you are using pandas.read_csv(.., compression="gzip") after copying the file to local storage?
That is correct. That's a reasonable workaround, but it would be nice to have wr.s3.read_csv behave the same way, if possible.
Thanks for the attention on this. If necessary I can get you a file that demonstrates the problem, but since they contain restricted data I'd have to do some obfuscation work first.
Geneticists and genomics scientists cannot natively get data from S3 when it's compressed using the most common format for that data type
Tabular data in genetics and genomics is often compressed using the so-called "tabix" format. This is a block-compressed, gzip-compatible format that allows indexing into a file by genome position. Although the file suffix is .gz, and the files can be decompressed locally with gunzip (or read via pandas.read_csv with compression="gzip"), AWS Data Wrangler cannot read these data from S3 (via awswrangler.s3.read_csv).
What I'd like to see
I'd like to be able to get data via awswrangler.s3.read_csv(S3_uri, compression="bgzip"). While gunzip will decompress the file on local storage, bgzip -d is actually the preferred method. I believe that subtle differences between gzip and bgzip corrupt the reading of the data from S3.
Alternatives
Transferring a file to local storage (aws s3 cp <uri> ./) and then using the pandas read_csv function works, but involves an extra copy step.
It is likely that the genomics community's use of bgzip for tabix files is idiosyncratic, but there is a large and growing number of users in this space. Adding support for this compression method would support the genetics field and the biopharma industry.
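As a possible variation on the workaround above, the copy to local disk could be replaced by decompressing the object's bytes in memory. The sketch below uses only the standard library so that it runs anywhere; the function name read_tsv_from_gzip_bytes is hypothetical, and the sample bytes are a stand-in for whatever an S3 client returns. pandas.read_csv on an in-memory buffer with compression="gzip" would play the same role as the csv reader here.

```python
import csv
import gzip
import io


def read_tsv_from_gzip_bytes(raw: bytes) -> list[dict[str, str]]:
    """Decompress gzip-compatible bytes in memory and parse them as TSV.

    Hypothetical helper: `raw` is assumed to be the full content of a
    .gz object that has already been fetched from S3.
    """
    # gzip.open accepts a file object and, in text mode, yields decoded
    # lines; multi-member streams are read through to the end.
    with gzip.open(io.BytesIO(raw), mode="rt", newline="") as text:
        return list(csv.DictReader(text, delimiter="\t"))


# Stand-in for downloaded bytes: two concatenated gzip members, which
# mimics the multi-block layout of a bgzip file (made-up sample rows).
sample = gzip.compress(b"chrom\tpos\n1\t100\n") + gzip.compress(b"2\t200\n")

rows = read_tsv_from_gzip_bytes(sample)
for row in rows:
    print(row["chrom"], row["pos"])
```

In practice the bytes would have to come from an S3 client, for example boto3's get_object followed by reading the response body. This still downloads the whole object before parsing, so it removes the temporary file rather than the transfer itself.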