bio-guoda / preston

a biodiversity dataset tracker
MIT License

gzipped content is automatically decompressed by embedded httpclient, altering the content requested #249

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

steps to reproduce:

  1. move some preston archives with gzip-compressed content over to an Apache web server
  2. access content with http client / preston client via apache web server

expected: the content is served as is, no decompression / compression

actual: for some reason, the Apache server config automagically detects gzip files and decompresses them

fyi @GregPost-ASU

example:

preston cat\
 --no-cache\
 --remote https://biokic6.rc.asu.edu/preston/gbpln\
 hash://sha256/90346c5616571af8fbacdd8449b5f04197c227ff4e8250443f1e76f649c21ec8\
 | head

yields plain text -

GBPLN987.SEQ        Genetic Sequence Data Bank
                            June 15 2023

                NCBI-GenBank Flat File Release 256.0

                         Plant Sequences (Part 987)

     572 loci,    38290762 bases, from      572 reported sequences

and we know that hash://sha256/90346c5616571af8fbacdd8449b5f04197c227ff4e8250443f1e76f649c21ec8 is a gzipped file.
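
Why this matters can be sketched in a few lines of Python. This is only an illustration, assuming the content identifier is the SHA-256 of the bytes as stored (the gzip stream); the sample text and the helper names `looks_gzipped` and `sha256_hash_uri` are made up for this example and are not part of Preston.

```python
import gzip
import hashlib

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(data: bytes) -> bool:
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def sha256_hash_uri(data: bytes) -> str:
    """Build a hash://sha256/... identifier from raw bytes."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


# Stand-in for a gzipped GenBank flat file; not the real archive content.
plain = b"GBPLN987.SEQ        Genetic Sequence Data Bank\n"
stored = gzip.compress(plain)          # what the archive actually holds
content_id = sha256_hash_uri(stored)   # identifier computed over the gzip bytes

# A client that silently decompresses hands back `plain` instead of `stored`.
received = gzip.decompress(stored)

print(looks_gzipped(stored))                    # True
print(looks_gzipped(received))                  # False
print(sha256_hash_uri(received) == content_id)  # False
```

Under that assumption, plain text coming back for a gzip identifier means the bytes no longer match the hash they were requested by.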

GregoryPost commented 1 year ago

@jhpoelen - I was able to toggle this behavior client side by removing or including the Accept-Encoding header from the request.

Let me know if you can replicate on your end.

jhpoelen commented 1 year ago

For some reason, curl with and without Accept-Encoding header did not make a difference:

$ curl --silent -H "Accept-Encoding: gzip" "https://biokic6.rc.asu.edu/preston/gbpln/4f/06/4f06230f7d9d902ea67708a0e4eb1e5c8120a1f7b30d77260832ad5803a56e17" | gunzip | head -n2
GBPLN370.SEQ        Genetic Sequence Data Bank
                            June 15 2023

and

$ curl --silent "https://biokic6.rc.asu.edu/preston/gbpln/4f/06/4f06230f7d9d902ea67708a0e4eb1e5c8120a1f7b30d77260832ad5803a56e17" | gunzip | head -n2
GBPLN370.SEQ        Genetic Sequence Data Bank
                            June 15 2023

jhpoelen commented 1 year ago

however, I did notice https://github.com/spring-cloud/spring-cloud-netflix/pull/1591/commits/537792c130e8b9b29085de140ed61afdff2934bb , and found that the httpclient on my end was the likely culprit: it automatically decompresses gzip content by default.

before disabling automated decompression -

preston cat\
 --no-cache\
 --remote https://biokic6.rc.asu.edu/preston/gbpln\
 hash://sha256/90346c5616571af8fbacdd8449b5f04197c227ff4e8250443f1e76f649c21ec8\
 | head -n2

yielded

GBPLN987.SEQ        Genetic Sequence Data Bank
                            June 15 2023

after disabling default content decompression:

preston cat\
 --no-cache\
 --remote https://biokic6.rc.asu.edu/preston/gbpln\
 hash://sha256/90346c5616571af8fbacdd8449b5f04197c227ff4e8250443f1e76f649c21ec8\
 | gunzip\
 | head -n2

yielded the following after including gunzip in the pipe:

GBPLN987.SEQ        Genetic Sequence Data Bank
                            June 15 2023
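
The before/after difference can be mimicked with a small Python sketch. It is not Preston's code (Preston is Java), and it assumes the server labels the stored gzip file with `Content-Encoding: gzip`, which this thread does not confirm. The function `read_body` and its `auto_decompress` flag are hypothetical names for this illustration. In Apache HttpClient the corresponding switch is `HttpClientBuilder.disableContentCompression()`; whether the fix here uses exactly that call is not stated in this thread.

```python
import gzip


def read_body(raw_body: bytes, headers: dict, auto_decompress: bool) -> bytes:
    """Return the body as a client would hand it to the caller.

    With auto_decompress=True the client honors `Content-Encoding: gzip`
    and inflates the payload; with False it passes the bytes through.
    """
    encoding = headers.get("Content-Encoding", "").strip().lower()
    if auto_decompress and encoding == "gzip":
        return gzip.decompress(raw_body)
    return raw_body


# Hypothetical response: a stored gzip file that the server also labels
# with Content-Encoding: gzip (an assumption about the server setup).
stored = gzip.compress(b"GBPLN987.SEQ        Genetic Sequence Data Bank\n")
headers = {"Content-Encoding": "gzip"}

before = read_body(stored, headers, auto_decompress=True)   # default behavior
after = read_body(stored, headers, auto_decompress=False)   # decompression off

print(before[:12])      # b'GBPLN987.SEQ'
print(after == stored)  # True
```

With decompression off, the caller gets the stored bytes unchanged and has to add `gunzip` to the pipe, as in the second command above.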

GregoryPost commented 1 year ago

I am curious what you see if you run:

$ curl --silent -H "Accept-Encoding: deflate" "https://biokic6.rc.asu.edu/preston/gbpln/4f/06/4f06230f7d9d902ea67708a0e4eb1e5c8120a1f7b30d77260832ad5803a56e17" | head -n2

jhpoelen commented 1 year ago

$ curl --silent -H "Accept-Encoding: deflate" "https://biokic6.rc.asu.edu/preston/gbpln/4f/06/4f06230f7d9d902ea67708a0e4eb1e5c8120a1f7b30d77260832ad5803a56e17" | head -n2
[binary output: raw gzip-compressed bytes, unreadable as text]

GregoryPost commented 1 year ago

So that returned the raw compressed file - I'm guessing that curl has "Accept-Encoding: gzip,..." as a default unless it is explicitly overridden.

jhpoelen commented 1 year ago

After deployment of preston v0.6.5, the following expected behavior is observed:

preston cat --no-progress --no-cache --remote https://biokic6.rc.asu.edu/preston/gbpln hash://sha256/90346c5616571af8fbacdd8449b5f04197c227ff4e8250443f1e76f649c21ec8 | gunzip | head -n2
GBPLN987.SEQ        Genetic Sequence Data Bank
                            June 15 2023

and

preston cat --no-progress --no-cache --remote https://linker.bio hash://sha256/90346c5616571af8fbacdd8449b5f04197c227ff4e8250443f1e76f649c21ec8 | gunzip | head -n2
GBPLN987.SEQ        Genetic Sequence Data Bank
                            June 15 2023

where linker.bio is proxying the BioKIC server using Preston v0.6.5 .

@GregPost-ASU thanks for being patient in helping to troubleshoot this funky issue.

I still don't quite understand why I wasn't able to reproduce the curl commands. Am closing issue for now, as the desired behavior is observed after changes were applied.

Please do feel free to comment / re-open if you feel more work is needed.