MetricsGrimoire / MailingListStats

Mailing List Stats is a command line based tool used to analyze mboxes
http://metricsgrimoire.github.com/MailingListStats/
GNU General Public License v2.0

Error when analyzing rdo-mailing list #11

Closed dicortazar closed 10 years ago

dicortazar commented 10 years ago

When running mlstats against this mailing list: https://www.redhat.com/archives/rdo-list, the tables are not filled.

One odd thing here is that the directory that should contain the compressed files contains uncompressed files that still carry the original names. That is, the file names still end in .gz, but the "file" command reports them as ASCII.

Output of the tool:

[13:11:14] /usr/local/bin/mlstats --no-report --db-user="xxx" --db-password="" --db-name="database" --db-admin-user="root" --db-admin-password="" "https://www.redhat.com/archives/rdo-list"
Already downloaded https://www.redhat.com/archives/rdo-list/2013-April.txt.gz
Already downloaded https://www.redhat.com/archives/rdo-list/2013-May.txt.gz
Already downloaded https://www.redhat.com/archives/rdo-list/2013-June.txt.gz
Already downloaded https://www.redhat.com/archives/rdo-list/2013-July.txt.gz
Already downloaded https://www.redhat.com/archives/rdo-list/2013-August.txt.gz
Already downloaded https://www.redhat.com/archives/rdo-list/2013-September.txt.gz
Already downloaded https://www.redhat.com/archives/rdo-list/2013-October.txt.gz
Found substring 2013-November in URL https://www.redhat.com/archives/rdo-list/2013-November.txt.gz...
Retrieving https://www.redhat.com/archives/rdo-list/2013-November.txt.gz...
Unknown URL or directory: https://www.redhat.com/archives/rdo-list. Skipping.
0 messages analyzed
0 messages stored in database acs_mlstats_redhat_rdo_2437
0 messages ignored by the parser

In addition, after removing the try/except at https://github.com/MetricsGrimoire/MailingListStats/blob/master/pymlstats/main.py#L300 the new error is the following:

Traceback (most recent call last):
  File "/usr/local/bin/mlstats", line 37, in <module>
    pymlstats.start()
  File "/usr/local/lib/python2.7/dist-packages/pymlstats/__init__.py", line 154, in start
    web_user, web_password)
  File "/usr/local/lib/python2.7/dist-packages/pymlstats/main.py", line 145, in __init__
    t,s,np = self.__analyze_mailing_list(mailing_list)
  File "/usr/local/lib/python2.7/dist-packages/pymlstats/main.py", line 297, in __analyze_mailing_list
    archives_to_analyze = self.__set_archives_to_analyze(mailing_list, archives)
  File "/usr/local/lib/python2.7/dist-packages/pymlstats/main.py", line 433, in __set_archives_to_analyze
    mailing_list.mbox_dir)
  File "/usr/local/lib/python2.7/dist-packages/pymlstats/utils.py", line 137, in uncompress_file
    files = [extractor.gzExtraction(new_filepath)]
  File "/usr/local/lib/python2.7/dist-packages/pymlstats/fileextractor.py", line 71, in gzExtraction
    outputfileobj.write(gzipfile.read())
  File "/usr/lib/python2.7/gzip.py", line 254, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 296, in _read
    self._read_gzip_header()
  File "/usr/lib/python2.7/gzip.py", line 190, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file

dicortazar commented 10 years ago

mm sorry for the too-generic-title of the issue u_u

gpoo commented 10 years ago

It is not odd if you consider that the content can be uncompressed on the fly (i.e. if you open the file with a web browser).

I think the issue is in: https://github.com/MetricsGrimoire/MailingListStats/blob/master/pymlstats/utils.py#L77

If the file comes uncompressed and ends in .gz, it tries to compress it; to uncompress it later. However, there might be more corner cases.
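To illustrate the corner case described here, the following is a minimal sketch of a reader that tolerates a plain-text file carrying a .gz name. It is written in Python 3 rather than the Python 2.7 shown in the traceback, and `read_maybe_gzipped` is a hypothetical helper, not a function in mlstats:

```python
import gzip


def read_maybe_gzipped(path):
    """Return the file's bytes, decompressing only if it really is gzip."""
    try:
        with gzip.open(path, "rb") as fh:
            return fh.read()
    except OSError:
        # gzip raises OSError (BadGzipFile on Python 3.8+) when the gzip
        # header is missing, so treat the file as already uncompressed.
        with open(path, "rb") as fh:
            return fh.read()
```

A truncated gzip stream raises EOFError instead of OSError, so this fallback would not hide a genuinely incomplete download.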

sduenas commented 10 years ago

I think it's not a bug in mlstats. When I try to download the files from that mailing list using other tools such as curl or wget, the server closes the connection and I get a corrupted file.

Using curl:

LANG=C curl https://www.redhat.com/archives/rdo-list/2013-September.txt.gz > mbox
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
 99  2550   99  2549    0     0   3383      0 --:--:-- --:--:-- --:--:--  3380
curl: (18) transfer closed with 1 bytes remaining to read

The downloaded file is in plain text and doesn't contain all the messages for that month.

When downloading the file with wget, it has to retry once and the file is also corrupted. The first part of the file is in plain text and the rest seems to be compressed.

sduenas@Guybrush:/tmp/rdo$ LANG=C wget https://www.redhat.com/archives/rdo-list/2013-September.txt.gz
--2013-11-28 19:11:44--  https://www.redhat.com/archives/rdo-list/2013-September.txt.gz
Resolving www.redhat.com (www.redhat.com)... 2.20.215.214
Connecting to www.redhat.com (www.redhat.com)|2.20.215.214|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2550 (2.5K) [application/x-gzip]
Saving to: '2013-September.txt.gz'

99% [=====================================================================================================================================================================================> ] 2,549       --.-K/s   in 0s      

2013-11-28 19:11:45 (470 MB/s) - Connection closed at byte 2549. Retrying.

--2013-11-28 19:11:46--  (try: 2)  https://www.redhat.com/archives/rdo-list/2013-September.txt.gz
Connecting to www.redhat.com (www.redhat.com)|2.20.215.214|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 186553 (182K), 184004 (180K) remaining [application/x-gzip]
Saving to: '2013-September.txt.gz'

100%[++====================================================================================================================================================================================>] 186,553      701KB/s   in 0.3s   

2013-11-28 19:11:47 (701 KB/s) - '2013-September.txt.gz' saved [186553/186553]

I also tried with the mboxes from https://www.redhat.com/archives/rhos-list/ getting the same weird behaviour.

I'm gonna close and label it as invalid.

gpoo commented 10 years ago

Shouldn't mlstats trigger an error pointing to the request rather than trying to open a corrupt file?
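One way such an error could be raised is to compare the number of bytes received with the Content-Length the server announced, which is the mismatch curl reported above (2549 of 2550 bytes). This is only a sketch in Python 3; `IncompleteDownload` and `check_complete` are hypothetical names and do not come from mlstats:

```python
class IncompleteDownload(Exception):
    """Raised when fewer (or more) bytes arrive than the server announced."""


def check_complete(url, body, content_length):
    """Return `body` unchanged, or raise if its size contradicts the header.

    `content_length` is the raw Content-Length header value (a string), or
    None when the server did not send one and nothing can be verified.
    """
    if content_length is None:
        return body
    expected = int(content_length)
    if len(body) != expected:
        raise IncompleteDownload(
            "%s: got %d of %d bytes" % (url, len(body), expected))
    return body
```

The check has to run on the bytes as they came off the wire, before any decompression, because Content-Length describes the encoded body.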

sduenas commented 10 years ago

@gpoo Totally agree. I will try to fix it ASAP.

justinclift commented 10 years ago

As a data point, this works:

$ curl -r 0- -O http://www.redhat.com/archives/rdo-list/2013-April.txt.gz

The "-r 0-" gets curl to request the full range of the file, and does it a bit differently such that the complete + correct file contents are sent. No idea why the weird behaviour occurs without it though.

justinclift commented 10 years ago

@berrange has pointed out that this works too:

$ wget --header="accept-encoding: gzip"  http://www.redhat.com/archives/rdo-list/2013-April.txt.gz
gpoo commented 10 years ago

Thanks @justinclift for providing more information. I re-opened the bug.

dicortazar commented 10 years ago

With the change, the RDO mailing list still fails.

Please check with https://www.redhat.com/archives/rdo-list/

However, adding that header at https://github.com/MetricsGrimoire/MailingListStats/blob/master/pymlstats/utils.py#L106 makes that work.

But if you add that header there, a typical mailing list that otherwise works starts failing, such as http://lists.openstack.org/pipermail/foundation/

justinclift commented 10 years ago

When you say "with the change", what do you mean?

dicortazar commented 10 years ago

Oh sorry @justinclift, I was pointing to the commit [1] by @gpoo fixing this issue.

With that change to the source code, I still reproduce this error (#11).

[1] https://github.com/MetricsGrimoire/MailingListStats/commit/9015abd83a96d2eaf9492379889a40315cfb34b2

justinclift commented 10 years ago

No worries. If it's any help, this command retrieves all of the RDO compressed mbox files 100% correctly (for me anyway):

$ for list in 2014-May 2014-April 2014-March 2014-February 2014-January 2013-December 2013-November 2013-October 2013-September 2013-August 2013-July 2013-June 2013-May 2013-April; do curl -O -r 0- http://www.redhat.com/archives/rdo-list/$list.txt.gz; done

Tested using both curl 7.21.4 and 7.36.0. Both worked fine.

justinclift commented 10 years ago

I guess you'd need to find the urllib2 way to request a range. :frowning:
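As a rough illustration, the equivalent of curl's "-r 0-" is a "Range: bytes=0-" request header. The sketch below uses Python 3's urllib.request (the successor of urllib2) and only builds the request without contacting any server. `build_range_request` is a hypothetical helper, and whether this header is enough to make the Red Hat server behave is an assumption based on the curl result above:

```python
import urllib.request


def build_range_request(url, first_byte=0):
    """Build a request asking for the file from `first_byte` to the end."""
    headers = {"Range": "bytes=%d-" % first_byte}
    return urllib.request.Request(url, headers=headers)


req = build_range_request(
    "http://www.redhat.com/archives/rdo-list/2013-April.txt.gz")
print(req.get_header("Range"))  # bytes=0-
```

A server that honours the range normally answers 206 Partial Content instead of 200, which urlopen also treats as success.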

dicortazar commented 10 years ago

Thanks @justinclift . However, we need something more generic. If a new archive is added (for instance new months), we would need to automatically update that list and not in a manual way.
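For illustration only, the hard-coded month list in the shell loop could be generated instead. The sketch below assumes the `YYYY-Month.txt.gz` naming seen in the URLs above; `monthly_archive_urls` is a hypothetical helper, not how mlstats finds archives, and a generated month may not exist on the server:

```python
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def monthly_archive_urls(base_url, first, last):
    """Yield one archive URL per month from `first` to `last`, inclusive.

    `first` and `last` are (year, month) tuples, with month in 1..12.
    """
    year, month = first
    while (year, month) <= last:
        yield "%s/%d-%s.txt.gz" % (
            base_url.rstrip("/"), year, MONTHS[month - 1])
        month += 1
        if month > 12:
            year, month = year + 1, 1
```

For example, `list(monthly_archive_urls("http://www.redhat.com/archives/rdo-list", (2013, 4), (2014, 5)))` covers the same fourteen months as the loop above, in chronological order.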

In any case, as you mentioned, this needs to be fixed through the urllib2 library.

justinclift commented 10 years ago

No worries. Try this: https://github.com/MetricsGrimoire/MailingListStats/pull/26

Adding that header worked for me through python shell. It should work for the RDO list too. Hopefully it won't stop the other mailing lists from working.

gpoo commented 10 years ago

I don't think that solves the issue. Without that, htmlparser.py works for me. See the following snippet:

import urllib2
import gzip
import cStringIO
import sys

def retrieve(url):
    # Browser-like User-Agent string
    user_agent = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 ' \
                 '(KHTML, like Gecko) Ubuntu/11.04 Chromium/15.0.871.0 ' \
                 'Chrome/15.0.871.0 Safari/535.2'
    # Tell the server that a compressed transfer is acceptable
    headers = { 'User-Agent': user_agent,
                'Accept-Encoding': 'gzip, deflate' }

    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)

    # Read the whole body into memory
    data = response.read()

    # Decompress only when the server says the transfer itself was gzipped;
    # otherwise the body is returned exactly as it was sent
    if response.info().getheader('content-encoding') == 'gzip':
        data = cStringIO.StringIO(data)
        htmltxt = gzip.GzipFile(mode='r', compresslevel=0,
                                fileobj=data).read()
    else:
        htmltxt = data

    response.close()
    return htmltxt

if __name__ == '__main__':
    print retrieve(sys.argv[1])

If you call it test.py and try: python test.py http://www.redhat.com/archives/rdo-list/2013-April.txt.gz, it will retrieve the document correctly.

I missed that utils also uses urllib2 to fetch documents. htmlparser.py is just part of the issue. Fixing utils.py should be straightforward, though.

gpoo commented 10 years ago

@dicortazar You need more than adding the header. You need to process the response if it comes gzipped. Or did you already try that?
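A hedged sketch of what "processing the response" could look like once the raw body and the Content-Encoding header value are in hand. It is Python 3, `decode_body` is a hypothetical helper rather than mlstats code, and it also covers "deflate", which the request above advertises but the snippet does not decode:

```python
import gzip
import zlib


def decode_body(data, content_encoding):
    """Undo the transfer compression named by the Content-Encoding header."""
    encoding = (content_encoding or "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(data)
    if encoding == "deflate":
        try:
            # Standard form: deflate data inside a zlib wrapper
            return zlib.decompress(data)
        except zlib.error:
            # Some servers send raw deflate without the zlib wrapper
            return zlib.decompress(data, -zlib.MAX_WBITS)
    if encoding == "identity":
        return data
    raise ValueError("unsupported Content-Encoding: %r" % content_encoding)
```

When a server delivers a .gz archive with no Content-Encoding header at all, this returns the body untouched, so the caller still receives compressed bytes and has to handle that separately.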

justinclift commented 10 years ago

@gpoo Hmmm, it's not liking that (test.py) on a non-RDO mailing list:

$ python test.py http://lists.openstack.org/pipermail/foundation/2012-March.txt.gz > 2012-March.txt.gz
$ gunzip 2012-March.txt.gz 

gzip: 2012-March.txt.gz: unexpected end of file
gpoo commented 10 years ago

@justinclift You should do:

$ python test.py http://lists.openstack.org/pipermail/foundation/2012-March.txt.gz > 2012-March.txt
$ more 2012-March.txt

The content is already decompressed.

justinclift commented 10 years ago

@gpoo What's the filesize on that 2012-March.txt.gz when it comes out of test.py for you?

$ ls -la
total 336
drwxr-xr-x   5 jc  staff     170 12 May 19:02 .
drwxr-xr-x  19 jc  staff     646 12 May 18:52 ..
-rw-r--r--   1 jc  staff  162823 12 May 19:02 2012-March.txt.gz
-rw-r--r--   1 jc  staff     839 12 May 18:52 test.py

For me, it's not already decompressed.

$ more 2012-March.txt.gz 
"2012-March.txt.gz" may be a binary file.  See it anyway? y
^_<8B>^H^H^E[<F6>O^B<FF>2012-March.txt^@<EC>}i<97><DB>Ʊ<F6>g<E3>Wt<E6>8W3^NIq<9B>5^NG<B3>Ʋ%K<91><E4>(ysr<EE>^AI<90><84>E^B4@^NE^?<B8><BF>
[... remaining binary gzip data omitted ...]
$
gpoo commented 10 years ago

This is what I get:

$ python test.py http://lists.openstack.org/pipermail/foundation/2012-March.txt.gz > 2012-March.txt
$ ls -l 2012-March.txt 
-rw-rw-r-- 1 gpoo gpoo 1370 May 12 11:07 2012-March.txt
$ head -3 2012-March.txt 
From nobody Mon May 12 11:07:16 2014
From: jgascon at gsyc.escet.urjc.es (Jorge Gascon Perez)
Date: Wed, 14 Feb 2007 19:46:10 -0000
justinclift commented 10 years ago

Hmmm, it's weird that it's saying "unexpected end of file".

$ gunzip 2012-March.txt.gz-test.py

gzip: 2012-March.txt.gz: unexpected end of file

There's a 1 byte filesize difference compared to the curl method:

$ python test.py http://lists.openstack.org/pipermail/foundation/2012-March.txt.gz > 2012-March.txt.gz-test.py
$ curl -r 0- http://lists.openstack.org/pipermail/foundation/2012-March.txt.gz > 2012-March.txt.gz-curl
$ ls -la
total 656
drwxr-xr-x   6 jc  staff     204 12 May 19:07 .
drwxr-xr-x  19 jc  staff     646 12 May 18:52 ..
-rw-r--r--   1 jc  staff  162822 12 May 19:07 2012-March.txt.gz-curl
-rw-r--r--   1 jc  staff  162823 12 May 19:06 2012-March.txt.gz-test.py

This is confusing.

justinclift commented 10 years ago

@gpoo It's probably due to different version dependencies or something. Not real sure.

I'll leave it to you. :smile:

gpoo commented 10 years ago

No, my bad. I was in the wrong directory trying another test.py file :-P

gpoo commented 10 years ago

I see the issue: in the script I was using print. Anyway, if you change that to something like:

if __name__ == '__main__':
    with open('output.gz', 'wb') as f:
        f.write(retrieve(sys.argv[1]))

it should work. Be careful with the name: it could be misleading, depending on the content negotiation.
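The one-byte difference reported earlier (162823 bytes from test.py against 162822 from curl) is consistent with print appending a newline to the data. A small Python 3 sketch of that effect, using in-memory streams and a made-up payload instead of the real archive; `write_exact` is a hypothetical helper:

```python
import gzip
import io


def write_exact(data, stream):
    """Write `data` to a binary stream exactly as received."""
    stream.write(data)


# Stand-in payload; the real archive is not downloaded here
payload = gzip.compress(b"From nobody Mon May 12 11:07:16 2014\n")

exact = io.BytesIO()
write_exact(payload, exact)

with_newline = io.BytesIO()
write_exact(payload + b"\n", with_newline)  # what print effectively emitted

print(len(with_newline.getvalue()) - len(exact.getvalue()))  # 1
```

In Python 3 the same helper can target standard output through `sys.stdout.buffer`, which accepts bytes and adds nothing to them.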

justinclift commented 10 years ago

No worries, I'm sure you'll get it working. :smile:

(I've already moved on to other bits on my Task list for today)

sduenas commented 10 years ago

@gpoo can you review this pull request? This one fixes the two problems that @dicortazar commented above:

For further changes I have two proposals:

jgbarah commented 10 years ago

WRT Requests, I guess the main difference with urllib2 is in fact its more "reasonable" API. However, that would amount to adding another dependency to mlstats, since urllib2 is part of the Python standard library, and Requests (AFAIK) is not. That's not a big issue, but I just wanted to mention it.

gpoo commented 10 years ago

At some point I dismissed Requests for the same reason explained by @jgbarah and because it is used only twice, and those uses could eventually be wrapped into one. I don't have a strong opinion, though.

Regarding magic numbers to identify the file type, I agree. I had the same idea, though for a different purpose (just store the mboxes compressed and uncompress them on the fly, instead of duplicating the space used in ~/.mlstats).

Beware that you can be grabbing a file named foo.gz but receiving a deflated one, and vice versa.
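A minimal sketch of the magic-number idea, assuming the only two cases of interest are gzip and plain text. It is Python 3, and `is_gzip` and `to_plain` are hypothetical names, not part of mlstats. The two signature bytes are the same ones visible at the start of the binary dump earlier in the thread (^_ and <8B>):

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member (RFC 1952)


def is_gzip(data):
    """True if `data` starts with the gzip signature."""
    return data[:2] == GZIP_MAGIC


def to_plain(data):
    """Return uncompressed bytes regardless of what the file was named."""
    return gzip.decompress(data) if is_gzip(data) else data
```

Deciding on content rather than on the file extension sidesteps the foo.gz mismatch in both directions, although a plain-text file that happens to begin with those two bytes would be misdetected.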

I will check the patch in a few hours; at first look it seems OK, though.