curl / curl

A command line tool and library for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, GOPHERS, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, MQTT, POP3, POP3S, RTMP, RTMPS, RTSP, SCP, SFTP, SMB, SMBS, SMTP, SMTPS, TELNET, TFTP, WS and WSS. libcurl offers a myriad of powerful features
https://curl.se/

"--max-filesize 0" doesn't seem to work as described #14440

Closed MasterInQuestion closed 3 months ago

MasterInQuestion commented 3 months ago

    Seemingly a no-op, or it nulls the limit.
    Somewhat against: https://curl.se/docs/manpage.html#--max-filesize

    Use case:
    curl -v --max-filesize 1 -L "https://github.com/mozilla-mobile/firefox-android/assets/38040960/fd50937d-5442-494e-b4aa-0baf75569a57"

    Effectively doing HEAD but with GET:
    [ ^ Like what browsers do: https://bugzilla.mozilla.org/show_bug.cgi?id=1872503#c3 ]
    Certain servers may refuse to serve HEAD (one example reported HTTP 403 Forbidden), while the file may be large.

    Related:
    https://github.com/curl/curl/issues/11810
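If the option worked as documented, a transfer over the limit would abort with curl's documented exit status 63 (CURLE_FILESIZE_EXCEEDED). A minimal sketch of checking for that; the URL in the comment is a stand-in for any large file, and the helper only maps exit statuses to labels:

```shell
# Map a curl exit status to a short label; 63 is CURLE_FILESIZE_EXCEEDED,
# the status --max-filesize is documented to produce when the limit is hit.
describe_curl_exit() {
  case "$1" in
    0)  echo "ok" ;;
    63) echo "maximum file size exceeded" ;;
    *)  echo "curl error $1" ;;
  esac
}

# Hypothetical invocation ($URL stands in for any large file):
#   curl -sS --max-filesize 1 -L -o /dev/null "$URL"; describe_curl_exit $?
describe_curl_exit 63
```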

bagder commented 3 months ago
  1. The documentation indeed does not mention the zero exception; I am fixing that in #14443
  2. That Firefox bug does not say anything about this.
  3. You can do a GET without reading the body with curl -I -X GET https://example.com
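One way to put that tip to use: capture just the headers and read Content-Length to learn the size without pulling the body (assuming the server sends Content-Length at all). Only the parsing helper is runnable here; the curl invocation is shown in a comment:

```shell
# Extract Content-Length from HTTP response headers on stdin
# (case-insensitive field name, CR stripped from the CRLF line endings).
content_length() {
  awk 'tolower($1) == "content-length:" { sub("\r", "", $2); print $2 }'
}

# In practice the headers would come from:  curl -sI -X GET "$URL"
printf 'HTTP/1.1 200 OK\r\nContent-Length: 1048576\r\n\r\n' | content_length
```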
MasterInQuestion commented 3 months ago

    Didn't manage to locate the "-X"...
    Note "GET" is case-sensitive: "-X" just passes the string verbatim.

MasterInQuestion commented 2 months ago

    Why does it, for this example, seem to still download the whole file?
    https://drive.usercontent.google.com/download?confirm=t&export=download&id=1WxOrSi-GNB45nLUUiR4PT7c4H2VurtKk (~ 18.34 MiB)

    The "--max-filesize 1" variant worked as intended.

    See also:
    https://trac.ffmpeg.org/ticket/11056#comment:16
    https://trac.ffmpeg.org/ticket/11159#comment:3
    ("confirm=t" needed to bypass some "virus" confirmation)

    ----

    More suitable to test:
    -A "Mozilla/5.0 (Linux; rv:999) Gecko/20100101 Firefox/999" "https://premium.britannica.com/wp-content/uploads/2023/05/memorialday-2620x1080-1.png"
    (~ 1.3 MiB)

MasterInQuestion commented 2 months ago

    @bagder, probably worth your attention.

bagder commented 2 months ago

What is? I don't understand what you're talking about.

MasterInQuestion commented 2 months ago

    Pardon.
    Straightforward but less accurate:
    curl -I -X GET -A "Mozilla/5.0 (Linux; rv:999) Gecko/20100101 Firefox/999" "https://premium.britannica.com/wp-content/uploads/2023/05/memorialday-2620x1080-1.png"

bagder commented 2 months ago

That's a curl command line. What about it?

bagder commented 2 months ago

You ask for -I (HEAD) and yet you insist on -X GET, which is highly confusing. What do you want it to do?

MasterInQuestion commented 2 months ago

    The question is:
    Compared to the "--max-filesize 1" variant, this one causes the unwanted full download.
    (instead of merely getting the headers)

    ----

    Rationale explained in the 1st post:
    "Certain servers may refuse to serve HEAD (one example reported HTTP 403 Forbidden), while the file may be large."

bagder commented 2 months ago

OK, so what is the exact question?

MasterInQuestion commented 2 months ago

    How to:
    Effectively do HEAD but with GET, without the full download?

bagder commented 2 months ago

That is exactly what you get with:

curl -I -X GET $URL

MasterInQuestion commented 2 months ago

    That is exactly what you get me with...
    [ Quote bagder @ CE 2024-08-07 15:05:06 UTC: https://github.com/curl/curl/issues/14440#issuecomment-2273695109
    3. You can do a GET without reading the body with `curl -I -X GET "https://example.com"`. ]

    The problem is:
    It seems to cause the unwanted full download.

    Did it work (without the full download) for you?

jay commented 2 months ago

I also don't understand what you are asking. You want curl to behave as if it's receiving a HEAD response and close? What do you mean it causes an unwanted download? For example, this download of 200 MB should terminate immediately (after receiving the headers) if you tell curl it's a HEAD request but then change the custom method to GET:

curl -v -I -X GET http://cachefly.cachefly.net/200mb.test -o NUL

The server sees GET and replies with the content, but curl will terminate the connection after the headers.

It sounds to me like you want to simulate a HEAD reply for a server that does not support those requests, but if you send a GET request to the server then it may send data before curl can close the connection. That's what you are asking the server to do: with GET you want to get the resource. Correct me if I'm wrong @bagder, but I'm pretty sure the data is discarded as excess in such a case (i.e. not written to the -o outfile), though I don't know if that's guaranteed.

MasterInQuestion commented 2 months ago

    Compare:
    curl -I -X GET "https://cachefly.cachefly.net/200mb.test"
    curl -I -X GET "https://drive.usercontent.google.com/download?confirm=t&export=download&id=1WxOrSi-GNB45nLUUiR4PT7c4H2VurtKk"

    #1 also worked for me. (no notable download)

jay commented 1 month ago

As I have explained, the server may send data before curl can close the connection. I took a look at your latter example in Wireshark, and Google takes approximately 3 seconds to reply with HTTP/2 HEADERS; I don't know why so long, but it has nothing to do with curl. Then the server follows with DATA frames, and during that entire time (which is less than 1 second, like 100-200 ms) curl replies with RST_STREAM on the stream and then GOAWAY on the connection. You cannot expect that no data will be sent, because you are requesting the data be sent and curl needs to hang up after receiving the headers.

MasterInQuestion commented 1 month ago

    [[ As I have explained, the server may send data before `curl` can close the connection.
    I took a look at your latter example in Wireshark: and Google takes approximately 3 seconds to reply with HTTP/2 HEADERS.
    I don't know why so long but it has nothing to do with `curl`.

    Then the server follows with DATA frames, and during that entire time which is less than 1 second like 100, 200 ms:
    `curl` replies with RST_STREAM on the stream and then GOAWAY on the connection.

    You cannot expect no data will be sent: because you are requesting the data be sent.
    And `curl` needs to hang up after receiving the headers. ]]

    So for this case, the validity of "--max-filesize 0" seems to hold.

    Meanwhile I noted that using "--max-filesize 1" with "-L" had caused curl to carp about the length of the redirection message:
    To croak amid the redirection with "(63) Maximum file size exceeded".

    The workaround would be raising the limit to a somewhat higher, more tolerable value, e.g. "2K" (2,048 B).
    [ I find "1500" works more pleasantly. Though a bit more bother to type. ]
    Non-plain-text output will regardless not be written to the terminal: unless explicitly requested via "-o -" or the like.
    When dealing with some extraordinarily small files, "/dev/null" or the like may have to be bothered with.
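For picking such a limit, note that curl accepts K, M, and G suffixes for --max-filesize, documented as multiples of 1024. A small sketch of that arithmetic, to confirm "2K" really means 2,048 bytes:

```shell
# Expand the K/M/G size suffixes that curl's --max-filesize accepts
# (each step is a factor of 1024, per the curl manpage).
to_bytes() {
  case "$1" in
    *K) echo $(( ${1%K} * 1024 )) ;;
    *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
    *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
    *)  echo "$1" ;;
  esac
}

to_bytes 2K
```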

jay commented 1 month ago

I see, you are saying that --max-filesize applies to servers that redirect. Users of --max-filesize may want to limit the overall bytes downloaded, even if it's specifically documented as the size of the file downloaded, so I'm not sure that's a bug. What happens on redirect is that curl discards the bytes: if the redirect is from localhost/foo to localhost/bar, then it ignores the foo download ("* Ignoring the response-body") and downloads bar, but it has to read the bytes of foo (which location redirects may have).

Anyone else have an opinion on whether this is appropriate behavior?

MasterInQuestion commented 1 month ago

    [[ I see, you are saying that "--max-filesize" applies to servers that redirect.

    Users of "--max-filesize" may want to limit the overall bytes downloaded:
    Even if it's specifically documented as file size downloaded.
    So I'm not sure that's a bug.

    What happens on redirect is:
    `curl` is discarding the bytes: like, if the redirect is from "localhost/A" to "localhost/B":
    Then it ignores the "A" download ("* Ignoring the response-body"), and downloads "B".
    But it has to read the bytes of "A" (which location redirects may have).

    Anyone else have an opinion on whether this is appropriate behavior? ]]

    Perhaps a separation: "--max-dsize"? (parallel of "fsize")

    The "foobar" nonsense is extraordinarily befuddling...
    Normalized, and I still couldn't quite understand.

bagder commented 1 month ago

Anyone else have an opinion on whether this is appropriate behavior?

The ignored response-body should not be counted as "file download" data; if it is, that's a bug. The max filesize should apply to the data actually delivered/saved, not just transferred, I think.
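The distinction drawn here can be sketched as simple bookkeeping: bytes read off the wire but discarded (a redirect's response body) versus bytes actually delivered to the output. A toy model only, not curl's internals:

```shell
# Toy model: tally bytes per event read from stdin. Under the reading above,
# only "body" bytes (actually delivered) would count toward --max-filesize;
# "redirect" bytes are read off the wire but discarded.
count_delivered() {
  delivered=0 discarded=0
  while read -r kind size; do
    case "$kind" in
      body)     delivered=$(( delivered + size )) ;;
      redirect) discarded=$(( discarded + size )) ;;
    esac
  done
  echo "delivered=$delivered discarded=$discarded"
}

printf 'redirect 318\nbody 4096\n' | count_delivered
```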

jay commented 1 month ago

you are saying that --max-filesize applies to servers that redirect.

Please take further discussion of that issue to #14899.

The "foobar" non-sense is extraordinarily befuddling..

They're placeholder names

MasterInQuestion commented 1 month ago

    I know, but anything involving "foobar" would be likewise befuddling...

    https://github.com/MasterInQuestion/attach/raw/main/curl/curl/14440/foobar.webp

    Not just your writing.