Closed yarikoptic closed 6 months ago
I wonder if there is information about agent like for http requests? I just see something slowly going through the dandidav instance. If it is some bot, we might even ban it or provide an agents configuration to keep it from crawling.
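As a rough illustration of the "agents configuration" idea: a robots.txt only helps with crawlers that honor it, but its effect can be checked with Python's standard `urllib.robotparser`. This is a sketch under assumptions: the rules and the `ExampleBot` name below are hypothetical, not anything dandidav actually serves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow:
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # robotparser matches on the product token (e.g. "ExampleBot"),
    # not on a full "Mozilla/5.0 (...)" User-Agent string.
    return parser.can_fetch(user_agent, url)
```

Crawlers that ignore robots.txt would still have to be blocked at the proxy level.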
@yarikoptic Since dandidav is behind an Apache(?) proxy for HTTPS purposes, by default it'll only have the IP address for the proxy. In order for the actual client IP address to get to dandidav, the proxy server will need to send an X-Forwarded-For or Forwarded header in its requests to dandidav (and this will also need to be set up as part of the Terraform deployment).
However, the proxy server's access logs should already include the client IP address for requests to dandidav, so I'm unclear why you need dandidav to log this.
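For reference, here is a minimal sketch of how a backend could recover the client address once a proxy sends X-Forwarded-For or Forwarded. It is illustrative Python, not dandidav's actual code, and it assumes exactly one trusted proxy that appends the address it saw; `client_ip` and `_strip_port` are hypothetical helper names.

```python
import re

# Matches the for= parameter of one Forwarded element (RFC 7239).
_FOR_PARAM = re.compile(r'(?:^|;)\s*for=("[^"]*"|[^;,\s]+)', re.IGNORECASE)


def _strip_port(node):
    """Drop quotes, IPv6 brackets and an optional :port from an address."""
    node = node.strip().strip('"')
    if node.startswith("["):  # bracketed IPv6, optionally followed by :port
        return node[1:].partition("]")[0]
    if node.count(".") == 3 and ":" in node:  # IPv4:port
        return node.split(":")[0]
    return node


def client_ip(headers, peer_ip):
    """Best-guess client IP for a request relayed by a single trusted proxy.

    Falls back to `peer_ip` (the socket peer, i.e. the proxy itself)
    when neither header is present.
    """
    lowered = {k.lower(): v for k, v in headers.items()}
    xff = lowered.get("x-forwarded-for")
    if xff:
        # Assumption: the proxy appends, so the last entry is the one it saw;
        # earlier entries are client-supplied and cannot be trusted.
        return _strip_port(xff.split(",")[-1])
    forwarded = lowered.get("forwarded")
    if forwarded:
        match = _FOR_PARAM.search(forwarded.split(",")[-1].strip())
        if match:
            return _strip_port(match.group(1))
    return peer_ip
```

With more than one proxy in front, the number of trailing entries to trust would have to match the number of trusted hops.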
I wonder if there is information about agent like for http requests?
I assume that you're aware of the User-Agent header and that it's what you're referring to by "like for http requests." Thus, I don't know what other "agent" you could be referring to here.
I guess I could blame it on the cold I was having -- I totally forgot that we also have logs at the Apache level. And indeed they show that the current accesses are from bots and spiders, e.g.
47.128.53.217 - - [11/Mar/2024:12:05:57 -0400] "GET /dandisets/000108/draft/sub-SChmi53/ses-20220927h17m32s35/micr/sub-SChmi53_ses-20220927h17m32s35_sample-29_stain-NN_run-1_chunk-4_SPIM.ome.zarr/0/0/0/7/0/ HTTP/1.1" 200 6525 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"
85.208.96.200 - - [11/Mar/2024:12:06:34 -0400] "GET /dandisets/000579/latest/sub-3/ HTTP/1.1" 200 6673 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
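Entries like these can be tallied with a short script. The sketch below assumes the Apache "combined" log format shown above; the `bot|spider|crawl` heuristic and the function name `bot_agents` are my own illustrative choices, not an established list of crawlers.

```python
import re
from collections import Counter

# Apache "combined" format: ip ident user [time] "request" status size "referer" "agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Crude heuristic: self-identifying crawlers usually carry one of these words.
BOT_HINT = re.compile(r"bot|spider|crawl", re.IGNORECASE)


def bot_agents(lines):
    """Count requests per User-Agent for agents that look like crawlers."""
    counts = Counter()
    for line in lines:
        match = COMBINED.match(line)
        if match and BOT_HINT.search(match["agent"]):
            counts[match["agent"]] += 1
    return counts
```

Calling `bot_agents(open(path))` on an access log and then `.most_common(10)` on the result would list the busiest self-declared crawlers; bots that spoof a browser User-Agent would not be caught this way.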
Do we have similar proxying done for https://webdav.dandiarchive.org/? Are its logs available, and do they contain similar information?
Do we have similar proxying done for https://webdav.dandiarchive.org/?
Well, something's got to provide HTTPS access.
Are its logs available, and do they contain similar information?
CC @mvandenburgh
We could set up Papertrail to log requests like we have on dandi-archive, which would include the client IP. @yarikoptic should I do that, and if so which plan should I use? See the "Plans & Pricing" section here - https://elements.heroku.com/addons/papertrail
I don't really rely on Papertrail... maybe I could just fetch them from Heroku directly, similarly to how I do for dandi-api and staging. Added it to /mnt/backup/dandi/heroku-logs. But they lack IPs there:
2024-03-12T20:08:06.731116+00:00 heroku[router]: at=info method=GET path="/" host=webdav.dandiarchive.org request_id=07815ef3-0f0e-4ea0-8213-507081b8fd52 fwd="24.54.13.166" dyno=web.1 connect=0ms service=1ms status=200 bytes=1809 protocol=https
2024-03-12T20:08:06.780546+00:00 heroku[router]: at=info method=GET path="/.static/styles.css" host=webdav.dandiarchive.org request_id=7ae70577-3ebb-487e-9b08-d12b9e669e2e fwd="24.54.13.166" dyno=web.1 connect=0ms service=1ms status=200 bytes=1108 protocol=https
2024-03-12T20:08:06.730369+00:00 app[web.1]: 2024-03-12T20:08:06.730306Z DEBUG request{method=GET uri=/ version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2024-03-12T20:08:06.730450+00:00 app[web.1]: 2024-03-12T20:08:06.730434Z DEBUG request{method=GET uri=/ version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
2024-03-12T20:08:06.779766+00:00 app[web.1]: 2024-03-12T20:08:06.779714Z DEBUG request{method=GET uri=/.static/styles.css version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2024-03-12T20:08:06.779781+00:00 app[web.1]: 2024-03-12T20:08:06.779766Z DEBUG request{method=GET uri=/.static/styles.css version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
@mvandenburgh do you think they would magically appear at papertrail level? why nothing relevant at heroku?
@mvandenburgh ping on above question
@mvandenburgh do you think they would magically appear at papertrail level? why nothing relevant at heroku?
I think what we actually need to look at are the heroku router logs, as that is the analog of Apache proxy for the heroku deployment. I'll confirm this though.
Correct -- we seem to be getting them in the router logs, similarly to dandi-archive. So let's assume that is enough for now. Thanks!
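For anyone wanting to pull the client addresses out of those router lines, here is a minimal sketch. It assumes the `key=value` layout seen in the `heroku[router]` lines quoted earlier in this thread, where `fwd` carries the forwarded client address; the function names are hypothetical.

```python
import re
from collections import Counter

# key=value pairs, where the value may be double-quoted.
PAIR = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_router_line(line):
    """Return the fields of a heroku[router] log line, or None for other lines."""
    timestamp, sep, rest = line.partition(" heroku[router]: ")
    if not sep:
        return None  # e.g. app[web.1] lines, which carry no fwd= field
    fields = {key: value.strip('"') for key, value in PAIR.findall(rest)}
    fields["timestamp"] = timestamp
    return fields


def requests_per_client(lines):
    """Count router-level requests per raw fwd= value."""
    counts = Counter()
    for line in lines:
        fields = parse_router_line(line)
        if fields and "fwd" in fields:
            counts[fields["fwd"]] += 1
    return counts
```

Run over the fetched Heroku logs, `requests_per_client(...).most_common()` would give a quick per-address request count for spotting heavy crawlers.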
so we could get stats, catch abusers, etc