dandi / dandidav

WebDAV view to DANDI Archive
MIT License

log IP accessing dandidav #99

Closed yarikoptic closed 6 months ago

yarikoptic commented 6 months ago

So that we could get stats, catch abusers, etc.

yarikoptic commented 6 months ago

I wonder if there is information about agent like for http requests? I just see something slowly going through the dandidav instance. But if it is some bot, we might even ban it or provide an agents configuration to keep it from crawling the site.

jwodder commented 6 months ago

@yarikoptic Since dandidav is behind an Apache(?) proxy for HTTPS purposes, by default it'll only have the IP address for the proxy. In order for the actual client IP address to get to dandidav, the proxy server will need to send an X-Forwarded-For or Forwarded header in its requests to dandidav (and this will also need to be set up as part of the Terraform deployment).

However, the proxy server's access logs should already include the client IP address for requests to dandidav, so I'm unclear why you need dandidav to log this.
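To illustrate the X-Forwarded-For point above, here is a minimal Python sketch of how a backend might recover the client address from that header. It is not dandidav's code; the function name `client_ip_from_xff`, the `trusted_proxies` parameter, and the example addresses are all hypothetical. It assumes each trusted proxy appends the address it received the connection from, so with one proxy in front the rightmost entry is the one to trust.

```python
import ipaddress
from typing import Optional


def client_ip_from_xff(
    header_value: Optional[str], peer_ip: str, trusted_proxies: int = 1
) -> str:
    """Best-guess client IP from an X-Forwarded-For header value.

    Hypothetical helper: entries to the left of the trusted hops may have
    been supplied by the client itself, so they are ignored.
    """
    if not header_value or trusted_proxies < 1:
        # No header (or no proxy configured): the socket peer is all we have.
        return peer_ip
    hops = [h.strip() for h in header_value.split(",") if h.strip()]
    if len(hops) < trusted_proxies:
        return peer_ip
    candidate = hops[-trusted_proxies]
    try:
        # Normalise and validate; fall back to the peer on malformed input.
        return str(ipaddress.ip_address(candidate))
    except ValueError:
        return peer_ip
```

Without the header, the function returns the peer address, which matches the default behaviour described above: the backend only sees the proxy.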

> I wonder if there is information about agent like for http requests?

I assume that you're aware of the User-Agent header and that it's what you're referring to by "like for http requests." Thus, I don't know what other "agent" you could be referring to here.

yarikoptic commented 6 months ago

I guess I could blame it on the cold I was having -- I totally forgot that we also have logs at the Apache level. And indeed they show that the current accesses are from bots and spiders, e.g.

47.128.53.217 - - [11/Mar/2024:12:05:57 -0400] "GET /dandisets/000108/draft/sub-SChmi53/ses-20220927h17m32s35/micr/sub-SChmi53_ses-20220927h17m32s35_sample-29_stain-NN_run-1_chunk-4_SPIM.ome.zarr/0/0/0/7/0/ HTTP/1.1" 200 6525 "-" "Mozilla/5.0 (Linux; Android 5.0) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; Bytespider; spider-feedback@bytedance.com)"
85.208.96.200 - - [11/Mar/2024:12:06:34 -0400] "GET /dandisets/000579/latest/sub-3/ HTTP/1.1" 200 6673 "-" "Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
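Access-log entries in this Apache "combined" layout can be tallied per client IP with a short script. The following is a rough Python sketch, not an existing tool: the names `COMBINED_RE`, `BOT_HINTS`, `parse_line`, `looks_like_bot`, and `bot_hits_by_ip` are hypothetical, and the substring check for bots is a crude heuristic that assumes crawlers identify themselves in the User-Agent.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Optional

# Apache "combined" format: ip ident user [time] "request" status size "referer" "agent"
COMBINED_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumption: self-identifying crawlers carry one of these substrings.
BOT_HINTS = ("bot", "spider", "crawl")


def parse_line(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one log line, or None if it does not match."""
    m = COMBINED_RE.match(line)
    return m.groupdict() if m else None


def looks_like_bot(agent: str) -> bool:
    """Case-insensitive substring check against BOT_HINTS."""
    lowered = agent.lower()
    return any(hint in lowered for hint in BOT_HINTS)


def bot_hits_by_ip(lines: Iterable[str]) -> Counter:
    """Count requests per client IP whose User-Agent looks like a bot."""
    counts: Counter = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and looks_like_bot(rec["agent"]):
            counts[rec["ip"]] += 1
    return counts
```

Both sample entries above would be counted, since their agents contain "spider" (Bytespider) and "Bot" (SemrushBot). The regex does not handle escaped quotes inside a field, so lines containing them are skipped rather than mis-parsed.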

Do we have similar proxying done for https://webdav.dandiarchive.org/ ? Are its logs available, and do they contain similar information?

jwodder commented 6 months ago

> Do we have similar proxying done for https://webdav.dandiarchive.org/ ?

Well, something's got to provide HTTPS access.

> Are its logs available, and do they contain similar information?

CC @mvandenburgh

mvandenburgh commented 6 months ago

We could set up Papertrail to log requests like we have on dandi-archive, which would include the client IP. @yarikoptic should I do that, and if so which plan should I use? See the "Plans & Pricing" section here - https://elements.heroku.com/addons/papertrail

yarikoptic commented 6 months ago

I don't really rely on Papertrail... maybe I could just fetch them from Heroku directly, similar to how I do for dandi-api and staging. Added it to /mnt/backup/dandi/heroku-logs. But they lack IPs there:

2024-03-12T20:08:06.731116+00:00 heroku[router]: at=info method=GET path="/" host=webdav.dandiarchive.org request_id=07815ef3-0f0e-4ea0-8213-507081b8fd52 fwd="24.54.13.166" dyno=web.1 connect=0ms service=1ms status=200 bytes=1809 protocol=https
2024-03-12T20:08:06.780546+00:00 heroku[router]: at=info method=GET path="/.static/styles.css" host=webdav.dandiarchive.org request_id=7ae70577-3ebb-487e-9b08-d12b9e669e2e fwd="24.54.13.166" dyno=web.1 connect=0ms service=1ms status=200 bytes=1108 protocol=https
2024-03-12T20:08:06.730369+00:00 app[web.1]: 2024-03-12T20:08:06.730306Z DEBUG request{method=GET uri=/ version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2024-03-12T20:08:06.730450+00:00 app[web.1]: 2024-03-12T20:08:06.730434Z DEBUG request{method=GET uri=/ version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200
2024-03-12T20:08:06.779766+00:00 app[web.1]: 2024-03-12T20:08:06.779714Z DEBUG request{method=GET uri=/.static/styles.css version=HTTP/1.1}: tower_http::trace::on_request: started processing request
2024-03-12T20:08:06.779781+00:00 app[web.1]: 2024-03-12T20:08:06.779766Z DEBUG request{method=GET uri=/.static/styles.css version=HTTP/1.1}: tower_http::trace::on_response: finished processing request latency=0 ms status=200

@mvandenburgh do you think they would magically appear at papertrail level? why nothing relevant at heroku?

yarikoptic commented 6 months ago

@mvandenburgh ping on above question

mvandenburgh commented 6 months ago

> @mvandenburgh do you think they would magically appear at papertrail level? why nothing relevant at heroku?

I think what we actually need to look at are the Heroku router logs, as the router is the analog of the Apache proxy for the Heroku deployment. I'll confirm this though.

yarikoptic commented 6 months ago

Correct -- we seem to be getting them in the router logs, same as for dandi-archive. So let's assume that is enough for now. Thanks!
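For reference, the `heroku[router]` lines quoted earlier in the thread carry the client address in their `fwd="..."` field, while the `app[web.1]` lines from dandidav itself do not. A minimal Python sketch for pulling it out of fetched logs follows. The names `PAIR_RE` and `parse_router_line` are hypothetical, and the key=value layout is assumed from those sample lines only.

```python
import re
from typing import Dict, Optional

# key=value pairs, where the value is either double-quoted or a bare token
PAIR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')

ROUTER_MARKER = " heroku[router]: "


def parse_router_line(line: str) -> Optional[Dict[str, str]]:
    """Parse one Heroku router log line into a dict of its fields.

    Returns None for non-router lines (e.g. app[web.1] output).
    """
    timestamp, sep, rest = line.partition(ROUTER_MARKER)
    if not sep:
        return None
    fields = {key: value.strip('"') for key, value in PAIR_RE.findall(rest)}
    fields["timestamp"] = timestamp
    return fields
```

Calling `parse_router_line(line)["fwd"]` on the first router line shown earlier gives `24.54.13.166`; combining it with `path` and `status` gives roughly the same per-request picture as the Apache access log, minus the User-Agent, which does not appear in these router lines.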