ESGF / esgf-download

ESGF data transfer and replication tool
https://esgf.github.io/esgf-download/
BSD 3-Clause "New" or "Revised" License

identifying "official" CMIPx replication node downloads #32

Open durack1 opened 8 months ago

durack1 commented 8 months ago

Hi all, we've been reviewing the CMIP6 download statistics at http://esgf-ui.cmcc.it/esgf-dashboard-ui/cmip6.html, and are curious about how we could use a custom user-agent string to identify "registered" replication nodes (e.g. LLNL/aims3.llnl.gov, DKRZ/esgf1.dkrz.de, CEDA/esgf.ceda.ac.uk, ...) vs "non-registered" general downloads. In the current CMCC stats, these 3 replication nodes account for ~30%, 18% and 10% of the CMIP6 downloads through early September 2023.

ping @sandrofiore @alenu @ant4res @sashakames @climate-dude

AtefBN commented 8 months ago

Hi @durack1, just an FYI: @svenrdz and I talked about this, and we have no objection to including a distinct user-agent in esgpull download queries. However, this might not be a 100% accurate representation of replication downloads, since the tool can be used by anyone to download ESGF data. One possibility would be to make this something that is enabled through configuration. That said, the replication volume will always overwhelm the rest of the esgpull use cases, in my completely statistics-devoid opinion.

durack1 commented 8 months ago

@AtefBN great, thanks for engaging on this. We had thought of having a user-agent that could be configured by a user. The ones we're most interested in are the primary replication users, so for CMIP6 currently we have LLNL, DKRZ, CEDA and NCI; we will have to watch how the network evolves going forward. If we could figure out a way to register these instances of esgpull (during the software config), then it should be possible for the download statistics to identify "registered" replication services vs other use cases.
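A minimal sketch of what a configurable user-agent could look like, using only the Python standard library. This is not esgpull's actual API: the `build_user_agent` and `make_request` helpers, the `replication_site` setting, the `(replication; site=...)` token format, and the example URL are all hypothetical, chosen only to illustrate the idea of tagging requests from a registered replication instance.

```python
from typing import Optional
from urllib.request import Request

# Hypothetical base token; the real tool may identify itself differently.
BASE_AGENT = "esgpull"


def build_user_agent(version: str, replication_site: Optional[str] = None) -> str:
    """Return a User-Agent string, tagged when a replication site is configured."""
    agent = f"{BASE_AGENT}/{version}"
    if replication_site:
        # Assumed format: a comment field naming the registered site.
        agent += f" (replication; site={replication_site})"
    return agent


def make_request(url: str, version: str, replication_site: Optional[str] = None) -> Request:
    """Build (but do not send) a request carrying the configured User-Agent."""
    headers = {"User-Agent": build_user_agent(version, replication_site)}
    return Request(url, headers=headers)


# Example: a request from an instance configured as the LLNL replication node.
request = make_request("https://example.org/data/file.nc", "0.0.0", "LLNL")
print(request.get_header("User-agent"))
```

Leaving `replication_site` unset keeps the generic user-agent, so ordinary users of the tool would not be counted as replication traffic unless they opt in through their configuration.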

AtefBN commented 8 months ago

I agree, and I don't see why we cannot include this in the software. I am not sure how widely esgpull has replaced synda on the replication nodes though: per the CDNOT discussion yesterday, a couple of nodes are still running synda, which could/should change in the upcoming months.

durack1 commented 8 months ago

@AtefBN exactly. At LLNL we are still using (an old version of) Synda, and hadn't planned to update to esgpull until we move to a new project, likely CMIP6Plus preceding CMIP7...

alenu commented 7 months ago

Hi all, on the statistics side, we ran a test on the "user-agent" field of the HTTP entry we get for each download. We store the user-agent information, so we can easily filter on a custom value to distinguish between generic downloads and replicas. Adding @anfab to the thread, who is also involved in this activity.
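As a rough illustration of filtering on the stored user-agent, the sketch below tallies log lines into generic and replication downloads. It assumes entries in the common "combined" access-log layout, where the user-agent is the last quoted field, and it assumes a hypothetical `(replication; site=...)` marker in the user-agent; neither the log layout used by the statistics service nor the marker format is confirmed in this thread.

```python
import re
from collections import Counter
from typing import Iterable

# The user-agent is taken to be the last double-quoted field on the line.
USER_AGENT_FIELD = re.compile(r'"(?P<agent>[^"]*)"\s*$')
# Hypothetical marker a registered replication instance would send.
REPLICATION_MARKER = re.compile(r"replication; site=(?P<site>[\w.-]+)")


def classify(line: str) -> str:
    """Label one log line as 'replication:<site>', 'generic', or 'unknown'."""
    field = USER_AGENT_FIELD.search(line)
    if field is None:
        return "unknown"
    marker = REPLICATION_MARKER.search(field.group("agent"))
    if marker is None:
        return "generic"
    return f"replication:{marker.group('site')}"


def tally(lines: Iterable[str]) -> Counter:
    """Count downloads per label."""
    return Counter(classify(line) for line in lines)


# Made-up sample entries, for illustration only.
sample = [
    '192.0.2.1 - - [01/Sep/2023:00:00:00 +0000] "GET /a.nc HTTP/1.1" 200 10 "-" "esgpull/0.0.0 (replication; site=LLNL)"',
    '192.0.2.2 - - [01/Sep/2023:00:00:01 +0000] "GET /b.nc HTTP/1.1" 200 10 "-" "Wget/1.21"',
]
counts = tally(sample)
print(counts)
```

Keying the tally on the site name rather than a single "replication" bucket would let the dashboard report each registered node separately.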