dmwm / das2go

Go implementation of Data Aggregation System (DAS) for CMS experiment
MIT License

Instabilities of DAS site dataset=/a/b/c queries #32

Closed vkuznet closed 2 years ago

vkuznet commented 3 years ago

I got a report from Felipe Gómez-Cortés, who found that the DAS web UI returns different results for the following query:

site dataset=/ReggeGribovPartonMC_EposLHC_pPb_4080_4080/pPb816Spring16GS-80X_mcRun2_asymptotic_v17-v1/GEN-SIM

After a series of iterations I confirmed this in my dev environment. The problem is related to DAS being unable to contact the cms-rucio service, which yields the following errors:

2021/06/05 11:17:28 fetch.go:486: ERROR: fail to fetch http://cms-rucio.cern.ch/replicas/cms//ReggeGribovPartonMC_EposLHC_pPb_4080_4080/pPb816Spring16GS-80X_mcRun2_asymptotic_v17-v1/GEN-SIM#3ce2d95e-7168-11e6-9fb1-002590494fb0/datasets, retries 3, error Get "http://cms-rucio.cern.ch/replicas/cms//ReggeGribovPartonMC_EposLHC_pPb_4080_4080/pPb816Spring16GS-80X_mcRun2_asymptotic_v17-v1/GEN-SIM#3ce2d95e-7168-11e6-9fb1-002590494fb0/datasets": dial tcp: lookup cms-rucio.cern.ch: no such host
2021/06/05 11:17:28 fetch.go:486: ERROR: fail to fetch http://cms-rucio.cern.ch/replicas/cms//ReggeGribovPartonMC_EposLHC_pPb_4080_4080/pPb816Spring16GS-80X_mcRun2_asymptotic_v17-v1/GEN-SIM#c541fd04-7198-11e6-9fb1-002590494fb0/datasets, retries 3, error Get "http://cms-rucio.cern.ch/replicas/cms//ReggeGribovPartonMC_EposLHC_pPb_4080_4080/pPb816Spring16GS-80X_mcRun2_asymptotic_v17-v1/GEN-SIM#c541fd04-7198-11e6-9fb1-002590494fb0/datasets": dial tcp: lookup cms-rucio.cern.ch: no such host

We need to identify the source of this issue. @ericvaandering any ideas?

ericvaandering commented 3 years ago

I assume this is intermittent? The first thing I’ll check is that the DNS entries all have an ingress on them.

ericvaandering commented 3 years ago

Didn't your hosts have some DNS issue before? Because I confirm that even from my desktop at home, cms-rucio.cern.ch resolves correctly to two IPs. And both those nodes in k8s have role=ingress.

The error message sure looks like basic DNS lookup is failing, not that it's failing to connect or find a service on the IP address.
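For reference, the distinction between a failed name lookup and a failed connection can be checked programmatically rather than by reading the log text. The following is only a minimal sketch using the standard `errors` and `net` packages; `classify`, `sampleLookupErr` and `sampleRefusedErr` are hypothetical names introduced for illustration and are not part of das2go. The sample errors are constructed by hand to mimic the wrapping seen in the log above, so no network access is needed.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// classify reports whether err comes from name resolution or from a
// later network stage. The DNS check goes first because a lookup failure
// is usually wrapped inside a *net.OpError with Op "dial".
func classify(err error) string {
	if err == nil {
		return "ok"
	}
	var dnsErr *net.DNSError
	if errors.As(err, &dnsErr) {
		if dnsErr.IsNotFound {
			return "dns: no such host"
		}
		return "dns: other lookup failure"
	}
	var opErr *net.OpError
	if errors.As(err, &opErr) {
		return "network: " + opErr.Op
	}
	return "other"
}

// sampleLookupErr builds an error shaped like the "no such host" failure
// in the log (hand-made for illustration only).
func sampleLookupErr() error {
	return &net.OpError{Op: "dial", Net: "tcp", Err: &net.DNSError{
		Err: "no such host", Name: "cms-rucio.cern.ch", IsNotFound: true,
	}}
}

// sampleRefusedErr builds a dial error that is not a DNS failure.
func sampleRefusedErr() error {
	return &net.OpError{Op: "dial", Net: "tcp", Err: errors.New("connection refused")}
}

func main() {
	fmt.Println(classify(sampleLookupErr()))  // dns: no such host
	fmt.Println(classify(sampleRefusedErr())) // network: dial
}
```

An error returned by `http.Get` can be passed to `classify` in the same way, since `errors.As` walks the whole wrapping chain.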

vkuznet commented 3 years ago

Yes, the issue is intermittent: sometimes everything is fine, while on another attempt it is not. I can see it both in production k8s and at home when running DAS on my local laptop. I suspect it is related to the concurrent load on the DNS server(s).

vkuznet commented 3 years ago

I think this is the old discussion of race conditions in the Go network stack. I changed the production server to use a queue and constrained it to a maximum of 100 concurrent request calls. After a few test iterations I no longer see the problem. I'll leave the ticket open and will check site queries to verify whether this fixes the problem.
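A cap on concurrent calls like the one described above can be sketched with a buffered channel used as a counting semaphore. This is not the actual das2go change (which uses a queue); it is a minimal illustration of the same idea, and `runLimited` and the placeholder task are hypothetical. The function also records the peak number of tasks in flight, so the limit can be verified.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runLimited runs task(i) for i in [0, n) with at most limit tasks in
// flight at once and returns the highest concurrency it observed.
func runLimited(n, limit int, task func(i int)) int64 {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	var inFlight, peak int64
	for i := 0; i < n; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit tasks are already running
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot when done
			cur := atomic.AddInt64(&inFlight, 1)
			for { // record a new peak if cur exceeds the stored one
				p := atomic.LoadInt64(&peak)
				if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
					break
				}
			}
			task(i)
			atomic.AddInt64(&inFlight, -1)
		}(i)
	}
	wg.Wait()
	return atomic.LoadInt64(&peak)
}

func main() {
	var done int64
	peak := runLimited(1000, 100, func(i int) {
		// Placeholder for an HTTP call to a data service.
		atomic.AddInt64(&done, 1)
	})
	fmt.Println("tasks:", done, "peak concurrency:", peak)
}
```

With the limit set to 100, no more than 100 outbound requests (and hence DNS lookups) are started at the same time, whatever the number of pending tasks.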

vkuznet commented 3 years ago

The issue seems to be related to this open GoLang ticket

vkuznet commented 3 years ago

and I found yet another discussion on this topic, see the following ticket

vkuznet commented 2 years ago

we no longer see this issue, closing