fedora-infra / mirrormanager2

Rewrite of the MirrorManager application in Flask and SQLAlchemy
https://mirrormanager.fedoraproject.org
GNU General Public License v2.0

Empower globally available mirrors to be accurately represented #267

Open survient opened 5 years ago

survient commented 5 years ago

Issue #194 already covers providing a way for a mirror to be marked as "global" but with a low priority. One issue I see with this is that a globally available mirror may have presence in areas with an abundant number of local mirrors but due to being deemed "low-priority" it may never get any hits and winds up being purely a fail-over mirror.

Another issue I see comes up when you have a globally available mirror load balanced via DNS using a feature of a platform such as an F5 GTM load balancer. To quote my message to the fedora mirror mailing distro:

Our DNS entries are load balanced via F5 GTM load balancers such that we use the same fqdn for our mirrors globally, but caching nameservers pull different entries depending on their latency/region. The problem is that we aren't able to add the same URL to multiple hosts as can be seen with my first issue. What I'd like to see is the ability for a single "Site" to have multiple hosts defined that are able to re-use the same URL in the case that global server load balancing via DNS is being used, possibly only with fedora admin approval and/or unique ASNs are used.

While I did suggest having the ability for multiple hosts to use the same URL, I'm not intimately familiar with the MirrorManager's inner workings so there may be a better way to go about this. I wanted to open this case at the recommendation of the fedora mirror mailing distro to brainstorm on this idea and see if a feasible solution could be crafted that has a reasonable chance of getting implemented within MirrorManager.

Alternatively, possibly something along the lines of allowing the "Country" field and/or the ASN field to be overloaded? Maybe some association between the two so that there is an ASN defined per country (unless MirrorManager would be intelligent enough to do this automatically)?

adrianreber commented 5 years ago

If someone provides a patch I am pretty sure we will include this and in theory it should not be too difficult to implement.

The problem I am seeing is that the URL is marked as something unique and there are probably multiple places in MirrorManager where the current code relies on that uniqueness.

This is probably also only useful for a small number of mirrors. If somebody is interested in exploring and implementing this that would be great.

mdomsch commented 5 years ago

The "low priority" "always_up2date" mirror configuration combination of options was intentional. We put the Fedora Infrastructure-maintained mirrors in that configuration, as mirrors of last resort. They're at the bottom of the prioritized mirrorlist then. They shouldn't be used directly (that's why we have all the mirrors) unless the mirrors are somehow all stale, or there are no mirrors in that country or continent.

We did not have globally available, single DNS named servers when we first wrote MirrorManager, so this is certainly not a design consideration. MM does do DNS lookups and then ASN lookups from a copy of the global route table, and GeoIP lookups to get the country. That DNS lookup only happens from one location (within Fedora Infrastructure), so MM would only see one DNS response for a mirror, and only look up one ASN and one country for it. In your proposal, MM would somehow have to learn of the other possible IP addresses (e.g. told via the admin interface), because it can't do it via DNS directly from the single location.

The uniqueness restriction was really only there to prevent one site admin from claiming to manage a Host already claimed by another Site admin. I suspect that could be removed.

mdomsch commented 5 years ago

The model expects you'd have multiple Hosts, one for each actual mirror (each one behind the load balancer). As crawls only happen from the one place (unless private or always_up2date is set, then no crawling), if you had 2 Hosts with the same URLs, MM couldn't crawl them separately, as they would resolve for the single-location crawler to the same IP address.

survient commented 5 years ago

If you manually specify the ASN per host entry, would it make any difference if the same URL was specified on each host (if multiple hosts with the same URL were allowed)?

mdomsch commented 5 years ago

MirrorManager admins (e.g. @adrianreber) can manually add ASNs to a given Host entry. Then the single Host entry (and single URL) can preferentially serve multiple ASNs. That was intended for obvious peering setups for private mirrors, which wouldn't completely work for you as a public mirror, because you want to attract traffic from ASNs other than just your peer list. We do not maintain a map of all global ASNs and how they route to each other.

Could you create different DNS entries that all resolve to the same load balancer, and then create separate Host entries, one with each URL, to get around the unique URL restriction? That would let you specify a separate country, and less importantly, a separate ASN, for each. That would let you attract end users from the same country and same continent to each mirror behind the load balancer. The public mirror list would then show the multiple Hosts, each with its own URL and country, even though in practice they all resolve to the same load balancer internally.

survient commented 5 years ago

Thanks @mdomsch. We're in the process of doing what you're describing as a workaround, but my network team has not created the needed DNS entries on our load balancers yet.

Reading over your explanation, a new question came to me. You're saying that Mirror Manager resolves the hostnames from the same location and determines the host country via GeoIP. In that case, why do hosts have a country field? Is it merely for notes, or does it actually have any technical impact?

I was also beginning to think of a different proposal as well. Would it be feasible to add an extra optional field to each host for the "hostname" but still have a separate URL field? My thinking is kind of along the lines of what you're talking about with the multiple DNS entries but instead of separate URLs just to find an optional way to make Mirror Manager aware that the URL is hosted in multiple geographic regions. Mirror Manager would use the hostname field to resolve the IP, but when clients receive the mirror details they get whatever URL is specified. Thinking this over I can see some integrity and security concerns so maybe this is only permitted by Mirror Manager admins though I'm pretty sure with some code there may be a way to automate some DNS validation that would cross check the domain of the URL against the hostname specified. I completely understand that any solution that results in extra code would need to be handled which I'm happy to look into doing. I just want to get some direction from the current Mirror Manager maintainers to see if such an approach even sounds reasonable. Thanks for the assistance.
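The automated cross-check mentioned above could look something like the following. This is only a rough sketch of the idea, not anything that exists in MirrorManager: the function name `shares_parent_domain`, the `labels` parameter, and the `example.com` names are all hypothetical, and comparing the last two DNS labels is a naive stand-in for a proper registered-domain check (it would be wrong for suffixes such as `co.uk`).

```python
from urllib.parse import urlsplit


def shares_parent_domain(url, lookup_hostname, labels=2):
    """Return True if the URL's host and the lookup hostname end in the
    same `labels` DNS labels (hypothetical helper, naive comparison)."""
    # Extract the host part of the URL; hostname is None for malformed URLs.
    url_host = (urlsplit(url).hostname or "").lower().rstrip(".")
    lookup = lookup_hostname.lower().rstrip(".")

    # Compare only the trailing labels, e.g. ["example", "com"].
    url_tail = url_host.split(".")[-labels:]
    lookup_tail = lookup.split(".")[-labels:]

    # Require a full set of labels so "com" alone never matches.
    return len(url_tail) == labels and url_tail == lookup_tail


# The published URL stays the same; the per-region hostname is used only for lookups.
same = shares_parent_domain("https://mirror.example.com/fedora/", "dfw.mirror.example.com")
other = shares_parent_domain("https://mirror.example.com/fedora/", "mirror.example.org")
```

A check like this would only catch obvious mismatches; admin approval would presumably still be needed for anything it cannot decide.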

mdomsch commented 5 years ago

MM doesn't generally look up mirror Hosts in GeoIP, only clients as they connect to the MM redirector. There was a one-time import of original mirrors (pre-dating MM) where we did GeoIP lookups, but that was years ago. There is also a command-line option in the crawler to crawl only a single continent, which resolves each hostname to an IP and then looks up that IP's continent in GeoIP, but I doubt that's actually being used in practice.

In practice, the geoip lookups are just on clients as they connect. The static country value entered in the MM admin UI is used as you'd expect.

mdomsch commented 5 years ago

Would it suffice if a Host could be in multiple countries instead of just one? Or do you have multiple mirrors in the same country?

survient commented 5 years ago

If a host could be in multiple countries that would be great and definitely something I'd like to see pursued as an option.

We do have multiple mirrors in the same country (3 in the US); however, it doesn't really matter based on what you've described. If the client is doing GeoIP lookups then their DNS will resolve whichever IP is closest to them by nature of the GTM load balancing. For example, if I do lookups on our caching nameservers in each of our 3 US datacenters:

$ dig +short mirror.rackspace.com @cachens1.dfw3.rackspace.com
74.205.112.120
$ dig +short mirror.rackspace.com @cachens1.ord1.rackspace.com
166.78.229.131
$ dig +short mirror.rackspace.com @cachens1.iad3.rackspace.com
72.4.120.219

there are 3 different IPs returned even though the hostname remains the same, and the same IP is returned each time for each region (no round robin). If other caching nameservers are used within those regions, the same IP should be returned, corresponding to the region closest to the nameserver being used.

mdomsch commented 5 years ago

ok, a few things. (Hi @adrianreber. I hope you don't mind my jumping in here. @survient it's been several years since I had anything to do with MM development, though I was the original author. Adrian and team have done a fantastic job with it for 5+ years).

The model has HostCountries and HostCountriesAllowed. I suspect they're not actually used. On the Host HTML page, the description talks about adding to HostCountriesAllowed, though the code there actually adds to HostCountries. HostCountriesAllowed is used. @adrianreber are those two tables empty in production? That'd be my guess.

Host.country is a simple string. One could either allow that to be a comma-separated string of country codes, or could actually use the HostCountries table. The former is somewhat simpler; I can even provide half a patch for it that splits that string by comma, and adds the host to each byCountry[country] list in the cache.
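For anyone following along, the comma-separated approach could be sketched roughly like this. It is a standalone illustration, not the actual MirrorManager cache code: `parse_countries`, `build_by_country`, and the `(host_id, country_field)` tuples are hypothetical stand-ins for the real Host objects and cache structures.

```python
from collections import defaultdict


def parse_countries(country_field):
    """Split a comma-separated country string into codes, keeping order
    and dropping blanks and duplicates (hypothetical helper)."""
    codes = []
    for code in (country_field or "").split(","):
        code = code.strip()
        if code and code not in codes:
            codes.append(code)
    return codes


def build_by_country(hosts):
    """Map each country code to the list of host ids serving it.

    `hosts` is an iterable of (host_id, country_field) pairs, an assumed
    shape standing in for real Host rows.
    """
    by_country = defaultdict(list)
    for host_id, country_field in hosts:
        # A host listed in several countries is added to each country's list.
        for code in parse_countries(country_field):
            by_country[code].append(host_id)
    return dict(by_country)


cache = build_by_country([(1, "US"), (2, "US, GB,DE"), (3, "")])
```

Here host 2 ends up in the lists for US, GB, and DE, while host 3, which has no country set, is not added to any list.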

Other than the above, Host.country as a single entry is really only used in one place explicitly, generating the host entry in the metalink result with location=XX. When the metalink is generated, we have a list of Hosts and their URLs, but don't know which of the several countries for that host we should be reporting it being in. It could be in a neighboring country on the same continent, so we can't use any of the request info to inform us. In practice I don't know how much it matters. yum/dnf just walk the list in priority order, the location=XX is just there for information, and if it were "wrong", I don't think they would break. Therefore, we could just return the first entry in a list if there are multiple.
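The "return the first entry" idea for the metalink could be as small as the sketch below. It assumes Host.country holds a comma-separated string as discussed above; `metalink_location` and its `default` parameter are hypothetical names, not existing MirrorManager functions.

```python
def metalink_location(country_field, default=""):
    """Pick the value to report as location=XX: the first non-empty
    country code in a comma-separated string (hypothetical helper)."""
    for code in (country_field or "").split(","):
        code = code.strip()
        if code:
            return code
    # No usable country code was stored for this host.
    return default


single = metalink_location("US")
multi = metalink_location("US, GB,DE")
```

Since the value is informational only, the order in which an admin lists the countries would decide what gets reported.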

mdomsch commented 5 years ago

(untested PR submitted, if only to demonstrate how I think it could work.)

innoobijr commented 1 year ago

Don't know if this is still active, but I would like to take a whack at this. Also @adrianreber @mdomsch, I would love to chat with y'all about the Fedora mirror infra in general.

adrianreber commented 1 year ago

@innoobijr you can reach out via email if you have questions.