etalab / noms-de-domaine-organismes-secteur-public

List of domain names of public-sector organizations

Those domains don't resolve to an address without their www prefix. #13

Closed JulienPalard closed 2 years ago

JulienPalard commented 2 years ago

As we want only domains giving a 200 over HTTP, it's better if they resolve.

Browsers are cool nowadays: if a domain doesn't resolve to an address, they automatically retry with a www prefix, so from a user's point of view these sites do return a 200. But if we want to script tools on top of this dataset, I think it's better to list the names that actually resolve.
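
To illustrate why this matters for scripts, here is a minimal sketch of the fallback a tool would otherwise have to implement itself. The function names `resolve_a` and `usable_host` are hypothetical and not part of this repository; `resolve_a` simply wraps a `dig A ... +short` lookup and assumes `dig` is installed.

```shell
# Hypothetical helper: print the A records for a name, one per line.
# Assumes dig is available; an empty output means "no IPv4 address".
resolve_a() {
    dig A "$1" +short
}

# Print the host name a script could use: the bare domain if it resolves,
# otherwise the www-prefixed name if that one resolves.
# Returns a non-zero status when neither name resolves.
usable_host() (
    domain=$1
    if [ -n "$(resolve_a "$domain")" ]; then
        printf '%s\n' "$domain"
    elif [ -n "$(resolve_a "www.$domain")" ]; then
        printf 'www.%s\n' "$domain"
    else
        return 1
    fi
)
```

Storing the resolvable name directly in the dataset means consumers can skip this kind of fallback entirely.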

bzg commented 2 years ago

Thanks! I'm not sure I understand: what was the problem and how does this PR fix it?

I see the nice additions, but I'd like to better understand your first message :)

JulienPalard commented 2 years ago

I first ran my certificate-watcher, and from the result file (named errors) I did:

grep 'Name or service not known' errors |  # Find those not resolving according to certificate-watcher
cut -d: -f1 |  # Get just the domain name
while read -r line
do
    if [ -z "$(dig A "$line" +short)" ]  # Ensure it does **not** resolve to an IPv4 address
    then
        if [ -n "$(dig A "www.$line" +short)" ]  # Ensure it **does** resolve to an IPv4 address when www. is added
        then
            sed -i "s/^$line$/www.$line/g" *.txt sources/*.txt  # Fix it
        fi
    fi
done
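
One caveat about the `sed` call in the loop above: the dots in a domain name are regular-expression wildcards, so the pattern could in principle match a line that differs only where the dots are. A hedged variant that escapes them first might look like the sketch below. The helper names `escape_dots` and `add_www` are hypothetical, and `sed -i` without a suffix argument assumes GNU sed, as in the original command.

```shell
# Hypothetical helper: escape the dots in a domain name so that sed
# treats them as literal characters rather than "any character".
escape_dots() {
    printf '%s\n' "$1" | sed 's/\./\\./g'
}

# Hypothetical helper: in the given files, replace lines that are exactly
# the bare domain with the www-prefixed name. Other lines are left alone.
# Arguments: the domain, then one or more files (edited in place, GNU sed).
add_www() (
    domain=$1
    shift
    pattern=$(escape_dots "$domain")
    sed -i "s/^$pattern\$/www.$domain/" "$@"
)
```

Inside the loop, the fix line would then become something like `add_www "$line" *.txt sources/*.txt`.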

After running that loop, I ran scripts/sort.py *.txt sources/*.txt.

So nothing was added; entries were only rewritten in place:

 7 files changed, 340 insertions(+), 340 deletions(-)

and:

diff domaines-organismes-publics.txt <(cat sources/*.txt | sort) 

still succeeds, meaning the aggregate file still matches the sorted concatenation of the source files.
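
For anyone who wants to script that consistency check, a minimal sketch could look like this. The function name `check_aggregate` is hypothetical; it reads `diff`'s exit status rather than its output and assumes the aggregate file was sorted under the same locale that `sort` uses here.

```shell
# Hypothetical helper: succeed only if the aggregate file is identical to
# the sorted concatenation of the source files.
# Arguments: the aggregate file, then one or more source files.
check_aggregate() (
    aggregate=$1
    shift
    # diff reads the sorted sources from stdin ("-") and compares them with
    # the aggregate; its exit status (0 = identical) is the function's result.
    cat "$@" | sort | diff - "$aggregate" > /dev/null
)
```

For example, `check_aggregate domaines-organismes-publics.txt sources/*.txt && echo consistent` would print `consistent` only when the two match.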

bzg commented 2 years ago

(Translated from French:) Thanks for all this information. I won't be available to look at it for another two days, but I will.

bzg commented 2 years ago

Sorry, switching back to English. I turned on my brain and finally grokked what's going on here. Thanks a lot for this, I'm rebasing/merging now.