dohsimpson / kubernetes-doc-pdf

Kubernetes PDF Documentation
282 stars 168 forks source link

Get documentation in two different languages #15

Closed Conkernel closed 5 months ago

Conkernel commented 5 months ago

Hi,

Some time ago I managed to get the documentation in Spanish using your solution. It works like a charm. The problem is that quite a lot of pages of the documentation are now translated into Spanish. So, for me, it would be great to find a way to "print" all the existing docs in Spanish, and print the pages that are not translated in English. Is it too much to ask for an update to the code to make this possible?

Thanks a lot for your effort

dohsimpson commented 5 months ago

Hi @Conkernel , I think in your last opened issue you were having some trouble generating the Spanish doc. Really glad you got the Spanish doc working!

If I understand correctly, you want to generate the same doc, but use Spanish wherever a page is translated, and fall back to English where it's not?

I don't think I will add support for this use case, as I don't have a use for it personally. However, you could try modifying https://github.com/dohsimpson/kubernetes-doc-pdf/blob/master/kubernetes-doc.py, which by default crawls the English site, and add a handler that checks whether the Spanish version of a page exists: if so, replace the URL with the Spanish URL; if not, default to the English URL.
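A minimal sketch of that kind of handler might look like the following. The function names (`localized_url`, `pick_url`) are made up for illustration and are not part of the repo; the existence check is injected as a callable so the URL logic stays testable without network access:

```python
# Sketch of a per-URL language fallback: rewrite an English docs URL to its
# localized counterpart, and keep the localized URL only if the page exists.
# `localized_url` and `pick_url` are illustrative names, not repo functions.

def localized_url(url, lang):
    """Rewrite an English kubernetes.io docs URL to its localized form."""
    return url.replace("kubernetes.io/docs", f"kubernetes.io/{lang}/docs")

def pick_url(url, lang, exists):
    """Return the localized URL if `exists(candidate)` reports the translated
    page is reachable (e.g. an HTTP 200), otherwise fall back to English."""
    candidate = localized_url(url, lang)
    return candidate if exists(candidate) else url
```

In the real crawler, `exists` could be a small wrapper around `requests.get(url, timeout=5)` that checks for a 200 status before accepting the translated page.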

Conkernel commented 5 months ago

Hi @dohsimpson

Thanks for your answer. I will try to do what you say. I can't code, but maybe ChatGPT can help with that.

Thanks a lot for your help!

Regards

dohsimpson commented 5 months ago

No problem, good luck to you! Gonna close the ticket now.

Conkernel commented 5 months ago

Hi @dohsimpson

I didn't want to bother you with all this about getting support for different languages, but I just want to let you know that I was finally able to get a two-language version of the app. First thing to know: it's the first time I've tried to code a "complex" app. Actually, I had only tried making some kind of bash script once before, so you will understand why this code is so bad :)

This version takes much more time to fetch the doc from the website, as it checks whether every URL from the English version also exists in the 2nd language. It will also catch URLs that only exist in the 2nd language and not in English. This way I think you should get all the existing web pages, both in English and in the language you choose.

The only thing to modify in order to get a language other than "es" is to change the "lang" variable in the first lines of the code and set it to another one ("it", for example). This way, you should be able to download all the web pages that exist in the "it" language, and the rest will be in English.

The code will create a new folder called "tmp/links_{concept}", where {concept} is one of "setup|reference|tasks|tutorials|concepts". Inside these folders, you will find the URLs of all the web pages it downloaded...

Please feel free to change anything in the code, as I'm more than sure it can be improved in a lot of ways.
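As one possible improvement (a sketch, not a change the script already makes): the "if i not in list: append" deduplication loops in the script can be replaced with an order-preserving one-liner, since dict keys keep insertion order in Python 3.7+:

```python
# Order-preserving deduplication: dict keys keep insertion order (Python 3.7+),
# so this does the same job as the "if i not in list: append" loops below.
def uniq(links):
    return list(dict.fromkeys(links))
```

For long link lists this is also much faster, because membership tests on a dict are O(1) while `i not in list` is O(n).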

Also, thanks for making this project possible, as now I've learned to code a little better :)

Here is the code:

import requests_html as rh
import os
import subprocess
import requests
import json
from pathlib import Path

# to change the language, set "lang" to the ISO code. Also, please check that it already exists on the k8s website
lang = "es"

def generate_directory_pdf(url1, name, s=None):
    # some needed variables...
    mydir = Path(f"tmp/links_{name}")
    mydir.mkdir(parents=True, exist_ok=True)
    final_links_to_download = f"tmp/links_{name}/links_to_download.json"
    url2 = f"https://kubernetes.io/{lang}/docs/{name}"

    s = rh.HTMLSession() if not s else s
    r1 = s.get(url1)
    r2 = s.get(url2)
    html = ""
    anchors1 = r1.html.find('.td-sidebar-link')
    anchors2 = r2.html.find('.td-sidebar-link')
    links_en = [a.absolute_links.pop() for a in anchors1 if a.element.tag == 'a']
    links_es = [a.absolute_links.pop() for a in anchors2 if a.element.tag == 'a']

    links_en_uniq_a_comprobar = []
    for i in links_en:
        if i not in links_en_uniq_a_comprobar:
            links_en_uniq_a_comprobar.append(i)

    links_solo_es_uniq = []
    for i in links_es:
        if i not in links_solo_es_uniq:
            links_solo_es_uniq.append(i)

    # note the f-string: without it, "{lang}" would be kept literally in the URL
    links_es_uniq_a_comprobar = [link.replace("kubernetes.io/docs", f"kubernetes.io/{lang}/docs") for link in links_en_uniq_a_comprobar]

    def check_url(tocheck):
        # return True if the URL responds with HTTP 200, False otherwise
        try:
            response = requests.get(tocheck, timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False

    checked_links_mixed = []
    for english, spanish in zip(links_en_uniq_a_comprobar, links_es_uniq_a_comprobar):
        if check_url(spanish):
            checked_links_mixed.append(spanish)
        else:
            checked_links_mixed.append(english)

    mixed_links_to_uniq = checked_links_mixed + links_solo_es_uniq
    filtered_mixed_links_for_lambda = []
    for i in mixed_links_to_uniq:
        if i not in filtered_mixed_links_for_lambda:
            filtered_mixed_links_for_lambda.append(i)

    links_post_lambda = filter(lambda href: href.startswith(url1) or href.startswith(url2), filtered_mixed_links_for_lambda)
    links_post_lambda_list = list(links_post_lambda)

    with open(final_links_to_download, 'w') as output_file:
        json.dump(links_post_lambda_list, output_file, indent=4)

    print("Downloading content from links...")
    cwd = os.getcwd()
    for l1 in links_post_lambda_list:
        r2 = s.get(l1)
        div = r2.html.find('.td-content', first=True, clean=True)
        if div:
            html += div.html
    # write the accumulated HTML once, after all pages have been fetched
    with open("{}/{}.html".format(cwd, name), "wt") as f:
        f.write(html)

    print("generating pdf in " + name )
    subprocess.run(["{}/weasy_print.sh".format(cwd), name])

if __name__ == '__main__':
    s = rh.HTMLSession()
    directories = [
                   "setup",
                   "concepts",
                   "tasks",
                   "tutorials",
                   "reference",
                   ]
    directories_pairs = [("https://kubernetes.io/docs/{}/".format(n.lower()), n) for n in directories]
    for url1, name in directories_pairs:
        print("Working with the content in url : " + url1)
        generate_directory_pdf(url1, name)

I tried to clean the code as much as I could, because my original one was full of tests and different attempts to understand how things work. If you run into any kind of trouble with the clean code, here is the dirty one, the one I ran most of my tests with:

import requests_html as rh
import os
# import pypandoc
import subprocess
import time
import requests
import json
from pathlib import Path

def generate_directory_pdf(url1, name, s=None):
    mydir = Path(f"tmp/links_{name}")
    mydir.mkdir(parents=True, exist_ok=True)
    divss = f"tmp/links_{name}/divs.json"
    file_links_en_a_comprobar = f"tmp/links_{name}/listado_en_a_comprobar.json"
    file_links_es_a_comprobar = f"tmp/links_{name}/listado_es_a_comprobar.json"
    file_solo_es_comprobados = f"tmp/links_{name}/file_solo_es_comprobados.json"
    file_solo_en_comprobados = f"tmp/links_{name}/file_solo_en_comprobados.json"
    file_filtered_mixed_links_for_lambda = f"tmp/links_{name}/filtered_mixed_links_for_lambda.json"
    file_links_post_lambda_to_download = f"tmp/links_{name}/Final_post_lambda_links_to_download.json"
    file_links_post_lambda_LIST_to_download = f"tmp/links_{name}/Final_post_lambda_LIST_links_to_download.json"
    lang = "es"

    url2 = f"https://kubernetes.io/{lang}/docs/{name}"
    # Store in links all the references found at the url:
    s = rh.HTMLSession() if not s else s
    r1 = s.get(url1)
    r2 = s.get(url2)
    html = ""
    anchors1 = r1.html.find('.td-sidebar-link')
    anchors2 = r2.html.find('.td-sidebar-link')
    links_en = [a.absolute_links.pop() for a in anchors1 if a.element.tag == 'a']
    links_es = [a.absolute_links.pop() for a in anchors2 if a.element.tag == 'a'] # all the links found on the Spanish site

    # Deduplicate the full list of urls
    links_en_uniq_a_comprobar = []
    for i in links_en:
        if i not in links_en_uniq_a_comprobar:
            links_en_uniq_a_comprobar.append(i) # drop duplicates of the same url

    links_solo_es_uniq = []
    for i in links_es:
        if i not in links_solo_es_uniq:
            links_solo_es_uniq.append(i) # drop duplicates of the same url

    # Build a list with all the English urls converted to Spanish:
    links_en_converted_to_es = [link.replace("kubernetes.io/docs", "kubernetes.io/es/docs") for link in links_en_uniq_a_comprobar]

    # Rename to keep the naming consistent between languages:
    links_es_uniq_a_comprobar = links_en_converted_to_es

    # Dump links_en_uniq_a_comprobar to the listado_en file
    with open(file_links_en_a_comprobar, 'w') as output_file:
        json.dump(links_en_uniq_a_comprobar, output_file, indent=4)

    # Dump the converted Spanish urls to the listado_es file
    with open(file_links_es_a_comprobar, 'w') as output_file:
        json.dump(links_es_uniq_a_comprobar, output_file, indent=4)

    # Function that checks whether a url is reachable
    def check_url(tocheck):
        try:
            response = requests.get(tocheck, timeout=5)
            return response.status_code == 200
        except requests.RequestException:
            return False

    # Check the reachability of the links in links and links_es
    # Everything that exists in Spanish goes into checked_links_mixed; the rest fall back to English
    checked_links_mixed = []
    links_es_comprobados = []
    links_en_comprobados = []
    for english, spanish in zip(links_en_uniq_a_comprobar, links_es_uniq_a_comprobar):
        #print(f"Testing urls: {english} and {spanish}")
        if check_url(spanish):
            checked_links_mixed.append(spanish)
            links_es_comprobados.append(spanish)
            #print(f"Adding url {spanish} to Spanish and discarding {english}")
        else:
            checked_links_mixed.append(english)
            links_en_comprobados.append(english)

    print("The list of VERIFIED SPANISH links can now be inspected in file_solo_es_comprobados")
    with open(file_solo_es_comprobados, 'w') as output_file:
        json.dump(links_es_comprobados, output_file, indent=4)

    print("The list of VERIFIED ENGLISH links can now be inspected in file_solo_en_comprobados")
    with open(file_solo_en_comprobados, 'w') as output_file:
        json.dump(links_en_comprobados, output_file, indent=4)
    time.sleep(15)

    # Add the links that were ONLY found in Spanish
    mixed_links_to_uniq = checked_links_mixed + links_solo_es_uniq

    # Remove possible duplicates between checked_links_mixed and links_solo_es_uniq
    filtered_mixed_links_for_lambda = []
    for i in mixed_links_to_uniq:
        if i not in filtered_mixed_links_for_lambda:
            filtered_mixed_links_for_lambda.append(i) # drop duplicates of the same url

    # After this, only the reachable urls should remain, mixed in the two languages

    #for i in links_to_download:
    #    print(f"Links compared and added to -Spanish only-, one by one (before the lambda): {i}")

    print("Length of filtered_mixed_links_for_lambda:", len(filtered_mixed_links_for_lambda))
    print("final_total_links:", filtered_mixed_links_for_lambda)
    time.sleep(10)

    # Write the final list of urls to download BEFORE the lambda filter:
    with open(file_filtered_mixed_links_for_lambda, 'w') as output_file:  # renamed 'file' to 'output_file'
        json.dump(filtered_mixed_links_for_lambda, output_file, indent=4)
    print("The pre-lambda links can now be inspected in file_filtered_mixed_links_for_lambda")
    time.sleep(15)

    #yinks = filter(lambda href: href.startswith(url2), links_variable)
    links_post_lambda = filter(lambda href: href.startswith(url1) or href.startswith(url2), filtered_mixed_links_for_lambda)
    links_post_lambda_list = list(links_post_lambda)

    # Check what ended up in the list once the LAMBDA filter has run:
    with open(file_links_post_lambda_LIST_to_download, 'w') as output_file:  # renamed 'file' to 'output_file'
        json.dump(links_post_lambda_list, output_file, indent=4)

    print("The post-lambda list can now be inspected in file_links_post_lambda_LIST_to_download")
    print("Length of links_post_lambda_list AFTER the LAMBDA:", len(links_post_lambda_list))

    input("Press Enter to continue...")

    print("downloading...")
    cwd = os.getcwd()
    for l1 in links_post_lambda_list:
        print(f"Final links post lambda: {l1}")
        r2 = s.get(l1)
        div = r2.html.find('.td-content', first=True, clean=True)
        print(f"Searching for divs in {l1}. This is a div: {div}.")
        if div:
            print(f"Found div: {div}")
            html += div.html
    # write the accumulated HTML once, after all pages have been fetched
    with open("{}/{}.html".format(cwd, name), "wt") as f:
        f.write(html)

    print("generating pdf in " + name )
    subprocess.run(["{}/weasy_print.sh".format(cwd), name])

if __name__ == '__main__':
    s = rh.HTMLSession()
    directories = [
                   "setup",
                   "concepts",
                   "tasks",
                   "tutorials",
                   "reference",
                   ]
    directories_pairs = [("https://kubernetes.io/docs/{}/".format(n.lower()), n) for n in directories]
    for url1, name in directories_pairs:
        print("URL: " + url1, "Directorio: " + name)
        print(name)
        generate_directory_pdf(url1, name)
        print("Generamos directorio con " + url1 + " y " + name )

I hope it works for you.

Best regards!!

dohsimpson commented 5 months ago

@Conkernel Congrats! That's amazing, great work! I'm so glad that you were able to learn to code and solve your problem! Keep up the good work!

Conkernel commented 5 months ago

Thanks. As I didn't know if you were going to see this closed thread, I opened a new issue to share the code with you. As it's not an issue, you can just close it, or do whatever you want with it, as my version needs quite a lot of improvement in order to be published :)

Regards @dohsimpson