Hi @Conkernel, I think in your last opened issue you were having some trouble generating the Spanish doc. Really glad you got the Spanish doc working!
If I understand correctly, you want to generate the same doc, but use Spanish only where a page is translated, and fall back to English where it's not?
I don't think I will add support for this use case, as I don't have a use for it personally. However, you could try modifying https://github.com/dohsimpson/kubernetes-doc-pdf/blob/master/kubernetes-doc.py which by default crawls the English site, to add a handler that checks whether the Spanish version of the page exists; if so, replace the URL with the Spanish URL, if not, default to the English URL.
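Something along these lines, untested (the helper name is just for illustration, and it assumes untranslated pages on kubernetes.io answer with a non-200 status):

import requests

def localized_or_english(english_url, lang="es"):
    # Illustrative helper: return the translated URL when that page exists,
    # otherwise fall back to the English one.
    candidate = english_url.replace("kubernetes.io/docs", f"kubernetes.io/{lang}/docs")
    try:
        if requests.get(candidate, timeout=5).status_code == 200:
            return candidate
    except requests.RequestException:
        pass
    return english_url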
Hi @dohsimpson
Thanks for your answer. I will try to do what you say. I can't code, but maybe ChatGPT can help with that.
Thanks a lot for your help!
Regards
No problem, good luck to you! Gonna close the ticket now.
Hi @dohsimpson
I didn't want to bother you any more with this whole thing about getting support for different languages, but I just want to let you know that I was finally able to get a two-language version of the app. The first thing to know is that it's the first time I've tried to code a "complex" app. Actually, I had only tried making some kind of bash script once before, so you will understand why this code is so bad :)
This version takes much longer to fetch the doc from the website, as it checks whether every URL from the English version also exists in the second language. It will also catch some URLs that only exist in the second language and not in English. This way I think you should get all the existing web pages, both for English and for the language you choose.
The only thing to modify in order to get a language other than "es" is the "lang" variable in the first lines of the code: set it to another code ("it", for example). This way, you should be able to download all the web pages that exist in the "it" language, and the rest will be in English.
The code will create new folders called "tmp/links_{concept}", where concept is one of "setup|reference|tasks|tutorials|concepts". Inside these folders, you will be able to find the URLs of all the web pages it downloaded.
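With the clean version below, the tmp layout should end up roughly like this, assuming the five default sections:

tmp/
    links_setup/links_to_download.json
    links_concepts/links_to_download.json
    links_tasks/links_to_download.json
    links_tutorials/links_to_download.json
    links_reference/links_to_download.json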
Please feel free to change anything inside the code, as I'm more than sure it can be improved in a lot of ways.
Also, thanks for making this project possible, as now I've learned to code a little better :)
Here is the code:
import requests_html as rh
import os
import subprocess
import requests
import json
from pathlib import Path

# To change the language, set "lang" to the ISO code. Also, please check that it already exists on the k8s website.
lang = "es"

def generate_directory_pdf(url1, name, s=None):
    # Some needed variables...
    mydir = Path(f"tmp/links_{name}")
    mydir.mkdir(parents=True, exist_ok=True)
    final_links_to_download = f"tmp/links_{name}/links_to_download.json"
    url2 = f"https://kubernetes.io/{lang}/docs/{name}"
    s = rh.HTMLSession() if not s else s
    r1 = s.get(url1)
    r2 = s.get(url2)
    html = ""
    anchors1 = r1.html.find('.td-sidebar-link')
    anchors2 = r2.html.find('.td-sidebar-link')
    links_en = [a.absolute_links.pop() for a in anchors1 if a.element.tag == 'a']
    links_es = [a.absolute_links.pop() for a in anchors2 if a.element.tag == 'a']
    # Deduplicate the English links, preserving order
    links_en_uniq_a_comprobar = []
    for i in links_en:
        if i not in links_en_uniq_a_comprobar:
            links_en_uniq_a_comprobar.append(i)
    # Deduplicate the links found on the translated sidebar
    links_solo_es_uniq = []
    for i in links_es:
        if i not in links_solo_es_uniq:
            links_solo_es_uniq.append(i)
    # Build the candidate translated URL for every English URL
    # (note the f-string: without it, "{lang}" would be kept literally and every check would fail)
    links_es_uniq_a_comprobar = [link.replace("kubernetes.io/docs", f"kubernetes.io/{lang}/docs") for link in links_en_uniq_a_comprobar]

    # Returns True if the URL answers with HTTP 200
    def check_url(tocheck):
        try:
            return requests.get(tocheck, timeout=5).status_code == 200
        except requests.RequestException:
            return False

    # Keep the translated URL when it exists, otherwise fall back to the English one
    checked_links_mixed = []
    for english, spanish in zip(links_en_uniq_a_comprobar, links_es_uniq_a_comprobar):
        if check_url(spanish):
            checked_links_mixed.append(spanish)
        else:
            checked_links_mixed.append(english)
    # Add the links that only exist in the translated sidebar, then deduplicate again
    mixed_links_to_uniq = checked_links_mixed + links_solo_es_uniq
    filtered_mixed_links_for_lambda = []
    for i in mixed_links_to_uniq:
        if i not in filtered_mixed_links_for_lambda:
            filtered_mixed_links_for_lambda.append(i)
    # Keep only links that belong to this doc section, in either language
    links_post_lambda = filter(lambda href: href.startswith(url1) or href.startswith(url2), filtered_mixed_links_for_lambda)
    links_post_lambda_list = list(links_post_lambda)
    with open(final_links_to_download, 'w') as output_file:
        json.dump(links_post_lambda_list, output_file, indent=4)
    print("Downloading content from links...")
    cwd = os.getcwd()
    for l1 in links_post_lambda_list:
        r2 = s.get(l1)
        div = r2.html.find('.td-content', first=True, clean=True)
        if div:
            html += div.html
    with open("{}/{}.html".format(cwd, name), "wt") as f:
        f.write(html)
    print("generating pdf in " + name)
    subprocess.run(["{}/weasy_print.sh".format(cwd), name])

if __name__ == '__main__':
    s = rh.HTMLSession()
    directories = [
        "setup",
        "concepts",
        "tasks",
        "tutorials",
        "reference",
    ]
    directories_pairs = [("https://kubernetes.io/docs/{}/".format(n.lower()), n) for n in directories]
    for url1, name in directories_pairs:
        print("Working with the content in url : " + url1)
        generate_directory_pdf(url1, name, s)  # pass the shared session so it gets reused
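One possible speed-up, as an untested sketch: check_url only needs the status code, not the page body, so a HEAD request could replace the GET, assuming kubernetes.io answers HEAD requests the same way it answers GET:

def check_url(tocheck):
    # Untested variant: HEAD skips downloading the page body (which the check
    # throws away anyway), so each existence check should finish much faster.
    try:
        return requests.head(tocheck, timeout=5, allow_redirects=True).status_code == 200
    except requests.RequestException:
        return False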
I tried to clean the code up as much as I could, because my original one was full of tests and different attempts to understand how things work. If you get any kind of trouble with the clean code, here you have the dirty one, which is the one I ran most of my tests with:
import requests_html as rh
import os
# import pypandoc
import subprocess
import time
import requests
import json
from pathlib import Path

def generate_directory_pdf(url1, name, s=None):
    mydir = Path(f"tmp/links_{name}")
    mydir.mkdir(parents=True, exist_ok=True)
    divss = f"tmp/links_{name}/divs.json"
    file_links_en_a_comprobar = f"tmp/links_{name}/listado_en_a_comprobar.json"
    file_links_es_a_comprobar = f"tmp/links_{name}/listado_es_a_comprobar.json"
    file_solo_es_comprobados = f"tmp/links_{name}/file_solo_es_comprobados.json"
    file_solo_en_comprobados = f"tmp/links_{name}/file_solo_en_comprobados.json"
    file_filtered_mixed_links_for_lambda = f"tmp/links_{name}/filtered_mixed_links_for_lambda.json"
    file_links_post_lambda_to_download = f"tmp/links_{name}/Final_post_lambda_links_to_download.json"
    file_links_post_lambda_LIST_to_download = f"tmp/links_{name}/Final_post_lambda_LIST_links_to_download.json"
    lang = "es"
    url2 = f"https://kubernetes.io/{lang}/docs/{name}"
    # Store in links all the references we find in url:
    s = rh.HTMLSession() if not s else s
    r1 = s.get(url1)
    r2 = s.get(url2)
    html = ""
    anchors1 = r1.html.find('.td-sidebar-link')
    anchors2 = r2.html.find('.td-sidebar-link')
    links_en = [a.absolute_links.pop() for a in anchors1 if a.element.tag == 'a']
    links_es = [a.absolute_links.pop() for a in anchors2 if a.element.tag == 'a']  # everything it found in Spanish
    # Deduplicate the full list of URLs
    links_en_uniq_a_comprobar = []
    for i in links_en:
        if i not in links_en_uniq_a_comprobar:
            links_en_uniq_a_comprobar.append(i)  # drop repeats of the same URL
    links_solo_es_uniq = []
    for i in links_es:
        if i not in links_solo_es_uniq:
            links_solo_es_uniq.append(i)  # drop repeats of the same URL
    # Build an a_comprobar_en_to_es list with all the URLs converted to Spanish:
    links_en_converted_to_es = [link.replace("kubernetes.io/docs", "kubernetes.io/es/docs") for link in links_en_uniq_a_comprobar]
    # Rename to keep the naming consistent between languages:
    links_es_uniq_a_comprobar = links_en_converted_to_es
    # Dump links_en_uniq_a_comprobar to the listado_en file
    with open(file_links_en_a_comprobar, 'w') as output_file:
        json.dump(links_en_uniq_a_comprobar, output_file, indent=4)
    # Dump a_comprobar_en_to_es to the listado_es file
    with open(file_links_es_a_comprobar, 'w') as output_file:
        json.dump(links_es_uniq_a_comprobar, output_file, indent=4)

    # Function that checks whether a URL is reachable
    def check_url(tocheck):
        try:
            response = requests.get(tocheck, timeout=5)
            if response.status_code == 200:
                return True
            else:
                return False
        except requests.RequestException:
            return False

    # Check reachability for the contents of links and links_es.
    # Everything that exists in Spanish goes into checked_links, the rest in English.
    checked_links_mixed = []
    links_es_comprobados = []
    links_en_comprobados = []
    for english, spanish in zip(links_en_uniq_a_comprobar, links_es_uniq_a_comprobar):
        # print(f"Testing URLs: {english} and {spanish}")
        if check_url(spanish):
            checked_links_mixed.append(spanish)
            links_es_comprobados.append(spanish)
            # print(f"Adding URL {spanish} as Spanish and discarding {english}")
        else:
            checked_links_mixed.append(english)
            links_en_comprobados.append(english)
    print("The list of CHECKED SPANISH links can already be inspected in file_solo_es_comprobados")
    with open(file_solo_es_comprobados, 'w') as output_file:
        json.dump(links_es_comprobados, output_file, indent=4)
    print("The list of CHECKED ENGLISH links can already be inspected in file_solo_en_comprobados")
    with open(file_solo_en_comprobados, 'w') as output_file:
        json.dump(links_en_comprobados, output_file, indent=4)
    time.sleep(15)
    # Add the links that were ONLY found in Spanish
    mixed_links_to_uniq = checked_links_mixed + links_solo_es_uniq
    # Remove possible repeats between mixed_links_to_download and links_solo_es_uniq
    filtered_mixed_links_for_lambda = []
    for i in mixed_links_to_uniq:
        if i not in filtered_mixed_links_for_lambda:
            filtered_mixed_links_for_lambda.append(i)  # drop repeats of the same URL
    # After this, only the reachable URLs should remain, mixed across the two languages
    # for i in links_to_download:
    #     print(f"Links compared and added to -Spanish only-, one by one (before lambda): {i}")
    print("Length of filtered_mixed_links_for_lambda:", len(filtered_mixed_links_for_lambda))
    print("final_total_links:", filtered_mixed_links_for_lambda)
    time.sleep(10)
    # Write the final set of URLs to download BEFORE the LAMBDA filter to its file:
    with open(file_filtered_mixed_links_for_lambda, 'w') as output_file:  # renamed 'file' to 'output_file'
        json.dump(filtered_mixed_links_for_lambda, output_file, indent=4)
    print("The pre-lambda links can already be inspected in file_filtered_mixed_links_for_lambda")
    time.sleep(15)
    # yinks = filter(lambda href: href.startswith(url2), links_variable)
    links_post_lambda = filter(lambda href: href.startswith(url1) or href.startswith(url2), filtered_mixed_links_for_lambda)
    links_post_lambda_list = list(links_post_lambda)
    # Check what is left in the list once it has gone through the LAMBDA:
    with open(file_links_post_lambda_LIST_to_download, 'w') as output_file:  # renamed 'file' to 'output_file'
        json.dump(links_post_lambda_list, output_file, indent=4)
    print("The post-lambda list can already be inspected in file_links_post_lambda_LIST_to_download")
    print("Length of links_post_lambda_list AFTER the LAMBDA:", len(links_post_lambda_list))
    input("Press Enter to continue...")
    print("downloading...")
    cwd = os.getcwd()
    for l1 in links_post_lambda_list:
        print(f"Final Links post Lambda: {l1}")
        r2 = s.get(l1)
        div = r2.html.find('.td-content', first=True, clean=True)
        print(f"Searching for the divs in {l1}. This is a div: {div}.")
        if div:
            print(f"div exists: {div}")
            html += div.html
    with open("{}/{}.html".format(cwd, name), "wt") as f:
        f.write(html)
    print("generating pdf in " + name)
    subprocess.run(["{}/weasy_print.sh".format(cwd), name])

if __name__ == '__main__':
    s = rh.HTMLSession()
    directories = [
        "setup",
        "concepts",
        "tasks",
        "tutorials",
        "reference",
    ]
    directories_pairs = [("https://kubernetes.io/docs/{}/".format(n.lower()), n) for n in directories]
    for url1, name in directories_pairs:
        print("URL: " + url1, "Directory: " + name)
        print(name)
        generate_directory_pdf(url1, name, s)  # pass the shared session so it gets reused
        print("Generating directory with " + url1 + " and " + name)
I hope it works for you.
Best regards!!
@Conkernel Congrats! That's amazing, great work! I'm so glad that you were able to learn to code and solve your problem! Keep up the good work!
Thanks. As I didn't know if you were going to see this closed thread, I opened a new issue to share the code with you. As it's not really an issue, you can just close it, or do whatever you want with it, as my version needs quite a lot of improvement in order to be published :)
Regards @dohsimpson
Hi,
Some time ago I managed to get the documentation in Spanish using your solution. It works like a charm. The problem is that quite a lot of pages of the documentation are not translated to Spanish. So, for me, it would be great to find a way to "print" all the existing doc in Spanish and, for all the pages that are not translated, print them in English. Is it too much to ask for an update to the code in order to make this possible?
Thanks a lot for your effort