UKPLab / EasyNMT

Easy to use, state-of-the-art Neural Machine Translation for 100+ languages
Apache License 2.0

some questions #61

Open kucingkembar opened 2 years ago

kucingkembar commented 2 years ago

Hi, sorry for my bad English. I have some questions that I hope you can answer:

  1. If I run the program from Python IDLE (the Integrated Development and Learning Environment), the download of new models gets stuck at about 1 kbps, and once the download passes roughly 20%, the program uses a lot of RAM and CPU. If I run the same script from the Windows CMD, the problem does not occur. Can you look into this?
  2. This software loads about 1.8 GB of data into RAM. What I don't understand is: why does it still need an internet connection when 1.8 GB of data is already in RAM?
  3. I used the default model (it is called "opus-mt", if I am not wrong). That model can translate 186 languages. If I calculate 1.8 GB / 186 languages, each language requires about 10 MB of RAM. Can we load only a specific language pair, such as English to French or English to German? I think this software could then work with only about 20 MB of RAM.

Thank you for reading, and for the great work. Have a nice day.

nreimers commented 2 years ago

Hi, 1) Sadly, I don't know why this happens. 2) You can configure the Hugging Face transformers library to work in offline mode. 3) Each language direction is its own model of ~350 MB in size.

kucingkembar commented 2 years ago

Sorry for the late reply, and thank you for your answer.

  1. I hope you can test it and mention this problem to users in the description.
  2. Is the "offline mode" really offline, or does it need a local/Ethernet connection to an active server? If it does not, can you provide a tutorial?
  3. Is the ~350 MB the size on HDD/SSD, or the size in RAM? If it is the RAM size, can you provide a tutorial?

Thank you again for the reply. Have a nice day.

nreimers commented 2 years ago

2) The models need to be downloaded at least once. They are then cached on disk. 3) Size on disk. Transformer models require quite a lot of memory.
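
For readers following along, here is a minimal sketch of how the offline mode mentioned above might be enabled from Python. It assumes the models were already downloaded and cached during one earlier online run, and it relies on the `TRANSFORMERS_OFFLINE` and `HF_HUB_OFFLINE` environment variables of the Hugging Face libraries. The helper names `enable_hf_offline_mode` and `translate_offline` are hypothetical and are not part of EasyNMT. Whether this avoids every network lookup depends on the installed transformers and huggingface_hub versions.

```python
import os


def enable_hf_offline_mode() -> None:
    """Ask the Hugging Face libraries to use only locally cached files.

    The variables are read when transformers / huggingface_hub are imported,
    so this must run before those imports happen.
    """
    os.environ["TRANSFORMERS_OFFLINE"] = "1"
    os.environ["HF_HUB_OFFLINE"] = "1"


def translate_offline(text: str, target_lang: str = "de") -> str:
    """Hypothetical helper: translate with EasyNMT using cached models only."""
    enable_hf_offline_mode()
    # Imported late on purpose, after the environment variables are set.
    from easynmt import EasyNMT

    model = EasyNMT("opus-mt")
    return model.translate(text, target_lang=target_lang)
```

Calling `translate_offline("Berlin is the capital of Germany.")` should then work without a connection, provided the cache already holds the required model. The same variables can also be set in the shell before starting Python.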

kucingkembar commented 2 years ago

Thanks again for the reply.

  2. I know the program needs to download a model at least once, after which it is cached on disk. However, even after the model has been loaded, the program still requires an internet connection to produce the correct result. Here is the evidence:

the code:

from easynmt import EasyNMT
model = EasyNMT('opus-mt')

document = """Berlin is the capital and largest city of Germany by both area and population.[6][7] Its 3,769,495 inhabitants as of 31 December 2019[2] make it the most-populous city of the European Union, according to population within city limits.[8] The city is also one of Germany's 16 federal states. It is surrounded by the state of Brandenburg, and contiguous with Potsdam, Brandenburg's capital. The two cities are at the center of the Berlin-Brandenburg capital region, which is, with about six million inhabitants and an area of more than 30,000 km2,[9] Germany's third-largest metropolitan region after the Rhine-Ruhr and Rhine-Main regions. Berlin straddles the banks of the River Spree, which flows into the River Havel (a tributary of the River Elbe) in the western borough of Spandau. Among the city's main topographical features are the many lakes in the western and southeastern boroughs formed by the Spree, Havel, and Dahme rivers (the largest of which is Lake Müggelsee). Due to its location in the European Plain, Berlin is influenced by a temperate seasonal climate. About one-third of the city's area is composed of forests, parks, gardens, rivers, canals and lakes.[10] The city lies in the Central German dialect area, the Berlin dialect being a variant of the Lusatian-New Marchian dialects.

First documented in the 13th century and at the crossing of two important historic trade routes,[11] Berlin became the capital of the Margraviate of Brandenburg (1417–1701), the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–1933), and the Third Reich (1933–1945).[12] Berlin in the 1920s was the third-largest municipality in the world.[13] After World War II and its subsequent occupation by the victorious countries, the city was divided; West Berlin became a de facto West German exclave, surrounded by the Berlin Wall (1961–1989) and East German territory.[14] East Berlin was declared capital of East Germany, while Bonn became the West German capital. Following German reunification in 1990, Berlin once again became the capital of all of Germany.

Berlin is a world city of culture, politics, media and science.[15][16][17][18] Its economy is based on high-tech firms and the service sector, encompassing a diverse range of creative industries, research facilities, media corporations and convention venues.[19][20] Berlin serves as a continental hub for air and rail traffic and has a highly complex public transportation network. The metropolis is a popular tourist destination.[21] Significant industries also include IT, pharmaceuticals, biomedical engineering, clean tech, biotechnology, construction and electronics."""

# Translate the document to German
print(model.translate(document, target_lang='de'))

When the internet is connected:

Microsoft Windows [Version 10.0.22000.556]
(c) Microsoft Corporation. All rights reserved.

C:\Users\GIGABYTE>C:\Users\GIGABYTE\Desktop\EasyNMT.py
Berlin ist die Hauptstadt und größte Stadt Deutschlands sowohl in der Region als auch in der Bevölkerung.[6][7] Die 3.769,495 Einwohner machen sie zum 31. Dezember 2019[2] zur bevölkerungsreichsten Stadt der Europäischen Union, nach der Bevölkerung innerhalb der Stadtgrenzen.[8] Die Stadt ist auch einer der 16 Bundesländer Deutschlands. Sie ist von Brandenburg umgeben und mit Potsdam, der Hauptstadt Brandenburgs, verbunden. Die beiden Städte befinden sich im Zentrum der Hauptstadtregion Berlin-Brandenburg, mit rund sechs Millionen Einwohnern und einer Fläche von mehr als 30.000 km2,[9] Deutschlands drittgrößter Metropolregion nach den Regionen Rhein-Ruhr und Rhein-Main. Berlin erstreckt sich über das Ufer der Spree, die in den Havel (ein Nebenfluss der Elbe) im westlichen Bezirk Spandau mündet. Zu den wichtigsten topographischen Merkmalen der Stadt gehören die zahlreichen Seen in den westlichen und südöstlichen Stadtteilen, die von den Flüssen Spree, Havel und Dahme gebildet wurden (der größte davon ist der Müggelsee). Aufgrund seiner Lage in der Europäischen Ebene wird Berlin von einem gemäßigten saisonalen Klima beeinflusst. Etwa ein Drittel des Stadtgebiets besteht aus Wäldern, Parks, Gärten, Flüssen, Kanälen und Seen.[10] Die Stadt liegt im mitteldeutschen Dialektgebiet, der Berliner Dialekt ist eine Variante der Lusatian-New Marchian Dialekte.

Zum ersten Mal dokumentiert im 13. Jahrhundert und an der Kreuzung zweier wichtiger historischer Handelswege,[11] wurde Berlin die Hauptstadt der Mark Brandenburg (1417–1701), des Königreichs Preußen (1701–1918), des Deutschen Reiches (1871–1918), der Weimarer Republik (1919–1933) und des Dritten Reiches (1933–1945).[12] Berlin war in den 1920er Jahren die drittgrößte Gemeinde der Welt.[13] Nach dem Zweiten Weltkrieg und seiner anschließenden Besetzung durch die siegreichen Länder wurde die Stadt geteilt; West-Berlin wurde de facto eine westdeutsche Exklave, umgeben von der Berliner Mauer (1961–1989) und dem ostdeutschen Territorium.[14] Ost-Berlin wurde zur Hauptstadt Ostdeutschlands erklärt, während Bonn zur westdeutschen Hauptstadt wurde. Nach der deutschen Wiedervereinigung 1990 wurde Berlin wieder zur Hauptstadt ganz Deutschlands.

Berlin ist eine Weltstadt der Kultur, Politik, Medien und Wissenschaft.[15][16][17][18] Seine Wirtschaft basiert auf High-Tech-Firmen und dem Dienstleistungssektor, der eine Vielzahl von Kreativindustrien, Forschungseinrichtungen, Medienunternehmen und Kongressstätten umfasst.[19][20] Berlin dient als kontinentale Drehscheibe für den Luft- und Schienenverkehr und verfügt über ein hochkomplexes öffentliches Verkehrsnetz. Die Metropole ist ein beliebtes Touristenziel.[21] Zu den bedeutenden Industriezweigen gehören auch IT, Pharmazie, biomedizinische Technik, Clean Tech, Biotechnologie, Bauwesen und Elektronik.

C:\Users\GIGABYTE>

When the internet is not connected:

Microsoft Windows [Version 10.0.22000.556]
(c) Microsoft Corporation. All rights reserved.

C:\Users\GIGABYTE>C:\Users\GIGABYTE\Desktop\EasyNMT.py
Exception: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/Helsinki-NLP/opus-mt-en-de (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001BD3E507760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
Traceback (most recent call last):
  File "C:\Python\Python39\lib\site-packages\urllib3\connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "C:\Python\Python39\lib\site-packages\urllib3\util\connection.py", line 73, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "C:\Python\Python39\lib\socket.py", line 954, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "C:\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "C:\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "C:\Python\Python39\lib\site-packages\urllib3\connection.py", line 358, in connect
    conn = self._new_conn()
  File "C:\Python\Python39\lib\site-packages\urllib3\connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000001BD3E507760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Python\Python39\lib\site-packages\requests\adapters.py", line 440, in send
    resp = conn.urlopen(
  File "C:\Python\Python39\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "C:\Python\Python39\lib\site-packages\urllib3\util\retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/Helsinki-NLP/opus-mt-en-de (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001BD3E507760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\GIGABYTE\Desktop\EasyNMT.py", line 24, in <module>
    print(model.translate(document, target_lang='de'))
  File "C:\Python\Python39\lib\site-packages\easynmt\EasyNMT.py", line 154, in translate
    raise e
  File "C:\Python\Python39\lib\site-packages\easynmt\EasyNMT.py", line 149, in translate
    translated = self.translate(**method_args)
  File "C:\Python\Python39\lib\site-packages\easynmt\EasyNMT.py", line 181, in translate
    translated_sentences = self.translate_sentences(splitted_sentences, target_lang=target_lang, source_lang=source_lang, show_progress_bar=show_progress_bar, beam_size=beam_size, batch_size=batch_size, **kwargs)
  File "C:\Python\Python39\lib\site-packages\easynmt\EasyNMT.py", line 278, in translate_sentences
    output.extend(self.translator.translate_sentences(sentences_sorted[start_idx:start_idx+batch_size], source_lang=source_lang, target_lang=target_lang, beam_size=beam_size, device=self.device, **kwargs))
  File "C:\Python\Python39\lib\site-packages\easynmt\models\OpusMT.py", line 40, in translate_sentences
    tokenizer, model = self.load_model(model_name)
  File "C:\Python\Python39\lib\site-packages\easynmt\models\OpusMT.py", line 22, in load_model
    tokenizer = MarianTokenizer.from_pretrained(model_name)
  File "C:\Python\Python39\lib\site-packages\transformers\tokenization_utils_base.py", line 1654, in from_pretrained
    fast_tokenizer_file = get_fast_tokenizer_file(
  File "C:\Python\Python39\lib\site-packages\transformers\tokenization_utils_base.py", line 3486, in get_fast_tokenizer_file
    all_files = get_list_of_files(
  File "C:\Python\Python39\lib\site-packages\transformers\file_utils.py", line 2103, in get_list_of_files
    return list_repo_files(path_or_repo, revision=revision, token=token)
  File "C:\Python\Python39\lib\site-packages\huggingface_hub\hf_api.py", line 601, in list_repo_files
    info = self.model_info(
  File "C:\Python\Python39\lib\site-packages\huggingface_hub\hf_api.py", line 584, in model_info
    r = requests.get(path, headers=headers, timeout=timeout)
  File "C:\Python\Python39\lib\site-packages\requests\api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Python\Python39\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Python\Python39\lib\site-packages\requests\sessions.py", line 529, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Python\Python39\lib\site-packages\requests\sessions.py", line 645, in send
    r = adapter.send(request, **kwargs)
  File "C:\Python\Python39\lib\site-packages\requests\adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /api/models/Helsinki-NLP/opus-mt-en-de (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001BD3E507760>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))

C:\Users\GIGABYTE>
  3. I think there is no way to reduce the 1.8 GB of data loaded into RAM, so I consider this question solved.
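
On question 3, since each language direction is its own model, a single direction can also be loaded directly with the transformers library. The sketch below is only an illustration under assumptions. The model name `Helsinki-NLP/opus-mt-en-de` is taken from the traceback above, and the naming pattern for other pairs is assumed rather than confirmed in this thread. The helpers `opus_mt_model_name` and `translate_single_direction` are hypothetical. `local_files_only=True` tells `from_pretrained` to use the local cache instead of contacting huggingface.co. This does not promise lower memory use than EasyNMT.

```python
from typing import List


def opus_mt_model_name(source_lang: str, target_lang: str) -> str:
    """Build a model name following the pattern seen in the traceback (assumed)."""
    return f"Helsinki-NLP/opus-mt-{source_lang}-{target_lang}"


def translate_single_direction(
    sentences: List[str], source_lang: str = "en", target_lang: str = "de"
) -> List[str]:
    """Hypothetical helper: load one cached direction and translate sentences."""
    # Imported here so the naming helper works without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    name = opus_mt_model_name(source_lang, target_lang)
    # local_files_only=True fails fast if the model is not in the local cache.
    tokenizer = MarianTokenizer.from_pretrained(name, local_files_only=True)
    model = MarianMTModel.from_pretrained(name, local_files_only=True)

    batch = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Unlike `model.translate` in EasyNMT, this sketch does no language detection or sentence splitting, so long documents would need to be split into sentences first.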