GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning
https://deepparse.org/
GNU Lesser General Public License v3.0

Missing file for offline use #215

Closed MarcOlbrich closed 8 months ago

MarcOlbrich commented 8 months ago

Hello, I want to use the deepparse model bpemb offline in a Databricks notebook with a mounted directory /dbfs/mnt/data/. I downloaded your model bpemb.ckpt from the given URL and put it in that directory. Here is the path to it:

/dbfs/mnt/data/bpemb.ckpt

Then I set the cache_dir variable to this directory path, as described in your docs, and call the address parser:

address_parser = AddressParser(model_type="bpemb", device="cpu", cache_dir="/dbfs/mnt/data/", offline=True)

I get an error that the file bpemb.version is missing (see the traceback below), but I cannot find such a file in your docs/descriptions.

Can you help me get this running? What should I do to run this model offline, loading the previously downloaded model from a specific directory? I also couldn't find a proper example for this use case in your docs. Kind regards, Marc


FileNotFoundError                         Traceback (most recent call last)
File , line 1
----> 1 address_parser = AddressParser(model_type="bpemb", device="cpu", cache_dir="/dbfs/mnt/data/", offline=True)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-8604e35f-af67-4454-9b6b-c7e865955f0d/lib/python3.10/site-packages/deepparse/parser/address_parser.py:285, in AddressParser.__init__(self, model_type, attention_mechanism, device, rounding, verbose, path_to_retrained_model, cache_dir, offline)
    282 self.named_parser = named_parser
    284 self.model_type, self._model_type_formatted = handle_model_name(model_type, attention_mechanism)
--> 285 self._setup_model(
    286     verbose=self.verbose,
    287     path_to_retrained_model=path_to_retrained_model,
    288     prediction_layer_len=self.tags_converter.dim,
    289     attention_mechanism=attention_mechanism,
    290     seq2seq_kwargs=seq2seq_kwargs,
    291     cache_dir=cache_dir,
    292     offline=offline,
    293 )
    294 self.model.eval()

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-8604e35f-af67-4454-9b6b-c7e865955f0d/lib/python3.10/site-packages/deepparse/parser/address_parser.py:1159, in AddressParser._setup_model(self, verbose, path_to_retrained_model, prediction_layer_len, attention_mechanism, seq2seq_kwargs, cache_dir, offline)
   1155 if cache_dir is None:
   1156     # Set to default cache_path value
   1157     cache_dir = CACHE_PATH
-> 1159 self.model = ModelFactory().create(
   1160     model_type=self.model_type,
   1161     cache_dir=cache_dir,
   1162     device=self.device,
   1163     output_size=prediction_layer_len,
   1164     attention_mechanism=attention_mechanism,
   1165     path_to_retrained_model=path_to_retrained_model,
   1166     offline=offline,
   1167     verbose=verbose,
   1168     **seq2seq_kwargs,
   1169 )
   1171 embeddings_model = EmbeddingsModelFactory().create(
   1172     embedding_model_type=self.model_type, cache_dir=cache_dir, verbose=verbose
   1173 )
   1174 vectorizer = VectorizerFactory().create(embeddings_model)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-8604e35f-af67-4454-9b6b-c7e865955f0d/lib/python3.10/site-packages/deepparse/network/model_factory.py:58, in ModelFactory.create(self, model_type, cache_dir, device, output_size, attention_mechanism, path_to_retrained_model, offline, verbose, seq2seq_kwargs)
     46 model = FastTextSeq2SeqModel(
     47     cache_dir=cache_dir,
     48     device=device,
    (...)
     54     seq2seq_kwargs,
     55 )
     57 elif "bpemb" in model_type:
---> 58     model = BPEmbSeq2SeqModel(
     59         cache_dir=cache_dir,
     60         device=device,
     61         output_size=output_size,
     62         verbose=verbose,
     63         path_to_retrained_model=path_to_retrained_model,
     64         attention_mechanism=attention_mechanism,
     65         offline=offline,
     66         **seq2seq_kwargs,
     67     )
     69 else:
     70     raise NotImplementedError(
     71         f"""
     72 There is no {model_type} network implemented. model_type should be either fasttext or bpemb
     73 """
     74     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-8604e35f-af67-4454-9b6b-c7e865955f0d/lib/python3.10/site-packages/deepparse/network/bpemb_seq2seq.py:75, in BPEmbSeq2SeqModel.__init__(self, cache_dir, device, input_size, encoder_hidden_size, encoder_num_layers, decoder_hidden_size, decoder_num_layers, output_size, attention_mechanism, verbose, path_to_retrained_model, pre_trained_weights, offline)
     72 elif pre_trained_weights:
     73     # Means we use the pretrained weights
     74     self._load_pre_trained_weights(model_weights_name, cache_dir=cache_dir, offline=offline)
---> 75     version = self._load_version(model_type=model_weights_name, cache_dir=cache_dir)
     76 else:
     77     version = ""

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-8604e35f-af67-4454-9b6b-c7e865955f0d/lib/python3.10/site-packages/deepparse/network/seq2seq.py:147, in Seq2SeqModel._load_version(model_type, cache_dir)
    133 @staticmethod
    134 def _load_version(model_type: str, cache_dir: str) -> str:
    135     """
    136     Method to load the local hashed version of the model as an attribute.
    (...)
    145
    146     """
--> 147     with open(os.path.join(cache_dir, model_type + ".version"), encoding="utf-8") as local_model_hash_file:
    148         return local_model_hash_file.readline().strip()

FileNotFoundError: [Errno 2] No such file or directory: '/dbfs/mnt/data/bpemb.version'
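
Note: the bottom frame of the traceback shows what offline loading expects. Seq2SeqModel._load_version opens cache_dir/bpemb.version next to the checkpoint, so both files must be present in cache_dir. A minimal pre-flight check along those lines (the file names are taken from the question and the traceback; everything else is purely illustrative):

import os

cache_dir = "/dbfs/mnt/data/"
# Mirror the open() call in Seq2SeqModel._load_version: the weights file and the
# matching .version hash file must both sit in cache_dir for offline=True to work.
for required_file in ("bpemb.ckpt", "bpemb.version"):
    path = os.path.join(cache_dir, required_file)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"offline loading needs {path}")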

github-actions[bot] commented 8 months ago

Thank you for your interest in improving Deepparse.

MarcOlbrich commented 8 months ago

I found the answer in the already closed issues. To run the address parser locally and offline, I need the multi directory as well as the files bpemb.ckpt and bpemb.version, which can be found in the local cache directory C:/Users/<username>/.cache/deepparse after the model has been successfully initialized once using the default download. Copying these files into a desired directory (e.g. .../myrepo/deepparse_model) and calling address_parser = AddressParser(model_type="bpemb", device="cpu", cache_dir="/myrepo/deepparse_model", offline=True) solves the problem.
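
A minimal end-to-end sketch of that resolution follows. The required artifacts (the multi directory, bpemb.ckpt and bpemb.version) come from the comment above; the destination path is just an example and the multi directory is presumed to hold the BPEmb embeddings:

import shutil
from pathlib import Path

from deepparse.parser import AddressParser

# 1. On a machine with internet access, instantiate the parser once so deepparse
#    fills its default cache (~/.cache/deepparse), then copy the artifacts over.
AddressParser(model_type="bpemb", device="cpu")  # downloads into the default cache if missing
default_cache = Path.home() / ".cache" / "deepparse"
offline_cache = Path("/myrepo/deepparse_model")
offline_cache.mkdir(parents=True, exist_ok=True)
shutil.copytree(default_cache / "multi", offline_cache / "multi", dirs_exist_ok=True)
for file_name in ("bpemb.ckpt", "bpemb.version"):
    shutil.copy2(default_cache / file_name, offline_cache / file_name)

# 2. On the offline machine (e.g. the Databricks notebook), point cache_dir at the copy.
address_parser = AddressParser(model_type="bpemb", device="cpu", cache_dir=str(offline_cache), offline=True)
print(address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6"))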

davebulaval commented 8 months ago

We also offer a CLI with options to download the models to a specific place for offline parsing.
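
See the deepparse CLI documentation for the exact download command and flags. As a Python-level alternative that relies only on the cache_dir argument already used in this thread (which, per the docs, also controls where the embeddings are downloaded), the cache can be pre-populated on a connected machine and reused offline; the target path below is just an example:

from deepparse.parser import AddressParser

# With internet access: pointing cache_dir at the target directory makes deepparse
# download the bpemb checkpoint, its .version file and the embeddings there
# instead of the default ~/.cache/deepparse.
AddressParser(model_type="bpemb", device="cpu", cache_dir="/myrepo/deepparse_model")

# Without internet access: reuse the same directory.
AddressParser(model_type="bpemb", device="cpu", cache_dir="/myrepo/deepparse_model", offline=True)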