Thank you for your interest in improving Deepparse.
About a URL as a path: it is not currently a feature, but I can definitely see its value. I will look at possible solutions to integrate such a feature with an S3 bucket.
If you retrain a model, I think using the same S3 bucket approach is interesting. If the feature is developed, it will be easy for you; all you will need to do is save the model after training. I guess boto3 with a predefined URL pattern would allow you to export to S3 and later update a .txt file with the latest URL, or something like that. (Here is a possible approach.)
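For illustration, here is a minimal, hedged sketch of that idea with boto3 (the bucket name, keys, and file names are hypothetical):

import boto3

# Hypothetical bucket and key; adjust to your setup.
BUCKET = "my-model-bucket"
KEY = "models/fasttext_retrained.ckpt"

s3 = boto3.client("s3")

# Upload the retrained checkpoint to S3.
s3.upload_file("fasttext_retrained.ckpt", BUCKET, KEY)

# Keep a small .txt file pointing to the latest model URI.
latest_uri = f"s3://{BUCKET}/{KEY}"
s3.put_object(Bucket=BUCKET, Key="models/latest.txt", Body=latest_uri.encode("utf-8"))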
I found a library that can help handle AWS S3 buckets. I have implemented a feature and only need to test whether it works properly. I will try to finish it today, but I can't promise it.
You can test the feature using this dev branch:
pip install -U git+https://github.com/GRAAL-Research/deepparse.git@aws_s3_uri
I also think it would be possible to use CloudPathLib to save a model directly into an S3-like bucket when retraining. I will look into it after your feedback on the prototype and when I have more time to test it.
Hi Dave,
Thanks for the prompt response. I did try the feature you developed using CloudPathLib, but I ended up with an error. Could you please help me with this?
Code:
!pip install cloudpathlib
from cloudpathlib import CloudPath
address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model=CloudPath("s3://s3_path/fasttext.ckpt"))
address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")
Error:
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/deepparse/parser/address_parser.py:240, in AddressParser.__init__(self, model_type, attention_mechanism, device, rounding, verbose, path_to_retrained_model, cache_dir, offline)
    237 seq2seq_kwargs = {}  # Empty for default settings
    239 if path_to_retrained_model is not None:
--> 240     if "s3://" in path_to_retrained_model:
    241         if CloudPath is None:
    242             raise ImportError(
    243                 "cloudpathlib needs to be installed to use a S3-like " "URI as path_to_retrained_model."
    244             )
TypeError: argument of type 'S3Path' is not iterable
I'm on vacation for the next two weeks. I will take a look at it in mid-May.
cloudpathlib maintainer here. @Gayathri2993, it may be the case that you need to additionally install the S3 dependencies, so instead of
!pip install cloudpathlib
it should be
!pip install cloudpathlib[s3]
@pjbull Thanks for the tip. I will take a look at this.
@Gayathri2993 Try this:
address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model="s3://s3_path/fasttext.ckpt")
The path does not need to be a CloudPath; we handle the conversion to it in the code base. I have added a catch for this kind of behaviour.
I have also added support for passing a CloudPath directly as the argument instead of a string.
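For context, a minimal sketch of what such an argument check could look like (an illustrative assumption, not the actual Deepparse source; resolve_model_path is a hypothetical helper):

from cloudpathlib import CloudPath

def resolve_model_path(path):
    # Accept a CloudPath instance directly (avoids the `in` check failing on a non-string).
    if isinstance(path, CloudPath):
        return path
    # Convert an "s3://" string URI into a CloudPath.
    if isinstance(path, str) and path.startswith("s3://"):
        return CloudPath(path)
    # Otherwise, assume a local file path.
    return path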
LMK if it works after updating using the branch. I've pushed some modifications.
Thank you guys for looking into this issue. I tried two methods to load the model directly from S3. Please find the details of the two methods below.
Method 1:
Code:
!pip install cloudpathlib[s3]
from cloudpathlib import CloudPath
address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model=CloudPath("s3://s3_path/fasttext.ckpt"))
address_parser("350 rue des Lilas Ouest Québec Québec G1L 1B6")
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/deepparse/parser/address_parser.py:225, in AddressParser.__init__(self, model_type, attention_mechanism, device, rounding, verbose, path_to_retrained_model, cache_dir, offline)
    222 seq2seq_kwargs = {}  # Empty for default settings
    224 if path_to_retrained_model is not None:
--> 225     checkpoint_weights = torch.load(path_to_retrained_model, map_location="cpu")
    226     if checkpoint_weights.get("model_type") is None:
    227         # Validate if we have the proper metadata, it has at least the parser model type
    228         # if no other thing have been modified.
    229         raise RuntimeError(
    230             "You are not using the proper retrained checkpoint. "
    231             "When we retrain an AddressParser, by default, we create a "
   (...)
    234             "See AddressParser.retrain for more details."
    235         )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:791, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
    788 if 'encoding' not in pickle_load_args.keys():
    789     pickle_load_args['encoding'] = 'utf-8'
--> 791 with _open_file_like(f, 'rb') as opened_file:
    792     if _is_zipfile(opened_file):
    793         # The zipfile reader is going to advance the current file position.
    794         # If we want to actually tail call to torch.jit.load, we need to
    795         # reset back to the original position.
    796         orig_position = opened_file.tell()

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:276, in _open_file_like(name_or_buffer, mode)
    274     return _open_buffer_writer(name_or_buffer)
    275 elif 'r' in mode:
--> 276     return _open_buffer_reader(name_or_buffer)
    277 else:
    278     raise RuntimeError(f"Expected 'r' or 'w' in mode but got {mode}")

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:261, in _open_buffer_reader.__init__(self, buffer)
    259 def __init__(self, buffer):
    260     super().__init__(buffer)
--> 261     _check_seekable(buffer)

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:357, in _check_seekable(f)
    355     return True
    356 except (io.UnsupportedOperation, AttributeError) as e:
--> 357     raise_err_msg(["seek", "tell"], e)
    358 return False

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:350, in _check_seekable.<locals>.raise_err_msg(patterns, e)
AttributeError: 'S3Path' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
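As the error message suggests, one hedged workaround (assuming a CloudPath with its pathlib-like read_bytes method) would be to pre-load the remote file into a seekable in-memory buffer before calling torch.load:

import io

import torch
from cloudpathlib import CloudPath

# Hypothetical S3 path; replace with your bucket and key.
uri = CloudPath("s3://s3_path/fasttext.ckpt")

# torch.load needs a seekable file-like object, so read the
# remote bytes into an in-memory buffer first.
buffer = io.BytesIO(uri.read_bytes())
checkpoint_weights = torch.load(buffer, map_location="cpu")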
Method 2: I tried the method below as well, but I got a 'no such file exists' error. I am quite certain that the model exists in S3 and that the file path is correct.
address_parser = AddressParser(model_type="fasttext", device=0, path_to_retrained_model="s3://s3_path/fasttext.ckpt")
Error:
File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:791, in load(f, map_location, pickle_module, weights_only, **pickle_load_args)
    788 if 'encoding' not in pickle_load_args.keys():
    789     pickle_load_args['encoding'] = 'utf-8'
--> 791 with _open_file_like(f, 'rb') as opened_file:
    792     if _is_zipfile(opened_file):
    793         # The zipfile reader is going to advance the current file position.
    794         # If we want to actually tail call to torch.jit.load, we need to
    795         # reset back to the original position.
    796         orig_position = opened_file.tell()

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:271, in _open_file_like(name_or_buffer, mode)
    269 def _open_file_like(name_or_buffer, mode):
    270     if _is_path(name_or_buffer):
--> 271         return _open_file(name_or_buffer, mode)
    272     else:
    273         if 'w' in mode:

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/torch/serialization.py:252, in _open_file.__init__(self, name, mode)
    251 def __init__(self, name, mode):
--> 252     super().__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 's3://
I also tried loading the model directly from the library in my SageMaker instance, but I get duplicated results every time I pass multiple addresses. Surprisingly, the same code works as expected in SageMaker Studio Lab (a free ML environment platform). Here's a snippet from the SageMaker instance. Can you please let me know why I am getting duplicated results here?
I will take a look next week; I have a busy week.
Uhmm, strange behaviour with the duplicates.
Can I get a restricted secret key to the S3 bucket? It will be easier for me to test. You can forward it to me using my email david.beauchemin@ift.ulaval.ca.
Dave, I am not sure if I can share a restricted secret key because I am using the S3 bucket for official purposes. Is there any other way to test this?
@Gayathri2993 No problem, it was more that I didn't want to take the time to set up all that in my personal account. Right now, I plan to look at it next Thursday since I have other deadlines the following week.
@davebulaval It would be super helpful if you could prioritise the duplicate records issue whenever you start working on this. Since I have millions of records to run, I would like to run Deepparse directly in AWS SageMaker.
Hi @davebulaval. I just wanted to kindly check in and see if there have been any updates regarding the issues we discussed. I understand that you have other deadlines this week, and I appreciate your time. When it's convenient for you, I would appreciate an update on the progress or any further steps that need to be taken.
Do the following to see if it works now:
pip install -U git+https://github.com/GRAAL-Research/deepparse.git@dev
uri = "s3://<path>/fasttext.ckpt"
address_parser = AddressParser(model_type="fasttext", path_to_retrained_model=uri)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
or
uri = CloudPath("s3://deepparse/fasttext.ckpt")
address_parser = AddressParser(model_type="fasttext", path_to_retrained_model=uri)
parse_address = address_parser("350 rue des Lilas Ouest Quebec city Quebec G1L 1B6")
We now support both approaches.
On my side, I was able to use a URI to download a model for parsing, and also to upload a model to a URI bucket directory on S3 after retraining.
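A hedged sketch of that retraining flow (the dataset path, bucket, and parameter values are assumptions; see the Deepparse retrain documentation for the exact API):

from deepparse.dataset_container import PickleDatasetContainer
from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext", device=0)

# Hypothetical local training dataset.
container = PickleDatasetContainer("./train_dataset.p")

# Log checkpoints directly to an S3 URI (hypothetical bucket).
address_parser.retrain(
    container,
    train_ratio=0.8,
    batch_size=32,
    epochs=1,
    logging_path="s3://my-bucket/checkpoints",
)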
I also tried loading the model directly from the library in my SageMaker instance, but I get duplicated results every time I pass multiple addresses. Surprisingly, the same code works as expected in SageMaker Studio Lab (a free ML environment platform). Here's a snippet from the SageMaker instance. Can you please let me know why I am getting duplicated results here?
For this, I have no idea how to investigate it or what the reason for this behaviour could be.
I found the problem. It is fixed in commit 9943f04 on the branch.
I have merged all the content into dev since the content seems to be ready for a release.
@davebulaval Thank you so much for your efforts. Both issues are resolved. The code is working in AWS SageMaker, and I am no longer getting duplicate records.
I just had a quick question: do you have any idea of the maximum number of records it can parse in one go?
Great!
If you use a GPU, it depends on your GPU memory size. From our experimentation with batch processing in Deepparse, the optimal batch size is around 256. Beyond that, it can take longer per address due to IO and data preparation. To alleviate that (we have not tested with more workers), you can increase `num_workers` to something like 4. More is not necessarily better.
If you do not use a GPU, it depends on the number of CPUs and the amount of RAM.
The fastest is with a GPU, of course.
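For example, a minimal sketch of batch parsing with those settings (the batch_size and num_workers keyword arguments are assumptions to verify against the parser's call signature in the docs):

from deepparse.parser import AddressParser

address_parser = AddressParser(model_type="fasttext", device=0)

# Example data: in practice, this would be your own list of addresses.
addresses = ["350 rue des Lilas Ouest Québec Québec G1L 1B6"] * 1000

# Batch size around 256 with a few workers, per the advice above.
parsed_addresses = address_parser(addresses, batch_size=256, num_workers=4)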
Released in 0.9.7 (#195).
What is the best way to deploy Deepparse in AWS SageMaker? I downloaded the model (fasttext) directly and stored it in S3, since I do not want to download the model every time I run it. I have given the S3 path as path_to_retrained_model, but somehow the function is not able to read the file from S3. The error says no such file exists. I am very positive that the file path is correct and that the model exists in S3. Is there something I am missing, or could you please let me know if there is an efficient way to deploy the model? I have millions of records to run this model on.