Closed jarkkojarvinen closed 4 months ago
Can you share your dataset so I can test it?
david.beauchemin@baseline.quebec
Update:
I am currently training using your script, but with fasttext, since the download of BPEmb is still broken. I have written to the package's maintainer to speed up the fix.
Thanks. I also tried with fasttext and the same issue occurred.
Hmm, I see the same pattern. I also tried using a GPU and hit the same problem. On my side, it occurs during validation, and always at the same data point. We use a seed for reproducibility, so a failure at the same data point on every run suggests a problem in the dataset. Can you look into your dataset?
You can use our code base to create the dataloader with the default seed value (42). We have a protected method to create the dataloader (see here).
Ouch, I think I found the reason. Our address source data (National Land Survey) now includes some new addresses with two consecutive spaces, e.g. "Räyskälän  kantatie", which was annotated [StreetName, StreetName, StreetName]. There are a couple of similar addresses, and we didn't clean the source carefully: the strings were stripped, but double spaces were not removed.
I believe that when training is done with deepparse, the extra space is stripped and training then fails because the number of address tokens no longer matches the number of tags. I haven't yet confirmed that fixing this helps, but it seems very clear to me.
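To illustrate the mismatch described above, here is a minimal sketch. It assumes (this is not confirmed in the thread) that the annotations were generated by splitting on a single space, while the training side splits on any run of whitespace. The variable names are illustrative only.

```python
address = "Räyskälän  kantatie"  # two spaces between the words

# Splitting on a single space keeps an empty token between the two spaces.
annotation_tokens = address.split(" ")

# Splitting on any run of whitespace collapses the double space.
training_tokens = address.split()

# One tag per annotation token, as in the example from the source data.
tags = ["StreetName"] * len(annotation_tokens)

print(len(annotation_tokens), len(training_tokens), len(tags))
```

Under that assumption, the address ends up with more tags than tokens, which is exactly the kind of row a token/tag count check would flag.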
My team and I are also working on developing a custom address parser, and we faced the same issue. No warning, no error, and no pattern in when it fails (it might be the first or the third epoch). The training process just stops randomly, regardless of the size of the data we use (10k, 1m, etc.). Sometimes we manage to complete this step, but the randomness is a thorn in our side.
We use an aggressive preprocessing pipeline, shown below, to get rid of any invalid character. The training container did not raise any error before starting the process, so I suspect that a token/tag mismatch is not the issue in our case.
import re

# Map Turkish uppercase letters explicitly (I -> ı, İ -> i) before calling lower().
lower_cleaning_table = str.maketrans('CÇGĞIİOÖSŞUÜ', 'cçgğıioösşuü')

def lower_cleaning(address: str) -> str:
    return address.translate(lower_cleaning_table).lower()

def multi_whitespaces_cleaning(address: str) -> str:
    # Collapse runs of two or more whitespace characters into one space.
    return re.sub(r'\s{2,}', ' ', address)

def trailing_whitespace_cleaning(address: str) -> str:
    # Remove leading and trailing whitespace.
    return re.sub(r'(^\s{1,}|\s{1,}$)', '', address)

def invalid_char_cleaning(address: str) -> str:
    # Remove -, +, & (with surrounding whitespace), "<null>", and any character outside the allowed set.
    return re.sub(r'(\s*-\s*|\s*\+\s*|\s*&\s*|<null>|[^abcçdefgğhiıjklmnoöprsştuüvyzqwx 0-9]{1,})', '', address)

def cleaning(address: str) -> str:
    return invalid_char_cleaning(
        trailing_whitespace_cleaning(
            multi_whitespaces_cleaning(
                lower_cleaning(address))))
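One thing worth checking in a pipeline like the one above: because invalid_char_cleaning runs last, removing a character that sits between two spaces can leave consecutive whitespace behind. The sketch below reuses the same regular expressions in condensed form with a made-up input string. The cleaning_fixed variant is only a suggestion, not the poster's code.

```python
import re

lower_cleaning_table = str.maketrans('CÇGĞIİOÖSŞUÜ', 'cçgğıioösşuü')
INVALID = re.compile(
    r'(\s*-\s*|\s*\+\s*|\s*&\s*|<null>|[^abcçdefgğhiıjklmnoöprsştuüvyzqwx 0-9]{1,})'
)

def cleaning(address: str) -> str:
    """Same order as the pipeline above: invalid characters are removed last."""
    address = address.translate(lower_cleaning_table).lower()
    address = re.sub(r'\s{2,}', ' ', address)
    address = re.sub(r'(^\s{1,}|\s{1,}$)', '', address)
    return INVALID.sub('', address)

def cleaning_fixed(address: str) -> str:
    """Hypothetical variant: remove invalid characters first, normalise whitespace last."""
    address = address.translate(lower_cleaning_table).lower()
    address = INVALID.sub('', address)
    return ' '.join(address.split())

sample = "Atatürk Cad . No 5"  # made-up address with a stray "." between spaces
print(repr(cleaning(sample)))        # the "." is removed, both surrounding spaces remain
print(repr(cleaning_fixed(sample)))  # whitespace is normalised after the removal
```

Running whitespace normalisation as the final step guarantees that nothing removed earlier can reintroduce double spaces.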
I created a simple validation tool to validate a CSV dataset. In my case, it highlighted the issues and I could easily fix them. Please feel free to use it.
import ast

import pandas as pd
from tqdm import tqdm

tqdm.pandas()


def validate_address_tags(address, tags):
    """
    Validates that the number of tags matches the number of words in the address.

    Parameters:
        address (str): The address string.
        tags (str): The tags string.

    Returns:
        bool: True if the number of tags matches the number of words, False otherwise.
    """
    try:
        # Convert string representation of list to actual list
        tags_list = ast.literal_eval(tags)
    except (ValueError, SyntaxError) as e:
        print(f"Invalid format for tags: {tags}. Error: {e}")
        return False
    address_parts = address.split()
    return len(address_parts) == len(tags_list)


def validate_dataset(file_path):
    """
    Validates the dataset by checking if the number of tags matches the number of words in the address.

    Parameters:
        file_path (str): The path to the CSV file.
    """
    df = pd.read_csv(file_path)
    # Apply validation to each row and store results in a new column
    df["is_valid"] = df.progress_apply(
        lambda row: validate_address_tags(row["Address"], row["Tags"]), axis=1
    )
    # Identify rows that failed validation
    invalid_rows = df[~df["is_valid"]]
    if invalid_rows.empty:
        print("All rows are valid!")
    else:
        print("Some rows have validation issues:")
        for index, row in invalid_rows.iterrows():
            print(
                f"Validation failed at index {index}: Address - '{row['Address']}' | Tags - {row['Tags']}"
            )


# Usage
validate_dataset(csv_train_filename)
validate_dataset(csv_test_filename)
We already applied this sort of validation, but we did not have any check for double whitespace. I will add this verification for future users.
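A check along those lines could look like the sketch below. It is not the validation that shipped in deepparse. The function name and the sample addresses are assumptions made for illustration.

```python
import re

# Two or more consecutive whitespace characters anywhere in the string.
_CONSECUTIVE_WS = re.compile(r"\s{2,}")


def find_whitespace_issues(addresses):
    """Return (position, address) pairs whose address has leading, trailing,
    or consecutive whitespace."""
    issues = []
    for position, address in enumerate(addresses):
        if address != address.strip() or _CONSECUTIVE_WS.search(address):
            issues.append((position, address))
    return issues


addresses = ["Some Street 1", "Some  Street 2", " Other Street 3"]
for position, address in find_whitespace_issues(addresses):
    print(f"Whitespace issue at position {position}: {address!r}")
```

With the CSV layout used by the script above, the same function could presumably be called on the Address column, for example find_whitespace_issues(df["Address"]). The reported positions would then be row positions rather than index labels.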
Can you try the training again, but with verbose=True?
What is your poutyne version?
I remember that at one point, we had many problems with warnings, and we installed a context manager that captured them. I think the context manager is blocking the errors.
I ran the code with verbose=True as you suggested; still no errors or warnings.
## main.py
address_parser = AddressParser(model_type='bpemb', device=0)
address_parser.retrain(
    dataContainer,
    verbose=True,
    num_workers=8,
    batch_size=254,
    train_ratio=1/3,
    epochs=8,
    learning_rate=0.001,
    prediction_tags=tags,
    logging_path=logging_path,
    name_of_the_retrain_parser='CustomParser',
)
## conda environment (deepparse)
(deepparse) E:\Workbench\customparser>python main.py
Loading the embeddings model
Epoch: 1/8 Step: 246/263 93.54% |██████████████████▋ |ETA: 8.82s val_loss: 11.644591 val_accuracy: 45.693340
(deepparse) E:\Workbench\customparser>
And for your second question
(deepparse) E:\Workbench\customparser>python -c "import poutyne, deepparse; print(f'{poutyne.__version__=}, {deepparse.__version__=}')"
poutyne.__version__='1.17.1', deepparse.__version__='0.9.9'
I confirm that training is working again! Thank you for the support.
I used the script I shared earlier and fixed the tag mismatches caused by extra spaces in our address sources, which I used to generate the annotated training material. For example, "Some  Street" was annotated StreetName, StreetName, StreetName. Double spaces should be removed before annotation.
After fixing these entries, training was successful. It took 25 hours on my Databricks machine, and the model results were good.
The next release will include a validation for this.
Can you test with Deepparse installed from the branch [remove_context_manager](https://github.com/GRAAL-Research/deepparse/tree/remove_context_manager)?
To install it, run pip install -U git+https://github.com/GRAAL-Research/deepparse.git@remove_context_manager
I installed the branch you mentioned:
(deepparse) E:\Workbench\deepparse>python -c "import poutyne, deepparse; print(f'{poutyne.__version__=}, {deepparse.__version__=}')"
poutyne.__version__='1.17.1', deepparse.__version__='0.9.9.dev1+63e120e'
Unfortunately, the script again stops without any error:
(deepparse) E:\Workbench\deepparse>python gendata.py
Loading the embeddings model
Epoch: 1/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 9.646952 accuracy: 62.782402
E:\Workbench\deepparse\deepparse\Lib\site-packages\google_crc32c\__init__.py:29: RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
<warnings...>
Epoch: 1/8 Step: 1/6860 0.01% | |ETA: 5d9h53m28.24s val_loss: 10.187529 val_accuracy: 59.7909
Epoch: 1/8 Step: 2/6860 0.03% | |ETA: 1d19h35m28.10s val_loss: 9.240340 val_accuracy: 62.8254
Epoch: 1/8 Step: 3/6860 0.04% | |ETA: 22h2m38.10s val_loss: 9.925394 val_accuracy: 61.285095
Epoch: 1/8 Step: 4/6860 0.06% | |ETA: 13h25m11.12s val_loss: 10.425822 val_accuracy: 62.96918
Epoch: 1/8 Step: 5/6860 0.07% | |ETA: 9h5m31.91s val_loss: 10.256552 val_accuracy: 62.049858
Epoch: 1/8 Train steps: 3430 Val steps: 6860 1h11m21.12s loss: 10.708495 accuracy: 59.209902 val_loss: 9.351922 val_accuracy: 61.946527
Epoch 1: val_loss improved from inf to 9.35192, saving file to E:/Workbench/deepparse/checkpoint_shuffle_all\checkpoint_epoch_1.ckpt
<warnings...>
Epoch: 2/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 5.709274 accuracy: 77.497063
<warnings...>
Epoch: 2/8 Step: 1/6860 0.01% | |ETA: 6d5h15m16.45s val_loss: 9.081129 val_accuracy: 63.80637
Epoch: 2/8 Step: 2/6860 0.03% | |ETA: 2d2h3m53.11s val_loss: 8.360647 val_accuracy: 66.869804
Epoch: 2/8 Step: 3/6860 0.04% | |ETA: 1d1h15m24.93s val_loss: 7.999210 val_accuracy: 65.76673
Epoch: 2/8 Step: 4/6860 0.06% | |ETA: 15h20m37.99s val_loss: 9.357580 val_accuracy: 67.338936
Epoch: 2/8 Train steps: 3430 Val steps: 6860 1h13m1.49s loss: 7.740229 accuracy: 69.344001 val_loss: 8.360033 val_accuracy: 66.118986
Epoch 2: val_loss improved from 9.35192 to 8.36003, saving file to E:/Workbench/deepparse/checkpoint_shuffle_all\checkpoint_epoch_2.ckpt
<warnings...>
Epoch: 3/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 4.959105 accuracy: 77.758217
<warnings...>
Epoch: 3/8 Step: 1/6860 0.01% | |ETA: 5d18h35m52.35s val_loss: 7.649683 val_accuracy: 66.7216
Epoch: 3/8 Step: 2/6860 0.03% | |ETA: 1d22h31m34.06s val_loss: 7.259346 val_accuracy: 70.7479
Epoch: 3/8 Step: 3/6860 0.04% | |ETA: 23h30m37.57s val_loss: 7.096193 val_accuracy: 68.358528
Epoch: 3/8 Train steps: 3430 Val steps: 6860 1h12m34.38s loss: 6.794704 accuracy: 72.475230 val_loss: 7.592425 val_accuracy: 68.231226
Epoch 3: val_loss improved from 8.36003 to 7.59243, saving file to E:/Workbench/deepparse/checkpoint_shuffle_all\checkpoint_epoch_3.ckpt
<warnings...>
Epoch: 4/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 4.774173 accuracy: 80.708893
<warnings...>
Epoch: 4/8 Step: 2/6860 0.03% | |ETA: 1d17h58m18.31s val_loss: 6.849178 val_accuracy: 71.6343
Epoch: 4/8 Step: 3/6860 0.04% | |ETA: 21h12m10.24s val_loss: 6.886920 val_accuracy: 68.790497
Epoch: 4/8 Step: 2113/6860 30.80% |██████▏ |ETA: 17m40.76s val_loss: 8.546698 val_accuracy: 69.135132
(deepparse) E:\Workbench\deepparse>
@gokdumano Is this warning something regular?
RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
How much RAM and CPU do you have? Do you monitor the RAM usage?
With about 2.6m records, we observed that RAM usage could go up to 100%. We halved the number of records and got consecutive-whitespace errors 👌 After adjusting our pipeline, the training process went smoothly 🎉
Thanks a lot @davebulaval & @jarkkojarvinen
That is what I thought: out-of-memory (OOM) errors are not well captured by Python.
I'm glad we could resolve this issue for both of you.
I have also added the argument data_cleaning_pre_processing_fn to the DatasetContainer interface. The cleaning function is applied before validation.
@gokdumano, would you be willing to open a PR to implement a default data_cleaning_pre_processing_fn that applies to any loading? That way, such problems can be avoided in the future.
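As a rough sketch of what such a default cleaning function might do, the snippet below strips each address and collapses runs of whitespace. The exact signature deepparse expects for data_cleaning_pre_processing_fn is not stated in this thread, so the (address, tags) pair format and the function name here are assumptions. Check the DatasetContainer documentation before relying on it.

```python
from typing import List, Tuple

# Assumed data format: one (address, tags) pair per training example.
AddressTagsPair = Tuple[str, List[str]]


def collapse_whitespace(data: List[AddressTagsPair]) -> List[AddressTagsPair]:
    """Strip each address and collapse runs of whitespace to a single space.
    Tags are passed through untouched."""
    return [(" ".join(address.split()), tags) for address, tags in data]


data = [("  Some  Street 1 ", ["StreetName", "StreetName", "StreetNumber"])]
print(collapse_whitespace(data))
```

Note that cleaning the address alone does not repair rows that were annotated with too many tags, as in the double-space case earlier in this thread. Such rows would still need to be caught by the token/tag count validation.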
I am trying to retrain a pre-trained model (bpemb with attention) with a custom data set, but suddenly training no longer works. It just stops randomly without any errors. I have previously retrained successfully with deepparse.
I now get only the following output, and no errors occur. The logging folder "best_checkpoints" contains nothing other than log.tsv and the plots folder. log.tsv is empty, and the plots in the folder have no data. The number of processed steps varies between runs, but there are still no results. I am using Azure Databricks with a single-node cluster that has 56 GB RAM and 16 cores, so capacity should be fine. The same issue also occurs on my laptop. I am using deepparse==0.9.9 with Python 3.10.12.
I even tried to simplify my code as much as possible to make it easy to debug and evaluate:
The training and test data have been augmented with some additional information, since some of our cases contain "noise" in the addresses. Our training data CSV is e.g. where GeneralDelivery contains some generated nonce data that looks "real". The validation data set is similar. The training data contains 300k lines and the test data 50k lines.
Do you have any ideas about what to do or what might be wrong?
EDIT (28.5.2024):
I did another run on my local laptop and got results this morning. Now the retraining reaches epoch 2 but ends after 86% progress. The code and training material are the same, except that the code now uses base_path="..", and the checkpoints folder now contains the epoch_1 files and log.tsv contains a single line: