GRAAL-Research / deepparse

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning
https://deepparse.org/
GNU Lesser General Public License v3.0

Question about a training problem #226

Closed jarkkojarvinen closed 4 months ago

jarkkojarvinen commented 5 months ago

I am trying to retrain a pre-trained model (bpemb with attention) with a custom dataset, but training has suddenly stopped working. It just stops at a random point without any errors. I have successfully retrained with deepparse before.

I now get only the following output, and no errors occur. The logging folder "best_checkpoints" contains nothing other than log.tsv and a plots folder. log.tsv is empty and the plots in the folder contain no data. The number of processed steps varies between runs, but there are still no results. I am running on Azure Databricks with a single-node cluster that has 56 GB RAM and 16 cores, so capacity should be fine. The same issue also occurs on my laptop. I am using deepparse==0.9.9 with Python 3.10.12.

Loading the embeddings model
Starting training and training will be performed with batch size 32, 8 epochs and lr 0.001

Epoch: 1/8 Step:    1/7705   0.01% |                    |ETA: 3h49m37.76s loss: 78.376259 accuracy: 56.737591
Epoch: 1/8 Step:    2/7705   0.03% |                    |ETA: 3h0m59.66s loss: 79.313446 accuracy: 52.475250 
Epoch: 1/8 Step:    3/7705   0.04% |                    |ETA: 2h39m19.91s loss: 75.118568 accuracy: 54.045307
...<redacted>...
Epoch: 1/8 Step: 5098/7705  66.16% |█████████████▏      |ETA: 44m52.66s loss: 0.118803 accuracy: 99.662163
Epoch: 1/8 Step: 5099/7705  66.18% |█████████████▏      |ETA: 44m51.68s loss: 0.129328 accuracy: 99.641579
Epoch: 1/8 Step: 5100/7705  66.19% |█████████████▏      |ETA: 44m50.79s loss: 0.118927 accuracy: 99.331108

I even simplified my code as much as possible to make it easy to debug and evaluate:

from datetime import date
from deepparse.parser import AddressParser
from deepparse.dataset_container import CSVDatasetContainer
from deepparse import download_model

num_workers = 8

base_path = "/dbfs/FileStore"
cache_dir = f"{base_path}/data/annotation/cache"
# download_model("bpemb-attention", cache_dir)

attention_mechanism = True
model_type = "best"  # = bpemb
device = "cpu"
address_parser = AddressParser(
    model_type=model_type,
    device=device,
    attention_mechanism=attention_mechanism,
    cache_dir=cache_dir,
    offline=True,
)

train_dataset_filename = f"{base_path}/data/annotation/training/train/train.csv"
training_container = CSVDatasetContainer(
    train_dataset_filename,
    column_names=["Address", "Tags"],
    separator=",",
)

version = date.today()
logging_path = f"{base_path}/data/annotation/training/{version}/{model_type}_checkpoints"

epochs = 8
train_ratio = 0.8
batch_size = 32
learning_rate = 0.001

print(f"Starting training and training will be performed with batch size {batch_size}, {epochs} epochs and lr {learning_rate}")

address_parser.retrain(
    training_container,
    train_ratio=train_ratio,
    epochs=epochs,
    batch_size=batch_size,
    num_workers=num_workers,
    learning_rate=learning_rate,
    logging_path=logging_path,
)

The training and test data have been augmented with some additional information, since some of our cases contain "noise" in the addresses. Our training data CSV looks like the example below, where GeneralDelivery contains generated nonce data that looks "real". The validation dataset is similar. The training data contains 300k lines and the test data 50k lines.

Address,Tags
Tsasounakuja 130-137 83100 LIPERI Kanerva Tmi,"['StreetName', 'StreetNumber', 'PostalCode', 'Municipality', 'GeneralDelivery', 'GeneralDelivery']"

Do you have any ideas what to do or what might be wrong?

EDIT (28.5.2024):

I did another run on my local laptop and got results this morning. Now the retraining reaches epoch 2 but ends at 86% progress. The code and training material are the same, except that base_path=".." in the code:

Loading the embeddings model
Starting training and training will be performed with batch size 32, 8 epochs and lr 0.001
Epoch: 1/8 Train steps: 7705 Val steps: 1927 2h56m39.91s loss: 0.538323 accuracy: 99.180328 val_loss: 0.105898 val_accuracy: 99.756747
Epoch 1: val_loss improved from inf to 0.10590, saving file to ../data/annotation/training/2024-05-27/best_checkpoints/checkpoint_epoch_1.ckpt
Epoch: 2/8 Step: 6661/7705  86.45% |█████████████████▎  |ETA: 22m52.54s loss: 0.020454 accuracy: 100.000000%

and now the checkpoints folder contains the epoch_1 files and log.tsv contains a single line:

[image]
davebulaval commented 5 months ago

Can you share your dataset so I can test it?

david.beauchemin@baseline.quebec

davebulaval commented 5 months ago

Update:

I am currently training with your script, but using fasttext, since the download of BPEmb is still broken. I have written to the package's maintainer to speed up the fix.

jarkkojarvinen commented 5 months ago

Thanks. I also tried with fasttext and the same issue occurred.

davebulaval commented 5 months ago

Hmm. I got the same pattern. I also tried using a GPU and had the same problem. On my side, the problem occurs during validation, and it occurs at the same data point. We use a seed for reproducibility; thus, if the problem occurs at the same data point every time, it points to a problem in the dataset. Can you look into your dataset?

You can use our code base to create the dataloader with the default seed value (42). We have a protected method to create the dataloader (see here).

jarkkojarvinen commented 5 months ago

Ouch, I think I found the reason. Some new addresses with two spaces have been inserted into our address source data (National Land Survey), e.g. "Räyskälän  kantatie", and this was annotated as [StreetName, StreetName, StreetName]. There are a couple of similar addresses, and we didn't clean the source carefully. We stripped the strings but did not remove double spaces.

I believe that when training is done with deepparse, the extra space is stripped off and training then fails because the number of address tokens no longer matches the number of tags. I haven't yet confirmed that fixing this helps, but it seems very clear to me.
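
To make the suspected mismatch concrete, here is a minimal sketch. It assumes the annotation step split each address on a single space while the training side splits on any run of whitespace; neither behavior is confirmed from either code base in this thread.

```python
address = "Räyskälän  kantatie"  # two spaces between the words

# Splitting on exactly one space yields an empty token between the two spaces.
annotation_tokens = address.split(" ")

# Splitting on any run of whitespace drops that empty token.
training_tokens = address.split()

# One tag per annotation token, as in the mislabeled rows described above.
tags = ["StreetName"] * len(annotation_tokens)

print(annotation_tokens)  # ['Räyskälän', '', 'kantatie']
print(training_tokens)    # ['Räyskälän', 'kantatie']
print(len(tags), len(training_tokens))  # 3 2
```

Three tags against two tokens is the kind of mismatch described above.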

gokdumano commented 5 months ago

My team and I are also working on developing a custom address parser, and we faced the same issue. No warning, no error, no pattern in when it fails (it might be the first or the third epoch). The training process just stops randomly, regardless of the size of the data we use (10k, 1m, etc.). Sometimes we manage to complete this step, but this randomness is a thorn in our side.

We use an aggressive preprocessing pipeline, as shown below, to get rid of any invalid character. The training container did not raise any error before starting the process, and I suspect that mismatched tokens/tags are not the issue in our case.

import re

lower_cleaning_table = str.maketrans('CÇGĞIİOÖSŞUÜ', 'cçgğıioösşuü')

def lower_cleaning(address: str) -> str:
    return address.translate(lower_cleaning_table).lower()

def multi_whitespaces_cleaning(address: str) -> str:
    return re.sub(r'\s{2,}', ' ', address)

def trailing_whitespace_cleaning(address: str) -> str:
    return re.sub(r'(^\s{1,}|\s{1,}$)', '', address)

def invalid_char_cleaning(address: str) -> str:
    return re.sub(r'(\s*-\s*|\s*\+\s*|\s*&\s*|<null>|[^abcçdefgğhiıjklmnoöprsştuüvyzqwx 0-9]{1,})', '', address)

def cleaning(address: str) -> str:
    return invalid_char_cleaning(
        trailing_whitespace_cleaning(
            multi_whitespaces_cleaning(
                lower_cleaning(
                    address
                    ))))
jarkkojarvinen commented 5 months ago

I created a simple validation tool to validate a CSV dataset. In my case it highlighted the issues and I could easily fix them. Feel free to use it.

import ast
import pandas as pd
from tqdm import tqdm

tqdm.pandas()

def validate_address_tags(address, tags):
    """
    Validates that the number of tags matches the number of words in the address.

    Parameters:
    address (str): The address string.
    tags (str): The tags string.

    Returns:
    bool: True if the number of tags matches the number of words, False otherwise.
    """
    try:
        tags_list = ast.literal_eval(
            tags
        )  # Convert string representation of list to actual list
    except (ValueError, SyntaxError) as e:
        print(f"Invalid format for tags: {tags}. Error: {e}")
        return False

    address_parts = address.split()
    return len(address_parts) == len(tags_list)

def validate_dataset(file_path):
    """
    Validates the dataset by checking if the number of tags matches the number of words in the address.

    Parameters:
    file_path (str): The path to the CSV file.
    """
    df = pd.read_csv(file_path)

    # Apply validation to each row and store results in a new column
    df["is_valid"] = df.progress_apply(
        lambda row: validate_address_tags(row["Address"], row["Tags"]), axis=1
    )

    # Identify rows that failed validation
    invalid_rows = df[~df["is_valid"]]

    if invalid_rows.empty:
        print("All rows are valid!")
    else:
        print("Some rows have validation issues:")
        for index, row in invalid_rows.iterrows():
            print(
                f"Validation failed at index {index}: Address - '{row['Address']}' | Tags - {row['Tags']}"
            )

# Usage
validate_dataset(csv_train_filename)
validate_dataset(csv_test_filename)
davebulaval commented 5 months ago

We already applied this sort of validation, but we did not have a check for double whitespace. I will add this verification for future users.
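
A check of that kind could look like the sketch below. This is only an illustration, not deepparse's actual implementation; the function name `find_whitespace_issues` and the sample addresses are made up for the example.

```python
import re

# Two or more whitespace characters in a row, or whitespace at either end.
_SUSPICIOUS_WS = re.compile(r"\s{2,}|^\s|\s$")


def find_whitespace_issues(addresses):
    """Return the indices of addresses with leading, trailing, or repeated whitespace."""
    return [
        index
        for index, address in enumerate(addresses)
        if _SUSPICIOUS_WS.search(address)
    ]


sample = [
    "Tsasounakuja 130-137 83100 LIPERI",  # clean
    "Räyskälän  kantatie",                # double space
    " Some Street 1",                     # leading space
]
print(find_whitespace_issues(sample))  # [1, 2]
```

Running this over the address column before annotation would flag the rows to fix, instead of letting them fail later during training.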

davebulaval commented 5 months ago

Can you try the training again, but with verbose=True?

What is your poutyne version?

I remember that at one point, we had many problems with warnings, and we installed a context manager that captured them. I think the context manager is blocking the errors.

gokdumano commented 5 months ago

I ran the code with verbose=True as you suggested; still no errors or warnings.

## main.py
address_parser = AddressParser(model_type='bpemb', device=0)

address_parser.retrain(
         dataContainer
        ,verbose=True
        ,num_workers=8
        ,batch_size=254
        ,train_ratio=1/3
        ,epochs=8
        ,learning_rate=0.001
        ,prediction_tags=tags
        ,logging_path=logging_path
        ,name_of_the_retrain_parser='CustomParser'
     )

## conda environment (deepparse)
(deepparse) E:\Workbench\customparser>python main.py
Loading the embeddings model
Epoch: 1/8 Step: 246/263  93.54% |██████████████████▋ |ETA: 8.82s val_loss: 11.644591 val_accuracy: 45.693340
(deepparse) E:\Workbench\customparser>

And for your second question

(deepparse) E:\Workbench\customparser>python -c "import poutyne, deepparse; print(f'{poutyne.__version__=}, {deepparse.__version__=}')"
poutyne.__version__='1.17.1', deepparse.__version__='0.9.9'
jarkkojarvinen commented 5 months ago

I confirm that training is working again! Thank you for the support. I used the script I shared earlier and fixed the tag mismatches caused by extra spaces in our address sources, which I used to generate the annotated training material; e.g. "Some  Street" was annotated StreetName, StreetName, StreetName. Double spaces should be removed before annotation. After fixing these entries, training was successful. The training took 25 hours on my Databricks machine and the model results were good.

davebulaval commented 5 months ago

The next release will include a validation for this.

davebulaval commented 5 months ago

Can you test with the installation of Deepparse in the branch [remove_context_manager](https://github.com/GRAAL-Research/deepparse/tree/remove_context_manager)?

To install it, do pip install -U git+https://github.com/GRAAL-Research/deepparse.git@remove_context_manager

gokdumano commented 5 months ago

I installed the branch you mentioned:

(deepparse) E:\Workbench\deepparse>python -c "import poutyne, deepparse; print(f'{poutyne.__version__=}, {deepparse.__version__=}')"
poutyne.__version__='1.17.1', deepparse.__version__='0.9.9.dev1+63e120e'

Unfortunately the script stops without any error again

(deepparse) E:\Workbench\deepparse>python gendata.py
Loading the embeddings model

Epoch: 1/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 9.646952 accuracy: 62.782402
E:\Workbench\deepparse\deepparse\Lib\site-packages\google_crc32c\__init__.py:29: RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
  warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)
<warnings...>
Epoch: 1/8 Step:    1/6860   0.01% |                    |ETA: 5d9h53m28.24s val_loss: 10.187529 val_accuracy: 59.7909
Epoch: 1/8 Step:    2/6860   0.03% |                    |ETA: 1d19h35m28.10s val_loss: 9.240340 val_accuracy: 62.8254
Epoch: 1/8 Step:    3/6860   0.04% |                    |ETA: 22h2m38.10s val_loss: 9.925394 val_accuracy: 61.285095
Epoch: 1/8 Step:    4/6860   0.06% |                    |ETA: 13h25m11.12s val_loss: 10.425822 val_accuracy: 62.96918
Epoch: 1/8 Step:    5/6860   0.07% |                    |ETA: 9h5m31.91s val_loss: 10.256552 val_accuracy: 62.049858
Epoch: 1/8 Train steps: 3430 Val steps: 6860 1h11m21.12s loss: 10.708495 accuracy: 59.209902 val_loss: 9.351922 val_accuracy: 61.946527
Epoch 1: val_loss improved from inf to 9.35192, saving file to E:/Workbench/deepparse/checkpoint_shuffle_all\checkpoint_epoch_1.ckpt
<warnings...>
Epoch: 2/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 5.709274 accuracy: 77.497063
<warnings...>
Epoch: 2/8 Step:    1/6860   0.01% |                    |ETA: 6d5h15m16.45s val_loss: 9.081129 val_accuracy: 63.80637
Epoch: 2/8 Step:    2/6860   0.03% |                    |ETA: 2d2h3m53.11s val_loss: 8.360647 val_accuracy: 66.869804
Epoch: 2/8 Step:    3/6860   0.04% |                    |ETA: 1d1h15m24.93s val_loss: 7.999210 val_accuracy: 65.76673
Epoch: 2/8 Step:    4/6860   0.06% |                    |ETA: 15h20m37.99s val_loss: 9.357580 val_accuracy: 67.338936
Epoch: 2/8 Train steps: 3430 Val steps: 6860 1h13m1.49s loss: 7.740229 accuracy: 69.344001 val_loss: 8.360033 val_accuracy: 66.118986
Epoch 2: val_loss improved from 9.35192 to 8.36003, saving file to E:/Workbench/deepparse/checkpoint_shuffle_all\checkpoint_epoch_2.ckpt
<warnings...>
Epoch: 3/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 4.959105 accuracy: 77.758217
<warnings...>
Epoch: 3/8 Step:    1/6860   0.01% |                    |ETA: 5d18h35m52.35s val_loss: 7.649683 val_accuracy: 66.7216
Epoch: 3/8 Step:    2/6860   0.03% |                    |ETA: 1d22h31m34.06s val_loss: 7.259346 val_accuracy: 70.7479
Epoch: 3/8 Step:    3/6860   0.04% |                    |ETA: 23h30m37.57s val_loss: 7.096193 val_accuracy: 68.358528
Epoch: 3/8 Train steps: 3430 Val steps: 6860 1h12m34.38s loss: 6.794704 accuracy: 72.475230 val_loss: 7.592425 val_accuracy: 68.231226
Epoch 3: val_loss improved from 8.36003 to 7.59243, saving file to E:/Workbench/deepparse/checkpoint_shuffle_all\checkpoint_epoch_3.ckpt
<warnings...>
Epoch: 4/8 Step: 3430/3430 100.00% |████████████████████|ETA: 0.00s loss: 4.774173 accuracy: 80.708893
<warnings...>
Epoch: 4/8 Step:    2/6860   0.03% |                    |ETA: 1d17h58m18.31s val_loss: 6.849178 val_accuracy: 71.6343
Epoch: 4/8 Step:    3/6860   0.04% |                    |ETA: 21h12m10.24s val_loss: 6.886920 val_accuracy: 68.790497
Epoch: 4/8 Step: 2113/6860  30.80% |██████▏             |ETA: 17m40.76s val_loss: 8.546698 val_accuracy: 69.135132
(deepparse) E:\Workbench\deepparse>
davebulaval commented 5 months ago

@gokdumano Is this warning something regular?

RuntimeWarning: As the c extension couldn't be imported, `google-crc32c` is using a pure python implementation that is significantly slower. If possible, please configure a c build environment and compile the extension
  warnings.warn(_SLOW_CRC32C_WARNING, RuntimeWarning)

How much RAM and how many CPU cores do you have? Do you monitor RAM usage?

gokdumano commented 5 months ago

With about 2.6m records, we observed that RAM usage could go up to 100%. We halved the number of records and then got consecutive-whitespace errors 👌 After adjusting our pipeline, the training process went smoothly 🎉

thanks a lot @davebulaval & @jarkkojarvinen
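
The adjusted pipeline is not shown in the thread. One way consecutive whitespace can survive a pipeline like the one posted earlier, offered here as an assumption rather than a diagnosis, is that deleting characters after whitespace has already been collapsed can leave two spaces side by side. The sketch below uses simplified, hypothetical helper names and a reduced character filter to show the effect of the ordering.

```python
import re


def collapse_whitespace(address: str) -> str:
    """Replace every run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", address).strip()


def drop_invalid_chars(address: str) -> str:
    """Simplified stand-in for a character filter: keep a-z, digits, and spaces."""
    return re.sub(r"[^a-z0-9 ]+", "", address)


raw = "main st , no 5"

# Collapsing first: removing the comma afterwards leaves its two neighboring spaces.
collapse_first = drop_invalid_chars(collapse_whitespace(raw))

# Collapsing last: any gap left by a removed character is cleaned up at the end.
collapse_last = collapse_whitespace(drop_invalid_chars(raw))

print(repr(collapse_first))  # 'main st  no 5'
print(repr(collapse_last))   # 'main st no 5'
```

Running the whitespace normalization as the final step avoids this particular problem regardless of what the earlier steps remove.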

davebulaval commented 5 months ago

That is what I thought: out-of-memory (OOM) errors are not well captured by Python.
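
Since such a failure leaves no traceback, periodically logging memory use during a run can at least show whether it climbs toward the limit before the process disappears. The following is a rough sketch using only the standard library; the helper names are made up for the example. It assumes a Unix system (the `resource` module is not available on Windows, where a third-party package such as psutil would be needed instead) and it only tracks the main process, not any data loader worker processes.

```python
import resource
import sys
import threading
import time


def peak_rss_mib() -> float:
    """Peak resident set size of the current process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports kilobytes, macOS reports bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def start_memory_logger(interval_s: float = 30.0) -> threading.Event:
    """Print the peak RSS every `interval_s` seconds until the returned event is set."""
    stop = threading.Event()

    def _loop() -> None:
        # Event.wait returns False on timeout, so the loop runs until stop is set.
        while not stop.wait(interval_s):
            print(f"[mem] peak RSS: {peak_rss_mib():.0f} MiB", flush=True)

    threading.Thread(target=_loop, daemon=True).start()
    return stop


stop_logging = start_memory_logger(interval_s=0.01)
time.sleep(0.05)  # stand-in for the long-running training call
stop_logging.set()
```

In a real run, the `time.sleep` line would be replaced by the training call, with a longer interval such as the 30-second default.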

I'm glad we could resolve this issue for both of you.

I have also added the argument data_cleaning_pre_processing_fn to the DatasetContainer interface. The cleaning function is applied before validation.

@gokdumano, would you be willing to open a PR to implement a default data_cleaning_pre_processing_fn that applies to any loading? That way, such a problem can be avoided in the future.