explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

Corpus loading hangs when running provided gpu config for transformer and ner components. #7258

Closed. coltonflowers1 closed this issue 3 years ago.

coltonflowers1 commented 3 years ago

How to reproduce the behaviour

I have been trying to train a blank spaCy pipeline with transformer and NER components.

First, I do the setup/installation for PyTorch and spaCy as described on the embeddings-and-transformers docs page, with CUDA 10.1.
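Roughly, that amounts to something like the following (the torch version below is only an example of a CUDA 10.1 build; the exact versions may differ):

pip install -U "spacy[cuda101,transformers]"
pip install torch==1.7.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html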

After exporting my Prodigy dataset of Facebook posts (each of which may contain more than one sentence) with the data-to-spacy recipe, I convert the resulting JSON files to .spacy files using:

python -m spacy convert "/dbfs/FileStore/train-data.json" "training"
python -m spacy convert "/dbfs/FileStore/eval-data.json" "training"

I then run the following spacy train command in verbose mode. The full_config.cfg was produced by taking the default base config for NER with GPU/transformers from the Quickstart widget and autofilling it with init fill-config (sketched after the config below).

python -m spacy train /dbfs/FileStore/full_config.cfg --paths.train training/train-data.spacy --paths.dev training/eval-data.spacy --gpu-id 0 --nlp.batch_size 64 -V

The command hangs after it has printed the "Loading corpus" messages:

2021-03-02 19:00:48.821033: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
Config overrides from CLI: ['paths.train', 'paths.dev', 'nlp.batch_size']
Set up nlp object from config
Loading corpus from path: training_gold/eval-data.spacy
Loading corpus from path: training_gold/train-data.spacy
Pipeline: ['transformer', 'ner']
Created vocabulary
Finished initializing nlp object
[W033] Training a new parser or NER using a model with no lexeme normalization table. This may degrade the performance of the model to some degree. If this is intentional or the language you're using doesn't have a normalization table, please ignore this warning. If this is surprising, make sure you have the spacy-lookups-data package installed. The languages with lexeme normalization tables are currently: da, de, el, en, id, lb, pt, ru, sr, ta, th
Initialized pipeline components: ['transformer', 'ner']
Loading corpus from path: training_gold/eval-data.spacy
Loading corpus from path: training_gold/train-data.spacy

The full_config.cfg referenced above is as follows:

[paths]
train = null
dev = null
vectors = null
init_tok2vec = null

[system]
gpu_allocator = "pytorch"
seed = 0

[nlp]
lang = "en"
pipeline = ["transformer","ner"]
batch_size = 128
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}

[components]

[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
pooling = {"@layers":"reduce_mean.v1"}
upstream = "*"

[components.transformer]
factory = "transformer"
max_batch_items = 4096
set_extra_annotations = {"@annotation_setters":"spacy-transformers.null_annotation_setter.v1"}

[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "roberta-base"

[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96

[components.transformer.model.tokenizer_config]
use_fast = true

[corpora]

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 500
gold_preproc = false
limit = 0
augmenter = null

[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null

[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
get_length = null

[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001

[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 0.00005

[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0

[pretraining]

[initialize]
vectors = null
init_tok2vec = ${paths.init_tok2vec}
vocab_data = null
lookups = null
before_init = null
after_init = null

[initialize.components]

[initialize.tokenizer]
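For reference, the config above was generated roughly like this, starting from the base config downloaded from the Quickstart widget (base_config.cfg is just a placeholder name for that file):

python -m spacy init fill-config base_config.cfg full_config.cfg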

Your Environment

coltonflowers1 commented 3 years ago

Thought I should also add that I am working in a Databricks notebook.

svlandeg commented 3 years ago

Hm, that's weird. How long have you waited while it was hanging?

The easiest way to start debugging this is to create a much smaller sample of your data: cut the (copied) JSONL files after a few examples, export them again to much smaller binary files, and try the training again with otherwise the exact same parameters (see the sketch below). I'd expect that would run through, with obviously bad accuracy, but never mind that for now.
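Something along these lines, assuming the exports are line-delimited JSONL (the file names are just placeholders for your exported data); afterwards, re-run the same spacy convert and spacy train commands against the smaller files:

head -n 20 train-data.jsonl > train-sample.jsonl
head -n 20 eval-data.jsonl > eval-sample.jsonl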

coltonflowers1 commented 3 years ago

Thank you for your reply @svlandeg. If I run the same command from a web terminal on the same OS, I don't get the hanging issue, so I think this may be a problem with Databricks notebooks not correctly fetching the results rather than a spaCy issue.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.