chokkan / crfsuite

CRFsuite: a fast implementation of Conditional Random Fields (CRFs)
http://www.chokkan.org/software/crfsuite/

Size of Training Data. #25

Open akazer2 opened 10 years ago

akazer2 commented 10 years ago

I'm trying to train a model with a text file that is 42 GB in size. I have more than enough memory on my machine, but I seem to be getting a segmentation fault (core dump) while training. Any reason why this would happen?

My team and I have trained multiple models on smaller datasets on the same machine, so we are confident that crfsuite is set up correctly.

jndevanshu commented 9 years ago

Hi akazer2, were you able to resolve the issue? I was also trying to train a model on a text file (10 MB), but crfsuite gives a segmentation fault. Thanks in advance.

viveksck commented 8 years ago

Did anyone manage to resolve this?

usptact commented 8 years ago

The thing is that during training much more memory is required than just what is needed to hold your dataset in memory.

For datasets this big, I suggest using online algorithms. I found Vowpal Wabbit to be not only very versatile but also to scale very well, and yes, it handles sequence tagging the way CRFSuite does. I can show how to do sequence tagging with VW.

bratao commented 8 years ago

@usptact, could you please provide an example of sequence tagging in Vowpal Wabbit? What command line and input format?

usptact commented 8 years ago

The data format is similar to that of CRFSuite, except that spaces are used to separate features. VW also introduces feature spaces (namespaces). The following is a training example for sequence tagging in VW format (notice the empty line between the two sequences; I am using only one feature space, called "f"):

label1 |f f1 f2 f3
label2 |f f2 f3 f4
label3 |f f4 f5 f1

label2 |f f2 f4
label3 |f f1 f3
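
If the data is already in CRFSuite's tab-separated format (label first, then the attributes, with a blank line ending each sequence), a rough conversion sketch could look like the one below; train.crfsuite and train.feat are placeholder file names, and attributes containing spaces, colons or the "|" character would need extra escaping for VW:

# sketch: turn CRFSuite-style lines "label<TAB>attr1<TAB>attr2..." into VW lines "label |f attr1 attr2 ..."
awk -F'\t' '{
    if (NF == 0) { print ""; next }             # keep the blank line that ends a sequence
    printf "%s |f", $1                          # label, then the single feature space "f"
    for (i = 2; i <= NF; i++) printf " %s", $i  # copy the remaining fields as features
    printf "\n"
}' train.crfsuite > train.feat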

The sequence tagging model can be trained with this command:

# Notes on the options:
#   --passes           keep this small
#   --search_task      the task is sequence tagging
#   --search            number of possible labels
#   --named_labels    comma-separated list of string labels, if integer labels are not used
#   -b                   number of bits for feature hashing - more is better
#   --l2                 per-example regularization
#   -f                   store the model
#   --readable_model  store the model in a readable (text) format
vw  --data train.feat \
    --cache \
    --passes 10 \
    --search_task sequence \
    --search $NUM_LABELS \
    --search_rollin=policy \
    --search_rollout=none \
    --named_labels "$(< labels)" \
    -b 28 \
    --l2=1e-5 \
    --l1=1e-7 \
    -f $MODEL \
    --readable_model $MODEL.txt
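
To tag new data with the trained model, the same data format is used. This is only a minimal sketch under the options above; test.feat and predictions.txt are placeholder names:

# -t: test only (no learning), -i: load the trained model, -p: write predictions
vw  --data test.feat \
    -t \
    -i $MODEL \
    -p predictions.txt

predictions.txt should then contain the predicted label for each input line.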