issue with UDPOS Dataset.

bentrevett / pytorch-pos-tagging

A tutorial on how to implement models for part-of-speech tagging using PyTorch and TorchText.

MIT License


issue with UDPOS Dataset. #9

Open apkbala107 opened 3 years ago

apkbala107 commented 3 years ago

train_data, valid_data, test_data = datasets.UDPOS.splits(fields)

This line raises the following error:

Traceback (most recent call last):
  File "/home/balamurugan/myresearch/lstm.py", line 27, in <module>
    train_data, valid_data, test_data = datasets.UDPOS.split(fields)
AttributeError: 'function' object has no attribute 'split'

I am using torchtext version 0.9.1.

Is this a version problem?

Which version are you using?

bentrevett commented 3 years ago

I need to update the tutorials for version 0.9+.

You need to replace:

from torchtext import data
from torchtext import datasets

with:

from torchtext.legacy import datasets
from torchtext.legacy import data

This should make them work for 0.9.1.
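For scripts that have to run on both pre-0.9 and 0.9.x installs, the import swap above can be wrapped in a small helper. This is only a sketch: `load_torchtext_legacy` is a hypothetical name, not part of torchtext, and it assumes the `Field`-based API lives either under `torchtext.legacy` (0.9.x) or at the top level (earlier versions).

```python
import importlib


def load_torchtext_legacy():
    """Return (data, datasets) exposing the Field-based API, or None.

    Hypothetical helper for illustration; not part of torchtext itself.
    """
    # Try the 0.9.x location first, then the pre-0.9 top-level location.
    for prefix in ("torchtext.legacy", "torchtext"):
        try:
            data = importlib.import_module(prefix + ".data")
            datasets = importlib.import_module(prefix + ".datasets")
        except ImportError:
            # This layout (or torchtext itself) is not installed.
            continue
        # Accept only a layout that still has the Field-based classes.
        if hasattr(data, "Field") and hasattr(datasets, "UDPOS"):
            return data, datasets
    return None


modules = load_torchtext_legacy()
if modules is None:
    print("No Field-based torchtext API found in this environment.")
else:
    data, datasets = modules
    print("Using", data.__name__, "and", datasets.__name__)
```

If the helper returns None, the installed torchtext (if any) most likely no longer ships the legacy namespace, and pinning a 0.9.x release is probably the simplest way to run the tutorials unchanged.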

apkbala107 commented 3 years ago

When will you update these tutorials?

The spaCy en model is not supported, so I am using nlp = spacy.load("en_core_web_sm") instead. If there is any option for using the en model, please suggest it.

One more thing: can you help me use the Universal Dependencies treebank for the Tamil language to implement this model?

bentrevett commented 3 years ago

I'll try and update the tutorials over the weekend.

The en model is equivalent to the en_core_web_sm, I believe. It's just that spaCy 3.0 changed the naming convention to make things more explicit.

apkbala107 commented 3 years ago

Please help me apply this BiLSTM to Tamil POS tagging.

Please give some guidance on how to pre-process the Tamil data in the Universal Dependencies treebank.

bentrevett commented 3 years ago

I don't have any experience with pre-processing Tamil specifically. However, you want to tokenize your training data into a sequence of words and a sequence of tags of the same length, so that tags[i] is the tag for words[i].

For example:

words = ['hello', 'my', 'name', 'is', 'ben']
tags = ['<greeting>', '<blank>', '<blank>', '<blank>', '<name>']

You then need to create a vocabulary for the words and a separate vocabulary for the tags, use the vocabularies to numericalize your data and then convert the numericalized data to tensors before passing it into your model.
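The vocabulary and numericalization steps described above can be sketched in plain Python. The helper names `build_vocab` and `numericalize` and the special tokens `<unk>` and `<pad>` are assumptions made for illustration; they are not taken from the tutorial's code, which relies on TorchText to do this work.

```python
def build_vocab(sequences, specials=()):
    """Map each distinct token to an integer index, special tokens first."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for seq in sequences:
        for tok in seq:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab


def numericalize(tokens, vocab, unk=None):
    """Convert tokens to indices; unseen tokens map to `unk` if one is given."""
    if unk is None:
        # No unknown token (e.g. for tags): every token must be in the vocab.
        return [vocab[tok] for tok in tokens]
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


# One training sentence: exactly one tag per word.
words = ['hello', 'my', 'name', 'is', 'ben']
tags = ['<greeting>', '<blank>', '<blank>', '<blank>', '<name>']
assert len(words) == len(tags)

# Separate vocabularies for words and tags.
word_vocab = build_vocab([words], specials=('<unk>', '<pad>'))
tag_vocab = build_vocab([tags], specials=('<pad>',))

word_ids = numericalize(words, word_vocab, unk='<unk>')
tag_ids = numericalize(tags, tag_vocab)

print(word_ids)  # [2, 3, 4, 5, 6]
print(tag_ids)   # [1, 2, 2, 2, 3]
```

With PyTorch installed, the final step would be something like torch.tensor(word_ids) and torch.tensor(tag_ids) before passing the batch to the model. The tag vocabulary has no unknown token here, mirroring the tutorial's choice of unk_token = None for the tag fields.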

apkbala107 commented 3 years ago

Thanks for your guidance. Do you have a recorded video for this code, so that I could understand it better? Could you please share your contact details?

apkbala107 commented 3 years ago

# set lower = True, which lowercases all of the text
TEXT = data.Field(lower = True)
# TorchText Fields initialize a default unknown token, <unk>, which we remove by setting unk_token = None
UD_TAGS = data.Field(unk_token = None)
PTB_TAGS = data.Field(unk_token = None)

fields = (("text", TEXT), ("udtags", UD_TAGS), ("ptbtags", PTB_TAGS))
# load data into fields
train_data, valid_data, test_data = datasets.UDPOS.splits(fields)

I don't understand what is happening in the code above. Please explain what it does.