curiosity-ai / catalyst

🚀 Catalyst is a C# Natural Language Processing library built for speed. Inspired by spaCy's design, it brings pre-trained models, out-of-the-box support for training word and document embeddings, and flexible entity recognition models.
MIT License

Add a quick Dependency Parsing example to the readme. #44

Closed cdibbs closed 2 years ago

cdibbs commented 3 years ago

Is your feature request related to a problem? Please describe.

I am having trouble figuring out how dependency parsing works. I found the AveragePerceptronDependencyParser and added it to the NLP pipeline after instantiating it with FromStoreAsync(Language.English, Version.Latest, "") but I don't know how to access its output. In particular, the DependencyType property on IToken looked promising, but always seemed to be the empty string.

Describe the solution you'd like

It would be nice to see a couple of quick examples of how to work with it, such as extracting the root verb, subject, and object of a sentence.

Describe alternatives you've considered

Additional context

Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var nlp = await Pipeline.ForAsync(Language.English);
var doc = new Document("The quick brown fox jumps over the lazy dog", Language.English);
nlp.Add(await AveragePerceptronDependencyParser.FromStoreAsync(Language.English, Version.Latest, ""));
nlp.ProcessSingle(doc);

Thanks for your work on what looks like a very promising library!

ProductiveRage commented 3 years ago

This might be more simplistic than what you're after, given that you're looking at the AveragePerceptronDependencyParser and want to extract a single root verb, but you can tag the tokens in a document with their part-of-speech type like this:

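// Models are fetched from the online repository on demand and cached on disk under "catalyst-models"
Storage.Current = new OnlineRepositoryStorage(new DiskStorage("catalyst-models"));
var document = new Document("The quick brown fox jumps over the lazy dog", Language.English);
var nlp = await Pipeline.ForAsync(Language.English);
// Tokenizes the text and tags each token with its part of speech
nlp.ProcessSingle(document);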
foreach (var sentence in document)
{
    foreach (var word in sentence)
    {
        Console.WriteLine(word.POS + "\t" + word.Value);
    }
}

The output from the above code is this:

DET     The
ADJ     quick
ADJ     brown
NOUN    fox
VERB    jumps
ADP     over
DET     the
ADJ     lazy
NOUN    dog

(The PartOfSpeech enum values - DET, ADJ, etc. - match the standard abbreviations that you will see used elsewhere, such as in "Part of Speech Tagging" from the tutorial of another NLP library.)

cdibbs commented 3 years ago

@ProductiveRage I appreciate the well-written tips, but you are correct that I need that dependency structure to extract "dobj", "nsubj", and the like.

If this library doesn't quite support that yet, I could do some educated guessing with simpler sentences, in which earlier nouns and pronouns are more likely to be the subject. I'd rather not, though. Another option would be to finagle Python's spaCy library via Python.NET, but that sounds brittle at best.
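
For reference, a minimal sketch of that educated-guessing fallback, assuming each sentence is available as a list of (POS tag, text) pairs like the ones printed above. SubjectGuesser and GuessSubject are hypothetical names, not part of Catalyst, and this is a crude positional heuristic, not real dependency parsing:

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only; not part of the Catalyst API.
public static class SubjectGuesser
{
    // POS tags treated as possible subjects.
    private static readonly HashSet<string> NounLike =
        new HashSet<string> { "NOUN", "PROPN", "PRON" };

    // Returns the first noun-like token in the sentence, or null if a verb
    // is reached first (e.g. an imperative) or no noun-like token exists.
    public static string? GuessSubject(IReadOnlyList<(string Pos, string Value)> sentence)
    {
        foreach (var (pos, value) in sentence)
        {
            if (pos == "VERB") return null;            // verb came before any candidate
            if (NounLike.Contains(pos)) return value;  // earliest noun or pronoun wins
        }
        return null;
    }
}
```

With Catalyst, the pairs could presumably be built from word.POS.ToString() and word.Value inside the nested loop shown earlier. The guess will be wrong for many real sentences, such as passives, fronted clauses, or compound nouns, which is why proper dependency labels would be preferable.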

theolivenbaum commented 3 years ago

Hi @cdibbs -

Just checked quickly - it's strange, since the value is supposed to be copied back to the token using this method, which ends up calling the methods here to store the values in the Document data store.

So I stumbled on this line - and then the issue is obvious 🤦‍♂️

It seems like the code to train the dependency parser is just not finished, and it is not yet predicting the labels for dependency type. I'll add this to my backlog - but if you want to have a go at implementing the training, happy to get a PR with this!

cdibbs commented 3 years ago

@theolivenbaum I wouldn't mind giving it a try. I don't have much experience, though, and I am not sure where the training data is. Is that publicly available somewhere? Thanks!

ADD-eNavarro commented 2 years ago

Hello there! I was wondering if I could use Catalyst to get a Parse Tree and I found this issue. Did it go somewhere, or is it stuck in the pile of TODOs?

Thank you.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.