> We should adopt, in some form, their way of setting up pipelines, which mimics some of the spaCy API behavior.
Agreed. This is what @kylepjohnson sketched out: a spaCy-like `nlp` callable object that functions as the entry point to the pipeline. (Since we're in a multilingual context, we might suggest a convention where the name of the object is that of the language: `latin`, `oe`, etc.)
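A minimal sketch of what that callable entry point might look like. The `Doc` and `NLP` names, the `tokenize` placeholder process, and the per-language object convention are all illustrative, not a settled API:

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Doc:
    """Accumulates annotations as text moves through the pipeline."""
    text: str
    tokens: list = field(default_factory=list)

class NLP:
    """spaCy-like callable object: the entry point to the pipeline."""
    def __init__(self, language: str, processes: List[Callable[[Doc], Doc]]):
        self.language = language
        self.processes = processes

    def __call__(self, text: str) -> Doc:
        doc = Doc(text=text)
        for process in self.processes:
            doc = process(doc)
        return doc

def tokenize(doc: Doc) -> Doc:
    """Placeholder process: naive whitespace tokenization."""
    doc.tokens = doc.text.split()
    return doc

# Per-language convention: the object is named for the language.
latin = NLP(language="lat", processes=[tokenize])
doc = latin("Gallia est omnis divisa in partes tres")
print(doc.tokens)
```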
> we should not make StanfordNLP a direct dependency. We should create wrappers which may load and interact with their published PyTorch models,
I want to make sure I understand this. Are you proposing that we not make `stanfordnlp` the only library for building and running statistical models, and instead push the dependencies to the modules responsible for loading Stanford models for specific languages?
> While CoNLL is laudable, the languages under CLTK's purview typically have limited resources in the way of available corpora and standardized annotations. In our desire and long-term goal to achieve state of the art, it's important that we leave the door open to custom training data sets that won't be restricted to the CoNLL format.
I think I basically agree, but what if the problem is just one of writing converters from treebank X to CoNLL? I face this problem directly with OE. I've trained a parser using the small ISWOC treebank. Accuracy isn't great (though not awful), so I'd like to use the much larger YCOE. Unfortunately the latter uses its own idiosyncratic annotations and does not provide lemmas, which are required by the stanfordnlp trainer. POS tags too.
What to do? The more tractable path I'd think would be to convert the old YCOE treebank to CoNLL, run the corpus through a lemmatizer and POS tagger, and then train using the Stanford toolset. I suppose one could also try to unpeel the CoNLL bits, to feed the PyTorch models directly. More laborious still would be to rewrite the model structure of the Stanford NN in our own PyTorch (or even Keras!) representations. On balance though I'd probably start with translating the files and producing the additional, required representations -- an issue being that errors by the lemmatizer and POS tagger will end up contaminating the training file.
Let's see, what's my point :) ? Oh, just that it may be more practical to always (usually?) convert treebanks to CoNLL, in which case the dependency problem isn't particularly severe. I nevertheless agree that it smells better to not build in a hard dependency.
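To make the conversion idea concrete, here is a minimal sketch of the serialization half of such a converter. It assumes a hypothetical upstream step that has already parsed the treebank's native bracketing into per-token dicts; `"_"` is CoNLL-U's placeholder for fields (like LEMMA, which YCOE lacks) that the source does not supply:

```python
def to_conllu(sentences):
    """Serialize sentences (lists of per-token dicts) into CoNLL-U.

    Assumes a hypothetical upstream parser has already read the
    treebank's native annotation into dicts. "_" is the CoNLL-U
    placeholder for fields the source does not provide.
    """
    lines = []
    for sent in sentences:
        for i, tok in enumerate(sent, start=1):
            lines.append("\t".join([
                str(i),                   # ID
                tok["form"],              # FORM
                tok.get("lemma", "_"),    # LEMMA (missing in YCOE)
                tok.get("upos", "_"),     # UPOS
                tok.get("xpos", "_"),     # XPOS
                tok.get("feats", "_"),    # FEATS
                str(tok.get("head", 0)),  # HEAD (0 = root)
                tok.get("deprel", "_"),   # DEPREL
                "_",                      # DEPS
                "_",                      # MISC
            ]))
        lines.append("")  # blank line ends each sentence
    return "\n".join(lines) + "\n"
```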
Here's an annoying knot: I have trained an OE lemmatizer using `stanfordnlp`. As noted, the training required CoNLL files with complete annotations, including POS/morpho tags. Since the lemmatizer requires POS tags and tokens at run time, one also has to provide a tokenizer and POS tagger for the language, pre-trained using the same framework.
In other words, it's hard to not buy the whole kit and caboodle.
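For concreteness, this is roughly what that chain looks like with stanfordnlp's `Pipeline` API. The `ang` language code and models directory here are hypothetical, since there is no officially published Old English package; the models would have to be our own, trained with their toolkit:

```python
import stanfordnlp

# The lemma processor depends on pos, which depends on tokenize:
# this processor chain is why one ends up buying the whole kit.
# NB: 'ang' and './oe_models' are hypothetical; stanfordnlp ships
# no Old English models.
nlp = stanfordnlp.Pipeline(lang="ang",
                           processors="tokenize,pos,lemma",
                           models_dir="./oe_models")
doc = nlp("Hwæt! We Gardena in geardagum þeodcyninga þrym gefrunon")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos)
```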
> However we should not make StanfordNLP a direct dependency. We should create wrappers which may load and interact with their published PyTorch models, which we may allow users to download and install. PyTorch itself may reasonably become a first-class CLTK dependency.
With what I've done with the `NLP()` class, `stanfordnlp` is baked in and I wouldn't know how to pull it out. In principle, I appreciate being careful about hard dependencies. That said, two considerations tip me towards making this a hard dependency:
- Dependency parsing for Greek and Latin has been a much requested feature
- The majority of our users are still Classicists
@todd-cook I will keep an open mind about this one. More of a pain than the dependency, I feel, is instructing users on how to download the Stanford models.
Question for anyone: what is the best way to handle an optional dependency for us? Should we do a try/except ImportError and print or log a message?
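For the record, the usual pattern looks something like this: a sketch only, with the helper name and error message purely illustrative:

```python
# try/ImportError pattern: degrade gracefully when the optional
# dependency is absent, with a clear message on first use.
try:
    import stanfordnlp
except ImportError:
    stanfordnlp = None

def get_stanford_pipeline(lang: str):
    """Hypothetical helper; raises a helpful error if missing."""
    if stanfordnlp is None:
        raise ImportError(
            "This feature needs the optional 'stanfordnlp' package: "
            "pip install stanfordnlp"
        )
    return stanfordnlp.Pipeline(lang=lang)
```

The packaging-side complement is an `extras_require` entry in setup.py, so users can opt in with something like `pip install cltk[stanford]`.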
I meant that we should probably automate the import of models and downloads, as Stanford, Transformers, and Keras have done. As long as a user explicitly names the model, we can automate the fetch. I guess it's just a question of which ones we want to default to.
e.g.

```python
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')
```

which logs:

```
I1122 23:18:52.256044 4743282112 file_utils.py:296] https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased-config.json not found in cache or force_download set to True, downloading to /var/folders/h7/209qkzcn4sd8vyj_7gbs15nc0000gn/T/tmpseikqnca
```

and, with Keras:

```python
imdb_archive = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
data_archive = get_file(fname=filename, origin=imdb_archive,
                        cache_dir=file_cache, untar=False, extract=True)
```
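A sketch of what an equivalent download-on-demand helper could look like on the CLTK side; the base URL, cache directory, and model name are placeholders, not real endpoints:

```python
import os
import urllib.request

CACHE_DIR = os.path.expanduser("~/cltk_data/models")  # placeholder
BASE_URL = "https://example.org/cltk-models"          # hypothetical

def fetch_model(name: str) -> str:
    """Download a named model archive on first use, then cache it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, name)
    if not os.path.exists(path):
        urllib.request.urlretrieve(f"{BASE_URL}/{name}", path)
    return path

# Usage: the user names the model explicitly, as with Transformers.
# model_path = fetch_model("oe_lemmatizer.pt")
```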
> I meant that we should probably automate the import of models and downloads, as Stanford, Transformers, and Keras have done.
Understood. If this is the main concern, I will close this and make a new issue specifically for it.
Following up in https://github.com/cltk/cltkv1/issues/23
The use and possible inclusion of the StanfordNLP Python library (https://github.com/stanfordnlp/stanfordnlp) has been discussed elsewhere, e.g. here: https://github.com/cltk/cltk/issues/921
I've examined this library and would like to share my recommendations, reasons, and hunches.
Recommendations:
We should adopt, in some form, their way of setting up pipelines, which mimics some of the spaCy API behavior.
However we should not make StanfordNLP a direct dependency. We should create wrappers which may load and interact with their published PyTorch models, which we may allow users to download and install (see the sketch below). PyTorch itself may reasonably become a first-class CLTK dependency.
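As a sketch of the wrapper idea, with illustrative class and method names: the backend import happens lazily, inside the wrapper, so stanfordnlp stays optional while its published models remain usable:

```python
# Lazy-import wrapper: stanfordnlp is touched only when a user
# actually requests a Stanford-backed process, so it never has to
# be a hard install-time dependency. Names are illustrative.
class StanfordProcess:
    def __init__(self, lang: str):
        self.lang = lang
        self._pipeline = None  # built on first use

    @property
    def pipeline(self):
        if self._pipeline is None:
            import stanfordnlp  # optional dependency, imported lazily
            self._pipeline = stanfordnlp.Pipeline(lang=self.lang)
        return self._pipeline

    def run(self, text: str):
        return self.pipeline(text)
```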
Reasons: The basic approach of StanfordNLP Python is to generate general-purpose models from the standardized CoNLL input data format. This has allowed StanfordNLP to leverage a standard format and to encourage repeatable, independently verifiable model generation across a wide range of languages.
While CoNLL is laudable, the languages under CLTK's purview typically have limited resources in the way of available corpora and standardized annotations. In our desire and long-term goal to achieve state of the art, it's important that we leave the door open to custom training data sets that won't be restricted to the CoNLL format.