Closed jackieair closed 2 years ago
Hi @jackieair
I don't believe we have LinearChainCrf in Spark or Spark NLP. However, for training new word segmentation models for languages that don't separate words with whitespace, such as Chinese, Korean, etc., we have an annotator called WordSegmenter.
I can see that we are missing a notebook that demonstrates how to use this annotator for training a word segmenter. We will add this notebook shortly.
@DevinTDHa Could you please add a new directory named chinese here, with a notebook that shows how to use WordSegmenterApproach for training Chinese word segmentation? https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training
Hi, thanks for your prompt reply.
I guess you've implemented LinearChainCrf already: https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/ml/crf
Can this version be used for word segmentation or POS-Tagging? Sorry I have little knowledge in this area, so the question may sound dumb.
We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, then yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well.) It cannot be used for word segmentation, though. That is a different task that would require changes in how the CRF works, and our model is designed for token classification at the moment.
But it can be a good idea to look into the existing CRF and see if we can use it for word segmentation, if there is a paper for this implementation.
@jackieair I don't know what happened; your comments disappeared! But they still exist in my reply.
Did you mean that I can use NerCrf for POS tagging?
If yes, then I can train a NerCrfModel for the POS task. I noticed NerCrf is based on LinearChainCrf.
The dataset I would use is backoff2005, so the major work is to convert backoff2005 to the required format. Do I understand correctly? Or do I only need to set up documentAssembler, tokenizer, posTagger, and embeddings like this: https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala
Hi, yes, I don't know why this happened. : )
Yes, since POS is like NER, you can use NerCrfApproach for training POS. There are examples of how to do so:
You need a dataset in CoNLL 2003 format. (You will need to convert your dataset to that format, which is the accepted format for most of the trainable annotators in Spark NLP.)
That repository has lots of examples. I highly recommend this part, which teaches you how to do most of the NLP tasks in Spark NLP (notebooks #1 to #16): https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
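For reference, here is a minimal Python sketch of the conversion step, under a few assumptions. It writes the standard four-column CoNLL 2003 layout (token, POS, chunk, NER) and puts the POS tag in the last column, which is the "pretend your POS tags are NER tags" idea from above. The function name `to_conll2003` and the `O` placeholder for the chunk column are hypothetical choices for illustration; whether the Spark NLP CoNLL reader needs different placeholder values is something to check against its docs.

```python
def to_conll2003(sentences, chunk_placeholder="O"):
    """Render sentences as CoNLL 2003 style text.

    `sentences` is a list of sentences, each a list of (token, pos_tag) pairs.
    The POS tag is written to both the POS column and the final (label) column,
    so a sequence labeller that learns the last column learns POS tags.
    """
    lines = ["-DOCSTART- -X- -X- O", ""]  # conventional document header
    for sentence in sentences:
        for token, tag in sentence:
            if any(ch.isspace() for ch in token):
                # Columns are whitespace-separated, so tokens cannot contain spaces.
                raise ValueError(f"token contains whitespace: {token!r}")
            lines.append(f"{token} {tag} {chunk_placeholder} {tag}")
        lines.append("")  # blank line ends each sentence
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_conll2003([[("我", "PN"), ("爱", "VV"), ("北京", "NR")]]))
```

The tags `PN`, `VV`, and `NR` are only example labels; use whatever tag set your corpus provides.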
@maziyarpanahi Many many thanks!
I'm so grateful for your kind help and prompt replies. Wish you a good day!
@jackieair Here is a simple example of how to train a word segmenter: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb
Obviously, the larger and better the dataset, the higher the accuracy.
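If your corpus is whitespace-segmented text (one sentence per line, words separated by spaces, which is how I understand the backoff2005 data to be laid out), a small preprocessing step can turn it into per-character tags. The sketch below is only an illustration: the `character|tag` layout and the LL / MM / RR / LR tag names (word start, word middle, word end, single-character word) are what I recall from the WordSegmenterApproach documentation, so please verify them against the notebook above. The function names are hypothetical.

```python
def tag_word(word):
    """Return (character, tag) pairs for one segmented word."""
    if len(word) == 1:
        return [(word, "LR")]  # single-character word
    # First character opens the word, last closes it, the rest are in the middle.
    tags = ["LL"] + ["MM"] * (len(word) - 2) + ["RR"]
    return list(zip(word, tags))


def segmented_to_tagged(line, delimiter="|"):
    """Convert a whitespace-segmented sentence to 'char|tag char|tag ...'."""
    pairs = [pair for word in line.split() for pair in tag_word(word)]
    return " ".join(f"{ch}{delimiter}{tag}" for ch, tag in pairs)


if __name__ == "__main__":
    print(segmented_to_tagged("十四 不是 四十"))
    print(segmented_to_tagged("我 爱 北京大学"))
```

Each output line can then be written to a text file and read as a POS-style training set, with characters in place of tokens.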
Hi, I'm trying to train a NerCrfModel using lines 69-104 of the following file: https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala
However, when I use the pretrained PerceptronModel and WordEmbeddingsModel, an error like this occurs:
Exception in thread "main" com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1467)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1326)
    at com.johnsnowlabs.client.aws.AWSGateway.getObjectFromS3(AWSGateway.scala:91)
    at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:77)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:445)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:370)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:405)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:400)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:44)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:41)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:154)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:154)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:51)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:51)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:148)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:148)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
    at com.huawei.bigdata.ml.crftest.CrfTest$.main(CrfTest.scala:90)
    at com.huawei.bigdata.ml.crftest.CrfTest.main(CrfTest.scala)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
    at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
    at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
    at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
    at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
    ... 35 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:450)
    at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:317)
    at sun.security.validator.Validator.validate(Validator.java:262)
    at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
    ... 62 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
    at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
    at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:445)
    ... 68 more
I tried many methods but unfortunately, they all failed.
Is there any way to download the pretrained models in advance, so that I can load them directly from local storage instead of downloading them while training on the cluster?
It seems you are behind some sort of proxy/firewall. You can use Google Colab or Kaggle for free with GPU, or you can follow these steps to download any model and use .load for offline use: https://github.com/JohnSnowLabs/spark-nlp#offline
I'll close this as it is no longer an issue. (mainly WordSegmenter example)
Thanks!
Name of the Spark NLP feature whose docs need improvement: Linear Chain CRF
What you think the docs should say: Hi, I want to thank you for this great NLP project first.
I am new to NLP and want to use LinearChainCrf specifically for Chinese word segmentation. As far as I know, a CRF needs feature templates (or feature functions, like Unigram/Bigram) for training, as in CRF++.
However, I found no instructions on how to use LinearChainCrf. I don't see how to set up the training pipeline for CRF (not NerCrf), what dataset format it requires, etc.
Could you please offer some help? : )
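To make the "feature template" idea in this question concrete, here is a small Python sketch of window-based unigram features over characters, in the spirit of CRF++ unigram templates. It is only an illustration of the concept and not Spark NLP's LinearChainCrf API; the function name, the `U00`-style feature prefixes, and the `<PAD>` boundary marker are hypothetical choices.

```python
def unigram_features(chars, offsets=(-1, 0, 1), pad="<PAD>"):
    """Build one feature list per character from a window of relative offsets.

    For offsets (-1, 0, 1) each position gets three features: the previous
    character, the current character, and the next character. Positions that
    fall outside the sentence are filled with `pad`.
    """
    features = []
    for i in range(len(chars)):
        row = []
        for k, offset in enumerate(offsets):
            j = i + offset
            value = chars[j] if 0 <= j < len(chars) else pad
            row.append(f"U{k:02d}:{value}")  # template index + observed value
        features.append(row)
    return features


if __name__ == "__main__":
    for ch, feats in zip("北京大学", unigram_features(list("北京大学"))):
        print(ch, feats)
```

A CRF trainer would pair each of these per-character feature lists with a segmentation label for that character and learn weights for the feature/label combinations.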