JohnSnowLabs / spark-nlp

State of the Art Natural Language Processing
https://sparknlp.org/
Apache License 2.0

How to train Linear Chain CRF for word segmentation? #7485

Closed jackieair closed 2 years ago

jackieair commented 2 years ago

Name of the Spark NLP feature whose docs need improvement: Linear Chain CRF

What you think the docs should say: Hi, first I want to thank you for this great NLP project.

I am new to NLP and want to use LinearChainCrf specifically for Chinese word segmentation. As far as I know, a CRF needs feature templates (or feature functions, such as Unigram/Bigram) for training, as in CRF++.

However, I found no instructions on how to use LinearChainCrf. I don't see how to set up the training pipeline for CRF (not NerCrf), what dataset format it requires, etc.

Could you please offer some help? : )

maziyarpanahi commented 2 years ago

Hi @jackieair

I don't believe we have LinearChainCrf in Spark or Spark NLP. However, for training new word segmentation models for languages that don't separate words with whitespace, such as Chinese and Korean, we have an annotator called WordSegmenter.

I can see that we are missing a notebook that demonstrates how to train a Word Segmenter with this annotator. We will add one shortly.

maziyarpanahi commented 2 years ago

@DevinTDHa Could you please add a new directory named chinese here, with a notebook that shows how to use WordSegmenterApproach to train a Chinese word segmentation model?

https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/jupyter/training

maziyarpanahi commented 2 years ago

Hi, thanks for your prompt reply.

I guess you've implemented LinearChainCrf already: https://github.com/JohnSnowLabs/spark-nlp/tree/master/src/main/scala/com/johnsnowlabs/ml/crf

Can this version be used for word segmentation or POS-Tagging? Sorry I have little knowledge in this area, so the question may sound dumb.

We use NerCrf for training Named Entity Recognition. However, since POS and NER are both token classification tasks, the answer is yes: you can just pretend your POS tags are NER tags. (We do have a very fast and accurate trainable POS annotator, but NerCrf can be used for POS as well.) It cannot be used for word segmentation, though. That is a different task that requires changes in how the CRF works, and our model is designed for token classification at the moment.

That said, it could be a good idea to look into the existing CRF and see whether we can use it for word segmentation, if there is a paper for such an implementation.

maziyarpanahi commented 2 years ago

@jackieair I don't know what happened; your comment disappeared, but it is preserved in my reply.

jackieair commented 2 years ago

Did you mean that I can use NerCrf for POS tagging?

If so, I can train a NerCrfModel for the POS task; I noticed that NerCrf is based on LinearChainCrf.

The dataset I would use is backoff2005, so the main work is converting backoff2005 to the required format. Do I understand this correctly? Or do I only need to set up documentAssembler, tokenizer, posTagger, and embeddings as shown here: https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala

jackieair commented 2 years ago

Hi, yes, I don't know why this happened. : )

maziyarpanahi commented 2 years ago

Yes, since POS is like NER, you can use NerCrfApproach for training POS. There are examples of how to do so.

You need a dataset in CoNLL 2003 format, which is the format accepted by most of the trainable annotators in Spark NLP, so you will need to convert your dataset to that format first.

The spark-nlp-workshop repository has lots of examples. I highly recommend this part, which teaches you how to do most NLP tasks in Spark NLP (notebooks #1 to #16): https://github.com/JohnSnowLabs/spark-nlp-workshop/tree/master/tutorials/Certification_Trainings/Public
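
As a rough illustration of the CoNLL 2003 layout mentioned above, here is a minimal Python sketch that writes (token, tag) pairs as four-column CoNLL lines, with the POS tag placed in the last (NER label) column, in the spirit of "pretend your POS tags are NER tags". The function name to_conll2003 and the -X- placeholders in the POS and chunk columns are assumptions made for illustration; they are not prescribed by Spark NLP, so check the result against what the CoNLL reader in your version expects.

```python
from typing import Iterable, List, Tuple

# Conventional first line of a CoNLL 2003 file.
DOCSTART = "-DOCSTART- -X- -X- O"


def to_conll2003(sentences: Iterable[List[Tuple[str, str]]]) -> str:
    """Render sentences of (token, tag) pairs as CoNLL 2003 text.

    Column order is: token, POS, chunk, label. The tag goes into the
    label column; the POS and chunk columns get a "-X-" placeholder
    (an assumption of this sketch, not a Spark NLP requirement).
    """
    lines = [DOCSTART, ""]
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token} -X- -X- {tag}")
        # A blank line separates sentences.
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical two-token sentence with made-up POS tags.
    print(to_conll2003([[("我", "PN"), ("爱", "VV")]]))
```

The resulting text can be saved to a file and used as the training dataset; how the placeholder columns are treated depends on the rest of the pipeline.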

jackieair commented 2 years ago

@maziyarpanahi Many many thanks!

I'm so grateful for your kind help and prompt replies. Have a good day!

maziyarpanahi commented 2 years ago

@jackieair Here is a simple example of how to train a word segmenter: https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb

Naturally, the larger and better the dataset, the higher the accuracy.
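
For a corpus such as backoff2005, a conversion step is usually needed before training. The sketch below is only an illustration: it assumes the corpus has one sentence per line with words separated by whitespace, and it turns each character into a character|TAG pair marking its position inside the word. The tag names (LL, MM, RR, LR) and the | delimiter are assumptions about the training file layout and are exposed as parameters; please verify them against the linked notebook before relying on them.

```python
def tag_characters(
    segmented_line: str,
    left: str = "LL",     # first character of a multi-character word (assumed tag name)
    middle: str = "MM",   # inner character of a word (assumed tag name)
    right: str = "RR",    # last character of a multi-character word (assumed tag name)
    single: str = "LR",   # single-character word (assumed tag name)
    delimiter: str = "|",
) -> str:
    """Convert a whitespace-segmented sentence into character-level tags."""
    tagged = []
    for word in segmented_line.split():
        if len(word) == 1:
            tagged.append(f"{word}{delimiter}{single}")
            continue
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                tag = left
            elif i == last:
                tag = right
            else:
                tag = middle
            tagged.append(f"{ch}{delimiter}{tag}")
    return " ".join(tagged)


if __name__ == "__main__":
    # Hypothetical segmented sentence: three words, the last one a single character.
    print(tag_characters("上海 计划 到"))
```

Applying this line by line to the segmented training file yields a character-tagged file that a POS-style dataset reader could consume, if the assumed layout matches.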

jackieair commented 2 years ago

Hi, I'm trying to train a NerCrfModel using lines 69-104 of the following file: https://github.com/JohnSnowLabs/spark-nlp/blob/master/src/main/scala/com/johnsnowlabs/nlp/annotators/ner/crf/NerCrfApproach.scala

However, when I use the pretrained PerceptronModel and WordEmbeddingsModel, an error like this occurs:

Exception in thread "main" com.amazonaws.SdkClientException: Unable to execute HTTP request: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1175)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1121)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1467)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1326)
    at com.johnsnowlabs.client.aws.AWSGateway.getObjectFromS3(AWSGateway.scala:91)
    at com.johnsnowlabs.client.aws.AWSGateway.getMetadata(AWSGateway.scala:77)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:62)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:68)
    at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:145)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:445)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadResource(ResourceDownloader.scala:370)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:405)
    at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.downloadModel(ResourceDownloader.scala:400)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:44)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:41)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:154)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:154)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained(HasPretrained.scala:51)
    at com.johnsnowlabs.nlp.HasPretrained.pretrained$(HasPretrained.scala:51)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.com$johnsnowlabs$nlp$annotators$pos$perceptron$ReadablePretrainedPerceptron$$super$pretrained(PerceptronModel.scala:160)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained(PerceptronModel.scala:148)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.ReadablePretrainedPerceptron.pretrained$(PerceptronModel.scala:148)
    at com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronModel$.pretrained(PerceptronModel.scala:160)
    at com.huawei.bigdata.ml.crftest.CrfTest$.main(CrfTest.scala:90)
    at com.huawei.bigdata.ml.crftest.CrfTest.main(CrfTest.scala)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
    at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1946)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:316)
    at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:310)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1639)
    at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:223)
    at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1037)
    at sun.security.ssl.Handshaker.process_record(Handshaker.java:965)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1064)
    at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1367)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1395)
    at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1379)
    at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:436)
    at com.amazonaws.thirdparty.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:384)
    at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:142)
    at com.amazonaws.thirdparty.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
    at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
    at com.amazonaws.http.conn.$Proxy20.connect(Unknown Source)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
    at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1297)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
    ... 35 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:450)
    at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:317)
    at sun.security.validator.Validator.validate(Validator.java:262)
    at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:330)
    at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:237)
    at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:132)
    at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1621)
    ... 62 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
    at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
    at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
    at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
    at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:445)
    ... 68 more

I tried many approaches, but unfortunately they all failed.

Is there a way to download the pretrained models in advance, so that I can load them from local storage instead of downloading them during training on the cluster?

maziyarpanahi commented 2 years ago

It seems you are behind some sort of proxy or firewall. You can use Google Colab or Kaggle, which offer free GPUs, or you can follow these steps to download any model and use .load for offline use: https://github.com/JohnSnowLabs/spark-nlp#offline

I'll close this, as it is no longer an issue (mainly the WordSegmenter example).
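
For the offline route, a minimal sketch of the local preparation step might look like the following. It assumes the pretrained model was downloaded manually as a zip archive on a machine with internet access and copied to the cluster; the helper name extract_model and all paths are hypothetical. The commented lines at the end show where .load would come in, using PerceptronModel as named in the error above.

```python
import zipfile
from pathlib import Path


def extract_model(archive: str, models_dir: str) -> Path:
    """Unpack a manually downloaded model archive into models_dir.

    Returns the directory the archive was extracted to, named after the
    archive file without its .zip suffix.
    """
    archive_path = Path(archive)
    target = Path(models_dir) / archive_path.stem
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(target)
    return target


# Hypothetical usage (paths and archive name are placeholders):
#
#   model_path = extract_model("/tmp/pos_model.zip", "/opt/spark-nlp-models")
#
#   from sparknlp.annotator import PerceptronModel
#   pos_tagger = (
#       PerceptronModel.load(str(model_path))
#       .setInputCols(["document", "token"])
#       .setOutputCol("pos")
#   )
```

On a cluster, the extracted directory generally needs to live somewhere every node can read, such as a shared or distributed file system.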

jackieair commented 2 years ago

Thanks!