Hi @skwolvie, for this volume I would recommend Spark NLP's YAKE implementation.
You can find more about it here: https://nlp.johnsnowlabs.com/docs/en/annotators#yake
Since Spark distributes the processing, I think it could be a good fit for your use case.
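For reference, here is a minimal sketch of what that pipeline could look like, based on the linked annotator documentation. The parameter values (n-gram range, number of keywords) and the input path are illustrative assumptions, not tuned recommendations; note also that in older Spark NLP releases the annotator class was named `YakeModel` rather than `YakeKeywordExtraction`:

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import SentenceDetector, Tokenizer, YakeKeywordExtraction

spark = sparknlp.start()

# Wrap raw text into Spark NLP's document structure.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")

# YAKE operates on tokens, so split into sentences and tokens first.
sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Illustrative settings: up to 3-gram candidates, top 20 keywords per document.
keywords = (YakeKeywordExtraction()
            .setInputCols(["token"])
            .setOutputCol("keywords")
            .setMinNGrams(1)
            .setMaxNGrams(3)
            .setNKeywords(20))

pipeline = Pipeline(stages=[document, sentences, tokens, keywords])

# Hypothetical input: 100k documents with a "text" column.
df = spark.read.parquet("documents.parquet")
result = pipeline.fit(df).transform(df)
result.selectExpr("explode(keywords.result) AS keyword").show()
```

Because every stage here is a regular Spark transformer, the work scales out across executors instead of running row by row in a single Python process.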
Can you please provide a kickstart on how to do it with Spark NLP? Also, if that is not possible, I would like to understand why it takes so long with the pandas apply method. It takes less than a second to apply yake to one document, but the time taken increases exponentially as the number of documents increases. 100k rows is not a huge dataset.
My feeling is that deduplication is where most of the time goes.
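If deduplication really is the bottleneck, one knob worth experimenting with (a hedged suggestion, not a confirmed fix) is yake's `dedupFunc`/`dedupLim` parameters, which select the string-similarity function and threshold used to filter near-duplicate candidates:

```python
import yake

# Assumption: the default deduplication function is the SequenceMatcher-based
# "seqm". "jaro" computes a cheaper similarity and may cut per-document time
# if deduplication dominates; the extracted keywords can differ slightly.
kw_extractor = yake.KeywordExtractor(
    lan="en",
    n=3,               # max n-gram size
    dedupLim=0.9,      # similarity threshold above which candidates merge
    dedupFunc="jaro",  # alternatives: "seqm" (default), "leve"
    top=20,
)
print(kw_extractor.extract_keywords("sample document text"))
```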
Hi, it works well with one document, but if I apply this kw_extractor to 100k rows of documents with pandas apply, it takes more than 2 days to complete. Is there any way to speed this process up?
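Short of moving to Spark, one single-machine option is to parallelize the extraction across CPU cores, since pandas `apply` runs everything in a single process. A minimal sketch using the standard library's `multiprocessing`; the extractor settings and input path are placeholders, so swap in your own `kw_extractor` configuration:

```python
import multiprocessing as mp
import pandas as pd
import yake

_kw = None

def _init_worker():
    # Build one extractor per worker process so it is not pickled per task.
    global _kw
    _kw = yake.KeywordExtractor(lan="en", n=3, top=20)  # illustrative settings

def _extract(text):
    return _kw.extract_keywords(text)

if __name__ == "__main__":
    df = pd.read_parquet("documents.parquet")  # assumes a "text" column
    with mp.Pool(initializer=_init_worker) as pool:
        # chunksize batches tasks to reduce inter-process overhead.
        df["keywords"] = pool.map(_extract, df["text"].tolist(), chunksize=100)
```

This keeps the per-document cost the same but divides the wall-clock time by roughly the number of cores, which on a typical machine should bring the 2-day run down considerably.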