fergiemcdowall / stopword

A module for node.js and the browser that takes in text and strips it of stopwords
MIT License

Add Punjabi Gurmukhi stopwords and test #21

Closed eklem closed 6 years ago

eklem commented 7 years ago

Could base it on this paper, but I'm not sure about the license situation: http://ijoes.vidyapublications.com/paper/Vol8/15-Vol8.pdf

eklem commented 7 years ago

Use the new stopword-trainer. Still needs help from someone with good knowledge of Punjabi to verify the result.

eklem commented 6 years ago

Check whether wikipedia.org has articles in Punjabi, and test whether wikiminer can help get the content.

eklem commented 6 years ago

https://en.wikipedia.org/wiki/Special:AllPages?from=&to=&namespace=0

eklem commented 6 years ago

So, Punjabi is written in two different scripts: Gurmukhi and Shahmukhi.

They each have their own Wikipedia site, so it'll be possible to use the wikipedia-stopword-crawler to create a stopword list for each:

eklem commented 6 years ago

Let me know if you need this. It will happen, but it will happen sooner with someone needing it 😃

manmeet3591 commented 6 years ago

Hi Eklem

I know the Gurmukhi script and the Punjabi language as a native speaker, and I work at the intersection of deep learning and atmospheric science. I came here from this article: https://medium.com/norch/input-text-ouput-stopwords-bf40f4f22900 Please let me know how I can contribute to the code and the language. I would be happy to be a part of both.

Regards Manmeet

eklem commented 6 years ago

Cool, @manmeet3591. The most important thing is to verify that the output consists of words of little meaning. And, of course, that somebody needs this. I'll start by crawling some Wikipedia articles.

manmeet3591 commented 6 years ago

That's awesome, @eklem. I understand that the resulting word list needs to be meaningful, and I will give time to it. Additionally, it would be of great use to many people. Let me know if we can do more on the code side as well.

eklem commented 6 years ago

Nice, I'll just verify my process tonight with a Norwegian dataset to see that everything still works as intended. The stopword-trainer is at it as we speak. Then I'll start tomorrow on tweaking my wikipedia-crawlers to match the Gurmukhi Wikipedia-site and possibly start crawling Wikipedia article URLs. When that's done, I need to crawl the pages and run the stopword-trainer.

And when that's done, I need your help to verify 😄
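The last step of that pipeline, turning crawled text into a ranked word list, can be sketched roughly as below. This is only a stand-in that counts raw word frequency; the actual stopword-trainer may well use a different measure, and `rankByFrequency` and the sample documents are made up for illustration.

```javascript
// Hypothetical stand-in for a trainer: rank words by how often they occur
// across a set of documents. Not the stopword-trainer's actual algorithm.
function rankByFrequency(docs) {
  const counts = new Map();
  for (const doc of docs) {
    // Split on whitespace and skip empty strings from leading/trailing spaces.
    for (const word of doc.split(/\s+/u).filter(Boolean)) {
      counts.set(word, (counts.get(word) || 0) + 1);
    }
  }
  // Most frequent first; keep only the words, drop the counts.
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([word]) => word);
}

// Made-up Norwegian sample documents.
const docs = ["en katt og en hund", "en bil og en båt", "en dag"];
const ranked = rankByFrequency(docs);
console.log(ranked.slice(0, 2));
```

Words that top such a ranking are only candidates; as the thread goes on to show, a speaker of the language still has to check them.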

manmeet3591 commented 6 years ago

Perfect :)

eklem commented 6 years ago

next page-check can be done on "ਅਗਲਾ ਸਫ਼ਾ"

manmeet3591 commented 6 years ago

Yes it can be done on "ਅਗਲਾ ਸਫ਼ਾ"

eklem commented 6 years ago

So, a little status. I'm having some problems matching "Next page" (in Gurmukhi) with what I find in the HTML. I'll get to that tomorrow.
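The thread does not say what caused the mismatch. One possible cause, offered here purely as an assumption, is that the same visible text can be stored as different code point sequences: the ਫ਼ in "ਸਫ਼ਾ" can be the single code point U+0A5E or ਫ (U+0A2B) followed by a nukta (U+0A3C). Normalizing both sides before comparing sidesteps that. `sameText`, `findNextPageHref` and the sample HTML are hypothetical.

```javascript
// Hypothetical helper: compare two strings after Unicode normalization,
// so precomposed and decomposed spellings of the same text match.
function sameText(a, b) {
  return a.normalize("NFC").trim() === b.normalize("NFC").trim();
}

// Hypothetical helper: return the href of the first <a> whose text equals `label`.
// A regex is enough for a sketch; a real crawler would likely use an HTML parser.
function findNextPageHref(html, label) {
  const anchor = /<a\b[^>]*\bhref="([^"]*)"[^>]*>([^<]*)<\/a>/giu;
  for (const match of html.matchAll(anchor)) {
    if (sameText(match[2], label)) return match[1];
  }
  return null;
}

// "Next page" label written with the single code point U+0A5E ...
const label = "ਅਗਲਾ ਸ\u0A5Eਾ";
// ... and the same text in the page written as U+0A2B followed by U+0A3C.
const html = '<a href="/next?from=X" title="t">ਅਗਲਾ ਸ\u0A2B\u0A3Cਾ</a>';

console.log(label === "ਅਗਲਾ ਸ\u0A2B\u0A3Cਾ"); // raw comparison of the two spellings
console.log(findNextPageHref(html, label));   // comparison after normalization
```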

Before trying to crawl, I managed to weed out a bug in the stopword-trainer, so it now handles more than a-z letters, and hopefully Gurmukhi. I've added tests for Norwegian.
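A tokenizer that is not limited to a-z can be sketched with Unicode property escapes. This is an illustration only, not the stopword-trainer's actual code; `tokenize` is a hypothetical name.

```javascript
// Hypothetical tokenizer, not the stopword-trainer's actual code.
// \p{L} = letters, \p{M} = combining marks (Gurmukhi vowel signs, nukta, addak),
// \p{N} = digits. The "u" flag is required for \p{...} escapes.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{M}\p{N}]+/gu) || [];
}

console.log(tokenize("ਇਹ ਇੱਕ ਪਿੰਡ ਹੈ।"));      // Gurmukhi
console.log(tokenize("Dette er en blåbærtest.")); // Norwegian
```

With this pattern the danda "।" (U+0964) counts as punctuation, so a word followed by a full stop and the same word on its own come out as one token.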

So, tomorrow: fix the "next page" issue and crawl URLs.

manmeet3591 commented 6 years ago

Good going @eklem

eklem commented 6 years ago

Got the URL crawler going, so now I have a little over 30,000 URLs to crawl later. Will look into that tomorrow.

manmeet3591 commented 6 years ago

Is the URL crawler still running?

eklem commented 6 years ago

It's finished and looking good. I was sick yesterday, so didn't get anything done then. The crawler for content just needs a small tweak I think, and I'll hopefully get that going today.

eklem commented 6 years ago

Here's the output from the stopword-trainer after crawling 30 docs (have urls for approximately 30000 docs) and selecting the 500 most used words:

["ਹੈ।","ਦਾ","ਦੇ","ਹੈ","ਵਿੱਚ","ਅਤੇ","ਇੱਕ","ਨੂੰ","ਇਹ","ਲਈ","ਤੋਂ","ਦੀ","ਸਾਲ","ਇਸ","ਹਨ।","ਨਾਲ","ਦਿਨ","ਡੋਮੇਨ","ਜੋ","ਜਾਂਦਾ","ਤੇ","ਹਨ","0","ਹੁੰਦਾ","ਵਾਂ","ਦੁਆਰਾ","ਕਰਨ","1","ਵੀ","ਲੈੱਵਲ","ਕਿ","ਜਿਸ","ਪਰ","ਮੁਤਾਬਕ","ਕਲੰਡਰ","ਬਾਕੀ","ਕੋਡ","ਰਜਿਸਟਰੀ","।","ਕੀਤਾ","the","in","ਵਿਚ","ਦੇਸ਼","ਲੀਪ","ਗ੍ਰੈਗਰੀ","010","ਜਾਂ","ਦੀਆਂ","ਨੇ","ਨਹੀਂ","ਕਰ","ਉਹ","ਗਿਆ","ਇਕ","ਸਕਦਾ","ਖੇਡ","2","ਸ਼ੁਰੂ","of","ਬਾਅਦ","ਇੰਟਰਨੈੱਟ","ਟਾੱਪ","ਕੁਝ","ਉਸ","ਵਿਸ਼ਵ","and","to","level","ਹੋਰ","ਸੇਕੰਡ","ਸਦੀ","11","ਜਾ","ਕਰਦੇ","ਤੱਕ","ਵਰਤਿਆ","ਹੁੰਦੀ","ਹੋ","ਨਾਂ","top","ਸੀ।","ਜਾਂਦੀ","cctld","ਕੋਈ","ਸੀ","ਕੀਤੀ","a","or","is","ਗਏ","ਆਈ","domain","ਲੜੀ","ਅੰਦਰ","ਕਰਦੀ","ਆਪਣੇ","ਅੱਖਰ","ਕੰਪਿਊਟਰ","ਭਾਰਤ","ਜਿੱਥੇ","ਸਾਰੇ","ਵੀਂ","ਹੋਇਆ।","ਚਲਾਇਆ","ਸ਼ਾਮਲ","ਪਹਿਲਾਂ","ਇੰਟਰਨੈਟ","ਜਿਆਦਾ","ਯੂਨਾਈਟਡ","nic","ਜਿਵੇਂ","ਤੌਰ","ਕਿਉਂਕਿ","ਜਿਸਨੂੰ","ਖੇਡਣ","ਰਿਹਾ","ਕਾਰਨ","ਅਮਰੀਕਾ","ਏ","ਹੇਂਠ","ਹੇਠ","as","ਬਹੁਤ","ਕਰਕੇ","ਸ਼ਬਦ","ਮੂਲ","ਚਾਰ","ਦਹਾਕਾ","26","ਆਉਂਦਾ","ਪੱਧਰ","ਸਰਕਾਰ","ਕੇ","ਹੋਈ","ਜਰਮਨੀ","ਹੋਣ","ਰਜਿਸਟਰੀਆਂ","ਨਾਮ","ਦੋ","ਤਾਂ","ਅਧਾਰ","10","ਦ੍ਰਿਸ਼ਟੀਕੋਣ","ਕੰਮ","ਖ਼ਤਮ","ਕੇਂਦਰ","ਦੇਣ","ਗਈਆਂ","national","ਰੱਖੇ","ca","ਥੱਲੇ","ਐਨ","ਸੰਸਥਾਵਾਂ","ਕੀਤੀਆਂ","space","state","this","which","ਬਣ","ਖੇਤਰਾਂ","ਵਿਸ਼ੇਸ਼ਤਾਵਾਂ","ਮਿਲਦਾ","ਜਦੋਂ","ਬਣਦਾ","ਦੇਸੀ","ਸ਼ਨੀਵਾਰ","ਦੂਜੀ","0100","21","ਜਾਣੀ","ਟਰੈਕ","ਸਮਝ","ਜਰਮਨ","9","4","3","ਸਿੱਧੀ","ਲਾਈਨ","ਹਰ","ਅਪਰੈਲ","ਦਿੰਦੇ","ਹੋਏ","ਹੁੰਦੇ","ਸੁਧਾਰ","ਬੇਅੰਤ","ਆਫ","ਦੇਖ","ਕਿੱਤਾ","1985","ਰਾਸ਼ਟਰੀ","ਸਕਦੀ","ਜੀ","ਥਾਂ","ਇੱਥੇ","ਤਰਾਂ","ਕਈ","ਅੰਤਰਰਾਸ਼ਟਰੀ","names","many","city","be","was","original","other","from","ਅੰਤ","ਰੂਪ","ਮਹੱਤਵਪੂਰਨ","ਲੱਗਿਆ","ਸਥਾਨ","ਸੈੱਟ","ਜਾਣਕਾਰੀ","ਇਸਨੂੰ","ਵੱਡਾ","ਤੱਤ","ਉਲਟ","ਪਹਿਲੀ","ਆਲੋਚਕਾਂ","ਰੂਸੀ","235","131","130","ਮਈ","325","324","41","ਫ਼ਰਵਰੀ","ਕੱਤਕ","51","315","314","ਨਵੰਬਰ","ਮੱਘਰ","345","344","ਦਸੰਬਰ","174","192","191","ਜੁਲਾਈ","356","355","ਜਨਵਰੀ","ਸ਼ਾ","ਨਾ","ਪੋਹ","026","265","101","100","143","223","222","ਅਗਸਤ","82","284","283","ਅਕਤੂਬਰ","ਉਗਲੀਆਂ","ਮਨੁੱਖੀ","ਮੰਨਿਆ","ਦਸ਼ਮਲਵ","ਪਹਿਲਾ","ਸੰਖਿਆ","ਜਿਸਤ","ਪ੍ਰਕਿਰਤਿਕ","ਦਸ","1090","09","01099","06","01066","1060","01060","0106","0102","ਐਤਵਾਰ","1010","01010","ਗਤੀਵਿਧੀਆਂ","ਲੋਕਪ੍ਰਿਯ","l","ਦੌੜੀ","ਈਵੇਂਟ","ਅਥਲੈਟਿਕਸ","ਦੌੜ","ਮੀਟਰ","ਗੀਤ","ਗਾਇਆ"
,"ਕੌਰ","ਰਣਜੀਤ","ਸਦੀਕ","ਮੁਹੰਮਦ","ਨੋਟ","cft","ads","ਅਸਥਿਰਾਂਕਾਂ","ਅਜੋਕੇ","ਸੋਮਾ","ਗਹਿਰੀ","ਗੁਣਾਤਮਿਕ","ਵਜਾਏ","ਵਿਧੀਆਂ","ਅਨੁਮਾਨਾਂ","ਗਿਣਾਤਮਿਕ","ਹੁਣ","ਬਣਾਉਂਦੀ","ਸ਼ੋਧਾਂ","ਜਿਮੇਵਾਰ","ਤੱਥ","ਅਨੰਤ","ਗਿਣਤੀ","ਕਲਰਾਂ","ਐਕਸਪੈਂਸ਼ਨ","n","ਸਕੀਮ","ਸੰਖੇਪਤਾ","ਪਛਾਣੀ","ਚੰਗੀ","ਮਕੈਨਿਕਸ","ਸਟੈਟਿਸਟੀਕਲ","ਥਿਊਰੀ","ਫੀਲਡ","ਕੁਆਂਟਮ","ਖੇਡਦਾ","ਬੁਨ੍ਦੇਸਲੀਗ","ਅਧਾਰਤ","ਸਟੇਡੀਅਮ","ਰਾਈਨਐਨਰਜੀ","ਸਥਿੱਤ","ਵਿਖੇ","ਮਸ਼ਹੂਰ","ਕਲਨ","ਕਲੱਬ","ਫੁੱਟਬਾਲ","01","ਲੇਬਲ","ਰਿਕਾਰਡ","71","disc","ਕੁਰਬਾਨ","ਬਹਾਲ","ਨੈਟਵਰਕ","ਪਰਵੇਸ਼","ਰੁਕਾਵਟ","ਕੋਰਬੇਨਿਕ","ਆਤਮਾਵਾਂ","ਘੇਰਾ","ਕੋਰਬੇਨੀਕ","ਫਾਈਨਲ","ਬ੍ਰੇਸਲੇਟ","ਦੂਰ","ਬਰੈਸਲੇਟ","ਯਾਦ","ਹਾਰਾਲਡ","ਸ਼ੈਡੋ","ਸੰਕੇਤ","ਸਲਾਹ","ਹਅਰਵਿਕ","ਹਰਲਡ","ਰਹੇ","ਜੀਅ","ਰਾਹੀਂ","ਅਵਤਾਰਾਂ","ਏਰਿਆ","ਅਗਲਾ","ਤਾਰਵੋਸ","ਓਪਰੇਸ਼ਨ","ਕਰਦੀ।","ਬੰਦ","ਤਾਜ਼ਾ","ਬਣਦੀ","ਮਜ਼ਬੂਤ","ਕੁਬਿਆ","ਦੌਰਾਨ","ਬੇਸਬਰੇ","ਮਾਛਾ","ਸ਼ਰਾਪ","ਮੀਆ","ਮਿਆ","ਮੈਂਬਰ","ਤਲ","ਕਾਪੀ","ਨੈਟ","ਸਮੱਸਿਆ","ਅਸਥਿਰ","ਵਧਦੀ","ਸਰਵਰ","ਵਰਤਮਾਨ","ਵੇਖਦਾ","27","ਪੂਲ","ਸੰਸਾਧਨਾਂ","ਗੋਰਰੇ","ਸੰਚਾਲਨ","ਸਹਿਮਤ","ਸ਼ਰਾਪਾਂ","ਹੋਰਨਾਂ","ਪਿੰਗਲਾ","ਕਿਊਬੀਆ","ਲਿਓਸ","ਮੀਟਿੰਗ","ਖੁਲਾਸੇ","25","ਖਰਾਬ","ਨਾਪਾਕ","ਦੁਨੀਆਂ","ਨਤੀਜਾ","ਤਬਾਹ","ਫਿੱਡਚੇਲ","ਮੌਨਸਟਰ","ਯੋਜਨਾ","ਸਰਾਪ","ਵਧਾਉਣ","ਹੈਲਬਾ","24","ਜ਼ਰੀਏ","ਨਵੇਂ","ਨਿਰਣੇ","ਦੋਨਾਂ","ਸੁਗੰਧਿਤ","ਹਾਲਤਾਂ","ਦੱਸਦਾ","23","ਖਤਮ","ਖੁਦ","ਅਹਿਸਾਸ","22","ਫੈਲ","ਸੰਸਾਰ","ਲੱਗਦਾ","ਪਰਤਦੇ","ਟਾਊਨ","ਸ਼ਰਾਰਤੀ","ਮੈਗੁਸ","ਭਟਕਣ","ਫਿਰਦੌਸ","ਜਗ੍ਹਾ","ਮਦਦ","ਸਲੱਮ","ਨੈੱਟ","ਧਿਰ","ਵਿਰੋਧੀ","ਦਿਖਾਈ","ਐਪੀਟੀਫਾ","ਟਕਰਾਓਮ","ਵੇਵ","ਕਵਰਡ","ਇਨਸ","ਸੁਝਾਅ","ਭਟਕ","ਸੰਪਰਕ","ਸੰਖੇਪ","ਲੱਦਣਾਂ","ਦੁਹਰਾਉਂਦੇ","ਮੁਸ਼ਕਲ","ਮਿਲਣ","ਏਆਈ","ਦੱਸਦੀ","ਆਰਾ","ਹਰਾਉਣ","ਸ਼ਕਤੀ","ਸਮਾਨ","ਇਨੀਸ","ਭੇਜਦਾ","19","ਦਿਵਾਉਂਦਾ","ਯਕੀਨ","ਪਰਖਣ","ਦਖਲ","ਹੇਲਾਬਾ","ਫੇਲ","ਐਨਕ੍ਰਿਪਟ","ਟਾਈਟਲਾਈਟ","ਮਿਟਾਉਣ","ਕਾਈਟ","18","ਘੋਸ਼ਿਤ","ਗ਼ੈਰਕਾਨੂੰਨੀ","ਬਰੇਸਲੈੱਟ","ਕਥਾ","ਪ੍ਰਸ਼ਾਸਕ","ਐਨਕੌਨਮੈਂਟ","17","ਬਚਦੇ","ਕਿਊਬਿਆ","ਬਦਲ","ਵੱਡੇ","ਹਰਾਉਂਦੇ","ਸਫੈਥ","ਰੱਖਿਆ।","ਓਰਾਕਾ","ਸਕੈਥ","ਛਾਣਬੀਣ","ਲੀਡਾਂ","ਫੈਸਲਾ","ਸਹਿਯੋਗ","ਕੋਮਾ"]

As you can see, there are quite a few English words in there. We'll remove them manually. Also, because Wikipedia is an encyclopedia, some frequently used words actually carry meaning and are therefore not stopwords. We'll remove them manually too.

500 words is a lot; we could make a list that is 200 or 250 words long. Since it's sorted by frequency of use, it can be sliced from the bottom to be less aggressive.
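Because the list is ordered from most to least frequent, trimming it is a plain slice from the top. A minimal sketch, with a made-up `topStopwords` helper and a short example list taken from words in this thread:

```javascript
// Hypothetical example list, sorted by descending frequency as in the thread.
const rankedWords = ["ਦੇ", "ਹੈ", "ਵਿੱਚ", "ਦਾ", "ਅਤੇ", "ਦੀ", "ਇੱਕ", "ਨੂੰ"];

// Keep only the `size` most frequent entries; a smaller size removes fewer
// distinct words from a text, which makes the stopword list less aggressive.
function topStopwords(rankedList, size) {
  return rankedList.slice(0, size);
}

console.log(topStopwords(rankedWords, 3));
```

`slice` returns a copy, so the full ranked list stays available for trying other cut-offs such as 200 or 250.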

If the result looks not all that bad, I'll start the process of crawling docs this evening.

manmeet3591 commented 6 years ago

Hi @eklem. Yeah, the result looks okay; you can start crawling the docs. I am removing the English words, and the list is below. "।" is the full stop in Punjabi, so I am removing that as well. I guess we could consider it a separate word, which the stopword-trainer has already picked up.
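Part of this manual cleanup could be automated. The sketch below drops entries that contain Latin-script letters, plus the bare "।"; the function name is hypothetical and the sample values are a few entries from the list above. Digits are left alone here, since whether to keep them is a separate decision.

```javascript
// Hypothetical cleanup of a candidate list: drop entries that contain Latin
// letters, plus the bare danda "।" (U+0964), the Punjabi full stop.
function dropLatinAndDanda(words) {
  const hasLatin = /\p{Script=Latin}/u;
  return words.filter((word) => word !== "\u0964" && !hasLatin.test(word));
}

const candidates = ["ਦਾ", "the", "ਵਿੱਚ", "\u0964", "cctld", "ਅਤੇ", "0"];
console.log(dropLatinAndDanda(candidates));
```

This only removes what a script check can detect. Encyclopedia-specific words that carry meaning still need a human reviewer.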

["ਹੈ","ਦਾ","ਦੇ","ਹੈ","ਵਿੱਚ","ਅਤੇ","ਇੱਕ","ਨੂੰ","ਇਹ","ਲਈ","ਤੋਂ","ਦੀ","ਸਾਲ","ਇਸ","ਹਨ।","ਨਾਲ","ਦਿਨ","ਡੋਮੇਨ","ਜੋ","ਜਾਂਦਾ","ਤੇ","ਹਨ","ਹੁੰਦਾ","ਵਾਂ","ਦੁਆਰਾ","ਕਰਨ","ਵੀ","ਲੈੱਵਲ","ਕਿ","ਜਿਸ","ਪਰ","ਮੁਤਾਬਕ","ਕਲੰਡਰ","ਬਾਕੀ","ਕੋਡ","ਰਜਿਸਟਰੀ","।","ਕੀਤਾ","the","ਵਿਚ","ਦੇਸ਼","ਲੀਪ","ਗ੍ਰੈਗਰੀ","ਜਾਂ","ਦੀਆਂ","ਨੇ","ਨਹੀਂ","ਕਰ","ਉਹ","ਗਿਆ","ਇਕ","ਸਕਦਾ","ਖੇਡ","ਸ਼ੁਰੂ","ਬਾਅਦ","ਇੰਟਰਨੈੱਟ","ਟਾੱਪ","ਕੁਝ","ਉਸ","ਵਿਸ਼ਵ","and","level","ਹੋਰ","ਸੇਕੰਡ","ਸਦੀ","ਜਾ","ਕਰਦੇ","ਤੱਕ","ਵਰਤਿਆ","ਹੁੰਦੀ","ਹੋ","ਨਾਂ","ਸੀ","ਜਾਂਦੀ","ਕੋਈ","ਸੀ","ਕੀਤੀ","ਗਏ","ਆਈ","ਲੜੀ","ਅੰਦਰ","ਕਰਦੀ","ਆਪਣੇ","ਅੱਖਰ","ਕੰਪਿਊਟਰ","ਭਾਰਤ","ਜਿੱਥੇ","ਸਾਰੇ","ਵੀਂ","ਹੋਇ","ਚਲਾਇਆ","ਸ਼ਾਮਲ","ਪਹਿਲਾਂ","ਇੰਟਰਨੈਟ","ਜਿਆਦਾ","ਯੂਨਾਈਟਡ","ਜਿਵੇਂ","ਤੌਰ","ਕਿਉਂਕਿ","ਜਿਸਨੂੰ","ਖੇਡਣ","ਰਿਹਾ","ਕਾਰਨ","ਅਮਰੀਕਾ","ਏ","ਹੇਂਠ","ਹੇਠ","ਬਹੁਤ","ਕਰਕੇ","ਸ਼ਬਦ","ਮੂਲ","ਚਾਰ","ਦਹਾਕਾ","ਆਉਂਦਾ","ਪੱਧਰ","ਸਰਕਾਰ","ਕੇ","ਹੋਈ","ਜਰਮਨੀ","ਹੋਣ","ਰਜਿਸਟਰੀਆਂ","ਨਾਮ","ਦੋ","ਤਾਂ","ਅਧਾਰ","ਦ੍ਰਿਸ਼ਟੀਕੋਣ","ਕੰਮ","ਖ਼ਤਮ","ਕੇਂਦਰ","ਦੇਣ","ਗਈਆਂ","ਰੱਖੇ","ਥੱਲੇ","ਐਨ","ਸੰਸਥਾਵਾਂ","ਕੀਤੀਆਂ","ਬਣ","ਖੇਤਰਾਂ","ਵਿਸ਼ੇਸ਼ਤਾਵਾਂ","ਮਿਲਦਾ","ਜਦੋਂ","ਬਣਦਾ","ਦੇਸੀ","ਸ਼ਨੀਵਾਰ","ਦੂਜੀ","ਜਾਣੀ","ਟਰੈਕ","ਸਮਝ","ਜਰਮਨ","ਸਿੱਧੀ","ਲਾਈਨ","ਹਰ","ਅਪਰੈਲ","ਦਿੰਦੇ","ਹੋਏ","ਹੁੰਦੇ","ਸੁਧਾਰ","ਬੇਅੰਤ","ਆਫ","ਦੇਖ","ਕਿੱਤਾ","ਰਾਸ਼ਟਰੀ","ਸਕਦੀ","ਜੀ","ਥਾਂ","ਇੱਥੇ","ਤਰਾਂ","ਕਈ","ਅੰਤਰਰਾਸ਼ਟਰੀ","ਅੰਤ","ਰੂਪ","ਮਹੱਤਵਪੂਰਨ","ਲੱਗਿਆ","ਸਥਾਨ","ਸੈੱਟ","ਜਾਣਕਾਰੀ","ਇਸਨੂੰ","ਵੱਡਾ","ਤੱਤ","ਉਲਟ","ਪਹਿਲੀ","ਆਲੋਚਕਾਂ","ਰੂਸੀ","ਫ਼ਰਵਰੀ","ਕੱਤਕ","ਨਵੰਬਰ","ਮੱਘਰ","ਦਸੰਬਰ","ਜੁਲਾਈ","ਜਨਵਰੀ","ਸ਼ਾ","ਨਾ","ਪੋਹ","ਅਗਸਤ","ਅਕਤੂਬਰ","ਉਗਲੀਆਂ","ਮਨੁੱਖੀ","ਮੰਨਿਆ","ਦਸ਼ਮਲਵ","ਪਹਿਲਾ","ਸੰਖਿਆ","ਜਿਸਤ","ਪ੍ਰਕਿਰਤਿਕ","ਦਸ","ਐਤਵਾਰ","ਗਤੀਵਿਧੀਆਂ","ਲੋਕਪ੍ਰਿਯ","l","ਦੌੜੀ","ਈਵੇਂਟ","ਅਥਲੈਟਿਕਸ","ਦੌੜ","ਮੀਟਰ","ਗੀਤ","ਗਾਇਆ","ਕੌਰ","ਰਣਜੀਤ","ਸਦੀਕ","ਮੁਹੰਮਦ","ਨੋਟ","ਅਸਥਿਰਾਂਕਾਂ","ਅਜੋਕੇ","ਸੋਮਾ","ਗਹਿਰੀ","ਗੁਣਾਤਮਿਕ","ਵਜਾਏ","ਵਿਧੀਆਂ","ਅਨੁਮਾਨਾਂ","ਗਿਣਾਤਮਿਕ","ਹੁਣ","ਬਣਾਉਂਦੀ","ਸ਼ੋਧਾਂ","ਜਿਮੇਵਾਰ","ਤੱਥ","ਅਨੰਤ","ਗਿਣਤੀ","ਕਲਰਾਂ","ਐਕਸਪੈਂਸ਼ਨ","ਸਕੀਮ","ਸੰਖੇਪਤਾ","ਪਛਾਣੀ","ਚੰਗੀ","ਮਕੈਨਿਕਸ","ਸਟੈਟਿਸਟੀਕਲ","ਥਿਊਰੀ","ਫੀਲਡ","ਕੁਆਂਟਮ","ਖੇਡਦਾ","ਬੁਨ੍ਦੇਸਲੀਗ","ਅਧਾਰਤ","ਸਟੇਡੀਅਮ","ਰਾਈਨਐਨਰਜੀ","ਸਥਿੱਤ","ਵਿਖੇ","ਮਸ਼ਹੂਰ","ਕਲਨ","ਕਲੱਬ","ਫੁੱਟਬਾਲ","ਲੇਬਲ","ਰਿਕਾਰਡ","ਕੁਰਬਾਨ","ਬਹਾਲ","ਨੈਟਵਰਕ","ਪਰਵੇਸ਼","ਰੁਕਾਵਟ","ਕੋਰਬੇਨਿਕ","ਆਤਮਾਵਾਂ","ਘੇਰਾ","ਕੋਰਬੇਨੀਕ"
,"ਫਾਈਨਲ","ਬ੍ਰੇਸਲੇਟ","ਦੂਰ","ਬਰੈਸਲੇਟ","ਯਾਦ","ਹਾਰਾਲਡ","ਸ਼ੈਡੋ","ਸੰਕੇਤ","ਸਲਾਹ","ਹਅਰਵਿਕ","ਹਰਲਡ","ਰਹੇ","ਜੀਅ","ਰਾਹੀਂ","ਅਵਤਾਰਾਂ","ਏਰਿਆ","ਅਗਲਾ","ਤਾਰਵੋਸ","ਓਪਰੇਸ਼ਨ","ਕਰਦੀ।","ਬੰਦ","ਤਾਜ਼ਾ","ਬਣਦੀ","ਮਜ਼ਬੂਤ","ਕੁਬਿਆ","ਦੌਰਾਨ","ਬੇਸਬਰੇ","ਮਾਛਾ","ਸ਼ਰਾਪ","ਮੀਆ","ਮਿਆ","ਮੈਂਬਰ","ਤਲ","ਕਾਪੀ","ਨੈਟ","ਸਮੱਸਿਆ","ਅਸਥਿਰ","ਵਧਦੀ","ਸਰਵਰ","ਵਰਤਮਾਨ","ਵੇਖਦਾ","ਪੂਲ","ਸੰਸਾਧਨਾਂ","ਗੋਰਰੇ","ਸੰਚਾਲਨ","ਸਹਿਮਤ","ਸ਼ਰਾਪਾਂ","ਹੋਰਨਾਂ","ਪਿੰਗਲਾ","ਕਿਊਬੀਆ","ਲਿਓਸ","ਮੀਟਿੰਗ","ਖੁਲਾਸੇ","ਖਰਾਬ","ਨਾਪਾਕ","ਦੁਨੀਆਂ","ਨਤੀਜਾ","ਤਬਾਹ","ਫਿੱਡਚੇਲ","ਮੌਨਸਟਰ","ਯੋਜਨਾ","ਸਰਾਪ","ਵਧਾਉਣ","ਹੈਲਬਾ","ਜ਼ਰੀਏ","ਨਵੇਂ","ਨਿਰਣੇ","ਦੋਨਾਂ","ਸੁਗੰਧਿਤ","ਹਾਲਤਾਂ","ਦੱਸਦਾ","ਖਤਮ","ਖੁਦ","ਅਹਿਸਾਸ","ਫੈਲ","ਸੰਸਾਰ","ਲੱਗਦਾ","ਪਰਤਦੇ","ਟਾਊਨ","ਸ਼ਰਾਰਤੀ","ਮੈਗੁਸ","ਭਟਕਣ","ਫਿਰਦੌਸ","ਜਗ੍ਹਾ","ਮਦਦ","ਸਲੱਮ","ਨੈੱਟ","ਧਿਰ","ਵਿਰੋਧੀ","ਦਿਖਾਈ","ਐਪੀਟੀਫਾ","ਟਕਰਾਓਮ","ਵੇਵ","ਕਵਰਡ","ਇਨਸ","ਸੁਝਾਅ","ਭਟਕ","ਸੰਪਰਕ","ਸੰਖੇਪ","ਲੱਦਣਾਂ","ਦੁਹਰਾਉਂਦੇ","ਮੁਸ਼ਕਲ","ਮਿਲਣ","ਏਆਈ","ਦੱਸਦੀ","ਆਰਾ","ਹਰਾਉਣ","ਸ਼ਕਤੀ","ਸਮਾਨ","ਇਨੀਸ","ਭੇਜਦਾ","ਦਿਵਾਉਂਦਾ","ਯਕੀਨ","ਪਰਖਣ","ਦਖਲ","ਹੇਲਾਬਾ","ਫੇਲ","ਐਨਕ੍ਰਿਪਟ","ਟਾਈਟਲਾਈਟ","ਮਿਟਾਉਣ","ਕਾਈਟ","ਘੋਸ਼ਿਤ","ਗ਼ੈਰਕਾਨੂੰਨੀ","ਬਰੇਸਲੈੱਟ","ਕਥਾ","ਪ੍ਰਸ਼ਾਸਕ","ਐਨਕੌਨਮੈਂਟ","ਬਚਦੇ","ਕਿਊਬਿਆ","ਬਦਲ","ਵੱਡੇ","ਹਰਾਉਂਦੇ","ਸਫੈਥ","ਰੱਖਿਆ।","ਓਰਾਕਾ","ਸਕੈਥ","ਛਾਣਬੀਣ","ਲੀਡਾਂ","ਫੈਸਲਾ","ਸਹਿਯੋਗ","ਕੋਮਾ"]

eklem commented 6 years ago

Cool, here is the stopword-trainer results from 32000 documents:

["ਦੇ","0","ਹੈ।","ਵਿੱਚ","ਦਾ","ਅਤੇ","ਦੀ","ਇੱਕ","ਨੂੰ","ਹੈ","ਤੋਂ","ਇਸ","ਇਹ","ਨੇ","ਤੇ","ਨਾਲ","1","ਲਈ","ਵੀ","ਸੀ।","ਹਨ।","ਸੀ","ਵਿਚ","ਕਿ","ਜੋ","ਉਹ","ਉਸ","ਹਨ","ਜਾਂਦਾ","ਕੀਤਾ","2","ਗਿਆ","ਹੀ","ਕੇ","ਜਾਂ","ਦੀਆਂ","ਜਿਸ","ਕਰਨ","ਹੋ","ਕਰ","ਆਪਣੇ","ਕੀਤੀ","ਤੌਰ","ਬਾਅਦ","ਨਹੀਂ","ਭਾਰਤੀ","ਪਿੰਡ","3","ਸਿੰਘ","ਉੱਤੇ","ਸਾਲ","।","ਪੰਜਾਬ","ਸਭ","ਭਾਰਤ","ਉਨ੍ਹਾਂ","ਹੁੰਦਾ","ਤੱਕ","ਇਕ","ਹੋਇਆ","ਜਨਮ","ਬਹੁਤ","ਪਰ","ਸਨ।","ਦੁਆਰਾ","ਰੂਪ","4","ਹੋਰ","ਕੰਮ","ਆਪਣੀ","ਤਾਂ","ਸਮੇਂ","ਪੰਜਾਬੀ","ਗਈ","ਦਿੱਤਾ","ਦੋ","ਕਿਸੇ","ਕਈ","ਜਾ","ਵਾਲੇ","ਸ਼ੁਰੂ","5","ਉਸਨੇ","ਗਿਆ।","ਕਿਹਾ","ਹੋਣ","ਲੋਕ","ਜਾਂਦੀ","ਵਿੱਚੋਂ","ਨਾਮ","ਕੀਤਾ।","ਜਦੋਂ","ਪਹਿਲਾਂ","ਕਰਦਾ","ਹੁੰਦੀ","ਹੋਏ","ਸਨ","ਵਜੋਂ","ਰਾਜ","ਕੀਤੀ।","ਮੁੱਖ","ਕਰਦੇ","ਕੁਝ","ਸਾਰੇ","ਹੁੰਦੇ","ਸ਼ਹਿਰ","ਭਾਸ਼ਾ","6","ਹੋਈ","ਅਨੁਸਾਰ","ਸਕਦਾ","ਆਮ","ਵੱਖ","ਕੋਈ","ਵਾਰ","ਗਏ","ਖੇਤਰ","ਜੀ","ਕਾਰਨ","ਕਰਕੇ","ਹੋਇਆ।","ਜਿਵੇਂ","ਜ਼ਿਲ੍ਹੇ","ਲੋਕਾਂ","ਚ","ਸਾਹਿਤ","ਸਦੀ","ਬਾਰੇ","ਜਾਂਦੇ","ਵਾਲਾ","ਜਾਣ","ਪਹਿਲੀ","ਪ੍ਰਾਪਤ","ਰਿਹਾ","ਵਾਲੀ","ਨਾਂ","ਦੌਰਾਨ","ਤਰ੍ਹਾਂ","7","ਯੂਨੀਵਰਸਿਟੀ","ਨਾ","ਏ","ਤਿੰਨ","ਇਨ੍ਹਾਂ","ਗੁਰੂ","ਇਸਨੂੰ","ਇਹਨਾਂ","ਪਿਤਾ","ਲਿਆ","ਸ਼ਾਮਲ","ਸ਼ਬਦ","ਅੰਗਰੇਜ਼ੀ","ਉਸਨੂੰ","ਉਹਨਾਂ","8","ਸਥਿਤ","ਫਿਰ","ਜੀਵਨ","ਸਕੂਲ","ਹੁਣ","ਦਿਨ","ਕੀਤੇ","ਆਦਿ","ਵੱਧ","ਲੈ","ਘਰ","ਵੱਲ","ਦੇਸ਼","ਵਲੋਂ","ਬਣ","ਵੀਂ","ਫਿਲਮ","ਉਮਰ","ਬਲਾਕ","ਰਹੇ","10","ਸਾਹਿਬ","ਕਰਦੀ","ਹਰ","ਪੈਦਾ","ਘੱਟ","9","ਲੇਖਕ","ਹਿੱਸਾ","ਫ਼ਿਲਮ","ਮੌਤ","ਜਿੱਥੇ","ਵੱਡਾ","ਵਿਖੇ","ਆਪਣਾ","ਪਹਿਲਾ","ਵਰਤੋਂ","ਗਈ।","ਆਪ","ਕਰਨਾ","ਵਿਆਹ","ਰਹੀ","ਰਾਹੀਂ","ਦਿੱਤੀ","ਉਸਦੇ","ਪਰਿਵਾਰ","ਆ","20","ਦੂਜੇ","ਅਮਰੀਕਾ","ਮੰਨਿਆ","ਇਸਦੇ","ਈ","ਕਾਲਜ","ਸਰਕਾਰ","ਇੱਥੇ","ਪਾਕਿਸਤਾਨ","ਸ਼ਾਮਿਲ","ਵਿਗਿਆਨ","ਉਸਦੀ","ਪੇਸ਼","ਕਿਉਂਕਿ","ਪਹਿਲੇ","ਧਰਮ","ਦਿੱਤਾ।","ਮਸ਼ਹੂਰ","ਅੰਦਰ","12","ਵਿਚੋਂ","ਜਿਨ੍ਹਾਂ","ਜਾਣਿਆ","ਪਾਣੀ","ਇਲਾਵਾ","ਅਰਥ","ਚਾਰ","ਪ੍ਰਸਿੱਧ","ਨਾਵਲ","ਵੱਡੇ","ਵੱਲੋਂ","ਕਹਾਣੀ","ਵਿਸ਼ਵ","ਮੂਲ","ਅਮਰੀਕੀ","ਸਥਾਨ","ਇਤਿਹਾਸ","11","ਕੁੱਝ","ਵਿਕਾਸ","ਉੱਤਰ","ਸਿੱਖਿਆ","ਹਿੰਦੀ","ਪ੍ਰਮੁੱਖ","ਰਚਨਾ","ਗਏ।","ਬਣਾਇਆ","ਵਿਸ਼ੇਸ਼","15","ਡਾ","ਉੱਪਰ","ਪੱਛਮੀ","ਦੇਣ","ਇਸਦਾ","ਸਕਦੇ","ਰੱਖਿਆ","ਕਵੀ","ਦਿੱਲੀ","ਵੱਡੀ","ਭੂਮਿਕਾ","ਸਮਾਜ","ਕਾਵਿ","ਕੀ","ਕੋਲ","ਦ","ਗੱਲ","ਸੰਸਾਰ","ਭਾਗ","ਆਈ","ਦੱਖਣ","ਅੱਜ","ਸਿੱਖ","ਕਹਿੰਦੇ","ਸੰਗੀਤ","ਕਿਲੋਮੀਟਰ","ਜਿਹਨਾਂ","ਸਭਾ","ਜਿਸਦਾ","ਜਨਵਰੀ","13","ਕਵਿਤਾ","ਮੈਂਬਰ","ਲਿਖਿਆ","ਮਾਂ","ਕਲਾ","ਪੰਜ","ਥਾਂ","ਹੇਠ","ਜਿਆਦਾ","ਵਰਤਿਆ","ਮਾਰਚ","ਡੀ","ਅਕਤੂਬਰ",
"14","19","ਤਕ","16","ਨਾਟਕ","ਬੀ","ਖਾਸ","ਇਸੇ","ਆਧੁਨਿਕ","ਅਗਸਤ","ਤਿਆਰ","ਮਾਤਾ","18","ਬਣਾਉਣ","ਨਵੰਬਰ","ਵਿਅਕਤੀ","ਦੱਖਣੀ","ਦਸੰਬਰ","ਆਫ","ਗੀਤ","ਗਿਣਤੀ","ਕਾਲ","ਖੋਜ","ਸਾਲਾਂ","ਪੂਰੀ","ਸਮਾਂ","ਜ਼ਿਆਦਾ","ਇਸਦੀ","ਸਕਦੀ","ਵਿਚਕਾਰ","ਰਾਜਧਾਨੀ","30","ਉਸਦਾ","ਲਿਆ।","ਜੁਲਾਈ","ਹੋਈ।","ਜੂਨ","ਅਧੀਨ","ਸਥਾਪਨਾ","ਸੇਵਾ","ਭਾਵ","ਵਰਗ","ਛੋਟੇ","ਦਿੰਦਾ","ਸਮਾਜਿਕ","ਹੁੰਦੀਆਂ","ਟੀਮ","ਔਰਤਾਂ","ਅਕਸਰ","ਪ੍ਰਕਾਸ਼ਿਤ","17","ਉਰਦੂ","ਰੰਗ","ਪਾਰਟੀ","ਬਣਾ","ਪ੍ਰਭਾਵ","ਸ਼ੁਰੂਆਤ","ਲਗਭਗ","ਮਈ","ਸਿਰਫ","ਨੇੜੇ","ਜਿਸਨੂੰ","ਹਾਲਾਂਕਿ","ਦੂਰ","ਸਤੰਬਰ","ਕਿਤਾਬ","2011","ਕਦੇ","n","ਉੱਤਰੀ","ਪ੍ਰਕਾਰ","ਇਸਨੇ","ਪ੍ਰਦੇਸ਼","ਅੱਗੇ","ਸੰਯੁਕਤ","ਪੜ੍ਹਾਈ","ਵਧੇਰੇ","ਨਾਲ਼","ਮਨੁੱਖ","000","ਬਾਕੀ","ਪ੍ਰਧਾਨ","ਦੂਜੀ","ਕੁੱਲ","ਆਫ਼","ਅਧਿਐਨ","ਰਾਸ਼ਟਰੀ","ਪੁੱਤਰ","ਅੰਤਰਰਾਸ਼ਟਰੀ","ਧਰਤੀ","ਕੇਂਦਰ","ਦੇਸ਼ਾਂ","ਮੱਧ","ਜ਼ਿਲ੍ਹਾ","ਸਾਰੀਆਂ","ਪੱਧਰ","2012","ਹੋਵੇ",ਜੇ","ਭਾਈ","ਰਹਿਣ","ਪੁਰਸਕਾਰ","ਸਭਿਆਚਾਰ","ਪਤਾ","ਪਾਸੇ","ਨਵੇਂ","ਕੰਪਨੀ","ਬਾਹਰ","ਵੇਲੇ","ਸੰਨ","25","ਪੂਰਬੀ","ਵਿਚਾਰ","e","ਕਾਰਜ","ਪੀ","ਮਹੱਤਵਪੂਰਨ","ਦੁਨੀਆਂ","ਧਾਰਮਿਕ","ਮਨੁੱਖੀ","ਸਮੂਹ","ਅਜਿਹੇ","ਲਾਲ","ਦੂਜਾ","ਭਰਾ","ਸ੍ਰੀ","ਅੰਤ","ਜਾਂਦੀਆਂ","i","ਸ਼ਾਹ","ਰਹਿੰਦੇ","ਮਹਾਨ","ਚੀਨ","ਮੀਟਰ","ਵਰਗੇ","ਨਾਲੋਂ","ਹਾਸਲ","ਕਿਸਮ","ਅਜਿਹਾ","ਬਣਿਆ","ਭਰ","ਛੱਡ","ਲੈਣ","ਹਿੱਸੇ","29","ਟੀ","ਲਿਖੇ","ਮਿਲ","ਮੌਜੂਦ","ਦਿੱਤੇ","ਵਾਸਤੇ","ਰਿਹਾ।","ਵਾਲੀਆਂ","ਵਧੀਆ","ਰੂਸੀ","ਜਾਰੀ","ਸਰਕਾਰੀ","ਡਿਗਰੀ","2014","ਪੱਛਮ","ਲੜਾਈ","ਭਾਸ਼ਾਵਾਂ","ਰਾਜਾ","the","ਜਲੰਧਰ","ਹਿੰਦੂ","ਔਰਤ","ਜੰਗ","ਬਾਬਾ","ਬੱਚਿਆਂ","ਮੰਤਰੀ","ਪਟਿਆਲਾ","ਵਾਂਗ","a","ਆਉਣ","ਭਾਵੇਂ","ਕੇਵਲ","21","ਐਸ","ਪ੍ਰਾਚੀਨ","ਰਹਿੰਦਾ","ਬੋਲੀ","ਅਵਾਰਡ","ਨਗਰ","ਖੇਡਾਂ","ਫਿਲਮਾਂ","ਬੱਚੇ","ਕੌਰ","ਤੋ","ਪ੍ਰਤੀ","ਕੁਆਂਟਮ","ਅਬਾਦੀ","ਪੁਸਤਕ","ਐਮ","ਰਾਮ","ਖੇਤਰਾਂ","ਫਰਵਰੀ","ਕ੍ਰਿਕਟ","ਪੈਂਦਾ","ਇਤਿਹਾਸਕ","ਲੱਗ","ਬ੍ਰਿਟਿਸ਼","ਆਇਆ","ਮਿਲਦਾ"]

eklem commented 6 years ago

Also, should we leave the numbers 0-9 in the list?

manmeet3591 commented 6 years ago

Sorry, I was out for a day. The output from the stopword-trainer looks good, and we should keep 0-9 in the list.

eklem commented 6 years ago

No stress :smile: I'll create a test and add it to the library.
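A test of this kind could look roughly like the sketch below. It uses a local stand-in filter rather than the library itself, so `removeStopwords` here is an assumption for illustration, and the stopword list is just a handful of entries from the top of the trained list.

```javascript
// Hypothetical stand-in for a stopword filter; not the library's own code.
function removeStopwords(tokens, stopwords) {
  const stop = new Set(stopwords);
  return tokens.filter((token) => !stop.has(token));
}

// A few entries from the top of the trained list in this thread.
const gurmukhiStopwords = ["ਦੇ", "ਵਿੱਚ", "ਦਾ", "ਅਤੇ", "ਹੈ"];

// Made-up sample sentence, split on spaces.
const tokens = "ਪੰਜਾਬ ਭਾਰਤ ਦੇ ਉੱਤਰ ਵਿੱਚ ਹੈ".split(" ");
const result = removeStopwords(tokens, gurmukhiStopwords);

// Three of the six tokens are stopwords, so three should remain.
console.assert(result.length === 3, "expected 3 tokens to remain");
console.log(result);
```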

eklem commented 6 years ago

Now it's published as v.0.1.13. I removed the words that had "।" attached at the end, after checking that the same words without it were already in the stopword list. To make the list even a little better, I could strip every "।" from the end of words before counting, so the calculation is fully correct.
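Folding the "।" variants into their base words before ranking might look like the sketch below. The function name and the counts are made up; only the idea of stripping the trailing "।" comes from the comment above.

```javascript
// Hypothetical post-processing: fold "word।" into "word" and re-rank.
function mergeDandaVariants(counts) {
  const merged = new Map();
  for (const [word, count] of Object.entries(counts)) {
    const bare = word.replace(/\u0964+$/u, ""); // strip trailing danda(s)
    if (bare === "") continue;                  // the danda on its own is not a word
    merged.set(bare, (merged.get(bare) || 0) + count);
  }
  // Highest combined count first.
  return [...merged.entries()].sort((a, b) => b[1] - a[1]);
}

// Made-up counts, for illustration only.
const counts = { "ਹੈ\u0964": 50, "ਦੇ": 60, "ਹੈ": 30, "\u0964": 5 };
console.log(mergeDandaVariants(counts));
```

In this made-up data the merged word overtakes one that ranked above either variant alone, which is the kind of correction the comment refers to.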

Also removed some a-z letters. Some were English single-character words, and some were left over from crawling text: stuff like "new line" etc.

manmeet3591 commented 6 years ago

That's great, @eklem

eklem commented 6 years ago

Thanks for your work, @manmeet3591 ! And let me know if you have any issues with the stopword-list.