Closed (eklem closed this 6 years ago)
Use the new stopword-trainer. Still needs help from someone with good Punjabi knowledge to verify the result.
Check if wikipedia.org has articles in Punjabi and test whether wikiminer can help get the content.
So, Punjabi is written in two different alphabets: Gurmukhi and Shahmukhi.
They each have their own Wikipedia site, so it'll be possible to use the wikipedia-stopword-crawler to create a stopword list for each:
Let me know if you need this. It will happen, but it will happen sooner with someone needing it 😃
Hi Eklem
I know the Gurmukhi script and the Punjabi language as a native speaker, and I work at the intersection of deep learning and atmospheric science. I came here from the article https://medium.com/norch/input-text-ouput-stopwords-bf40f4f22900. Please let me know how I can contribute to the code and the language. I would be happy to be a part of both.
Regards Manmeet
Cool, @manmeet3591. The most important thing is to verify that the outcome is words of little meaning. And of course that somebody needs this. I'll start with crawling some Wikipedia articles.
That's awesome, @eklem. I understand that the outcome of words needs to be meaningful and I would give time to it. Additionally, it would be of great use to many people. Let me know if we can do more on the code side as well.
Nice, I'll just verify my process tonight with a Norwegian dataset to see that everything still works as intended. The stopword-trainer is at it as we speak. Then I'll start tomorrow on tweaking my wikipedia-crawlers to match the Gurmukhi Wikipedia site and possibly start crawling Wikipedia article URLs. When that's done, I need to crawl the pages and run the stopword-trainer.
And when that's done, I need your help to verify 😄
Perfect :)
The "next page" check can be done on "ਅਗਲਾ ਸਫ਼ਾ"
Yes it can be done on "ਅਗਲਾ ਸਫ਼ਾ"
So, a little status. I have some problems matching "Next page" (in Gurmukhi) with what I find in the HTML. I'll get to that tomorrow.
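The thread does not say what caused the mismatch. One common cause when comparing Gurmukhi strings is that the same visible text can be encoded in more than one way (for example ਫ਼ as a single code point or as ਫ plus a nukta), or that the HTML uses a no-break space instead of a plain space. A minimal Python sketch of a comparison that tolerates both, where `normalize_label` and `same_label` are hypothetical helper names and not part of the crawler:

```python
import unicodedata


def normalize_label(text):
    """Return text in NFC form with every whitespace run collapsed to one space."""
    # NFC makes canonically equivalent code point sequences identical.
    normalized = unicodedata.normalize("NFC", text)
    # str.split() without arguments also splits on no-break spaces.
    return " ".join(normalized.split())


def same_label(a, b):
    """Compare two labels after normalization."""
    return normalize_label(a) == normalize_label(b)


# "Next page" with a precomposed FA (U+0A5E) and a plain space ...
precomposed = "ਅਗਲਾ ਸ\u0a5e\u0a3e"
# ... and with PHA + nukta (U+0A2B U+0A3C) and a no-break space.
decomposed = "ਅਗਲਾ\u00a0ਸ\u0a2b\u0a3c\u0a3e"

print(precomposed == decomposed)
print(same_label(precomposed, decomposed))
```

If the raw comparison fails but the normalized one succeeds, the encoding of the label is the likely culprit; if both fail, the problem lies elsewhere (for instance in the HTML structure).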
Before trying to crawl, I managed to weed out a bug in the stopword-trainer so it now tackles more than a-z letters, and hopefully Gurmukhi. I've added tests for Norwegian language.
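To illustrate why an a-z-only approach breaks on Gurmukhi (and on Norwegian letters like å), here is a rough Python sketch of a tokenizer that treats any Unicode letter, combining mark, or digit as part of a word. This is not the stopword-trainer's actual code; `is_word_char` and `tokenize` are names made up for the example.

```python
import unicodedata


def is_word_char(ch):
    """True for letters (L*), combining marks (M*) and numbers (N*)."""
    # Gurmukhi vowel signs are combining marks, so they must count as
    # word characters or every word would be cut into pieces.
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text):
    """Split text into lowercase tokens on anything that is not a word character."""
    tokens, current = [], []
    for ch in text.lower():
        if is_word_char(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(tokenize("ਇਹ ਪਿੰਡ ਪੰਜਾਬ ਵਿੱਚ ਹੈ।"))
print(tokenize("En blå bil."))
```

Because the danda (।) is punctuation in Unicode, this approach also separates it from the word it follows.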
So, tomorrow: fix the "next page" issue and crawl URLs.
Good going @eklem
Got the URL crawler going, so now I have a little over 30000 URLs to crawl later. Will look into that tomorrow.
Is the URL crawler still running?
It's finished and looking good. I was sick yesterday, so didn't get anything done then. The crawler for content just needs a small tweak I think, and I'll hopefully get that going today.
Here's the output from the stopword-trainer after crawling 30 docs (I have URLs for approximately 30000 docs) and selecting the 500 most used words:
["ਹੈ।","ਦਾ","ਦੇ","ਹੈ","ਵਿੱਚ","ਅਤੇ","ਇੱਕ","ਨੂੰ","ਇਹ","ਲਈ","ਤੋਂ","ਦੀ","ਸਾਲ","ਇਸ","ਹਨ।","ਨਾਲ","ਦਿਨ","ਡੋਮੇਨ","ਜੋ","ਜਾਂਦਾ","ਤੇ","ਹਨ","0","ਹੁੰਦਾ","ਵਾਂ","ਦੁਆਰਾ","ਕਰਨ","1","ਵੀ","ਲੈੱਵਲ","ਕਿ","ਜਿਸ","ਪਰ","ਮੁਤਾਬਕ","ਕਲੰਡਰ","ਬਾਕੀ","ਕੋਡ","ਰਜਿਸਟਰੀ","।","ਕੀਤਾ","the","in","ਵਿਚ","ਦੇਸ਼","ਲੀਪ","ਗ੍ਰੈਗਰੀ","010","ਜਾਂ","ਦੀਆਂ","ਨੇ","ਨਹੀਂ","ਕਰ","ਉਹ","ਗਿਆ","ਇਕ","ਸਕਦਾ","ਖੇਡ","2","ਸ਼ੁਰੂ","of","ਬਾਅਦ","ਇੰਟਰਨੈੱਟ","ਟਾੱਪ","ਕੁਝ","ਉਸ","ਵਿਸ਼ਵ","and","to","level","ਹੋਰ","ਸੇਕੰਡ","ਸਦੀ","11","ਜਾ","ਕਰਦੇ","ਤੱਕ","ਵਰਤਿਆ","ਹੁੰਦੀ","ਹੋ","ਨਾਂ","top","ਸੀ।","ਜਾਂਦੀ","cctld","ਕੋਈ","ਸੀ","ਕੀਤੀ","a","or","is","ਗਏ","ਆਈ","domain","ਲੜੀ","ਅੰਦਰ","ਕਰਦੀ","ਆਪਣੇ","ਅੱਖਰ","ਕੰਪਿਊਟਰ","ਭਾਰਤ","ਜਿੱਥੇ","ਸਾਰੇ","ਵੀਂ","ਹੋਇਆ।","ਚਲਾਇਆ","ਸ਼ਾਮਲ","ਪਹਿਲਾਂ","ਇੰਟਰਨੈਟ","ਜਿਆਦਾ","ਯੂਨਾਈਟਡ","nic","ਜਿਵੇਂ","ਤੌਰ","ਕਿਉਂਕਿ","ਜਿਸਨੂੰ","ਖੇਡਣ","ਰਿਹਾ","ਕਾਰਨ","ਅਮਰੀਕਾ","ਏ","ਹੇਂਠ","ਹੇਠ","as","ਬਹੁਤ","ਕਰਕੇ","ਸ਼ਬਦ","ਮੂਲ","ਚਾਰ","ਦਹਾਕਾ","26","ਆਉਂਦਾ","ਪੱਧਰ","ਸਰਕਾਰ","ਕੇ","ਹੋਈ","ਜਰਮਨੀ","ਹੋਣ","ਰਜਿਸਟਰੀਆਂ","ਨਾਮ","ਦੋ","ਤਾਂ","ਅਧਾਰ","10","ਦ੍ਰਿਸ਼ਟੀਕੋਣ","ਕੰਮ","ਖ਼ਤਮ","ਕੇਂਦਰ","ਦੇਣ","ਗਈਆਂ","national","ਰੱਖੇ","ca","ਥੱਲੇ","ਐਨ","ਸੰਸਥਾਵਾਂ","ਕੀਤੀਆਂ","space","state","this","which","ਬਣ","ਖੇਤਰਾਂ","ਵਿਸ਼ੇਸ਼ਤਾਵਾਂ","ਮਿਲਦਾ","ਜਦੋਂ","ਬਣਦਾ","ਦੇਸੀ","ਸ਼ਨੀਵਾਰ","ਦੂਜੀ","0100","21","ਜਾਣੀ","ਟਰੈਕ","ਸਮਝ","ਜਰਮਨ","9","4","3","ਸਿੱਧੀ","ਲਾਈਨ","ਹਰ","ਅਪਰੈਲ","ਦਿੰਦੇ","ਹੋਏ","ਹੁੰਦੇ","ਸੁਧਾਰ","ਬੇਅੰਤ","ਆਫ","ਦੇਖ","ਕਿੱਤਾ","1985","ਰਾਸ਼ਟਰੀ","ਸਕਦੀ","ਜੀ","ਥਾਂ","ਇੱਥੇ","ਤਰਾਂ","ਕਈ","ਅੰਤਰਰਾਸ਼ਟਰੀ","names","many","city","be","was","original","other","from","ਅੰਤ","ਰੂਪ","ਮਹੱਤਵਪੂਰਨ","ਲੱਗਿਆ","ਸਥਾਨ","ਸੈੱਟ","ਜਾਣਕਾਰੀ","ਇਸਨੂੰ","ਵੱਡਾ","ਤੱਤ","ਉਲਟ","ਪਹਿਲੀ","ਆਲੋਚਕਾਂ","ਰੂਸੀ","235","131","130","ਮਈ","325","324","41","ਫ਼ਰਵਰੀ","ਕੱਤਕ","51","315","314","ਨਵੰਬਰ","ਮੱਘਰ","345","344","ਦਸੰਬਰ","174","192","191","ਜੁਲਾਈ","356","355","ਜਨਵਰੀ","ਸ਼ਾ","ਨਾ","ਪੋਹ","026","265","101","100","143","223","222","ਅਗਸਤ","82","284","283","ਅਕਤੂਬਰ","ਉਗਲੀਆਂ","ਮਨੁੱਖੀ","ਮੰਨਿਆ","ਦਸ਼ਮਲਵ","ਪਹਿਲਾ","ਸੰਖਿਆ","ਜਿਸਤ","ਪ੍ਰਕਿਰਤਿਕ","ਦਸ","1090","09","01099","06","01066","1060","01060","0106","0102","ਐਤਵਾਰ","1010","01010","ਗਤੀਵਿਧੀਆਂ","ਲੋਕਪ੍ਰਿਯ","l","ਦੌੜੀ","ਈਵੇਂਟ","ਅਥਲੈਟਿਕਸ","ਦੌੜ","ਮੀਟਰ","ਗੀਤ","ਗਾਇਆ"
,"ਕੌਰ","ਰਣਜੀਤ","ਸਦੀਕ","ਮੁਹੰਮਦ","ਨੋਟ","cft","ads","ਅਸਥਿਰਾਂਕਾਂ","ਅਜੋਕੇ","ਸੋਮਾ","ਗਹਿਰੀ","ਗੁਣਾਤਮਿਕ","ਵਜਾਏ","ਵਿਧੀਆਂ","ਅਨੁਮਾਨਾਂ","ਗਿਣਾਤਮਿਕ","ਹੁਣ","ਬਣਾਉਂਦੀ","ਸ਼ੋਧਾਂ","ਜਿਮੇਵਾਰ","ਤੱਥ","ਅਨੰਤ","ਗਿਣਤੀ","ਕਲਰਾਂ","ਐਕਸਪੈਂਸ਼ਨ","n","ਸਕੀਮ","ਸੰਖੇਪਤਾ","ਪਛਾਣੀ","ਚੰਗੀ","ਮਕੈਨਿਕਸ","ਸਟੈਟਿਸਟੀਕਲ","ਥਿਊਰੀ","ਫੀਲਡ","ਕੁਆਂਟਮ","ਖੇਡਦਾ","ਬੁਨ੍ਦੇਸਲੀਗ","ਅਧਾਰਤ","ਸਟੇਡੀਅਮ","ਰਾਈਨਐਨਰਜੀ","ਸਥਿੱਤ","ਵਿਖੇ","ਮਸ਼ਹੂਰ","ਕਲਨ","ਕਲੱਬ","ਫੁੱਟਬਾਲ","01","ਲੇਬਲ","ਰਿਕਾਰਡ","71","disc","ਕੁਰਬਾਨ","ਬਹਾਲ","ਨੈਟਵਰਕ","ਪਰਵੇਸ਼","ਰੁਕਾਵਟ","ਕੋਰਬੇਨਿਕ","ਆਤਮਾਵਾਂ","ਘੇਰਾ","ਕੋਰਬੇਨੀਕ","ਫਾਈਨਲ","ਬ੍ਰੇਸਲੇਟ","ਦੂਰ","ਬਰੈਸਲੇਟ","ਯਾਦ","ਹਾਰਾਲਡ","ਸ਼ੈਡੋ","ਸੰਕੇਤ","ਸਲਾਹ","ਹਅਰਵਿਕ","ਹਰਲਡ","ਰਹੇ","ਜੀਅ","ਰਾਹੀਂ","ਅਵਤਾਰਾਂ","ਏਰਿਆ","ਅਗਲਾ","ਤਾਰਵੋਸ","ਓਪਰੇਸ਼ਨ","ਕਰਦੀ।","ਬੰਦ","ਤਾਜ਼ਾ","ਬਣਦੀ","ਮਜ਼ਬੂਤ","ਕੁਬਿਆ","ਦੌਰਾਨ","ਬੇਸਬਰੇ","ਮਾਛਾ","ਸ਼ਰਾਪ","ਮੀਆ","ਮਿਆ","ਮੈਂਬਰ","ਤਲ","ਕਾਪੀ","ਨੈਟ","ਸਮੱਸਿਆ","ਅਸਥਿਰ","ਵਧਦੀ","ਸਰਵਰ","ਵਰਤਮਾਨ","ਵੇਖਦਾ","27","ਪੂਲ","ਸੰਸਾਧਨਾਂ","ਗੋਰਰੇ","ਸੰਚਾਲਨ","ਸਹਿਮਤ","ਸ਼ਰਾਪਾਂ","ਹੋਰਨਾਂ","ਪਿੰਗਲਾ","ਕਿਊਬੀਆ","ਲਿਓਸ","ਮੀਟਿੰਗ","ਖੁਲਾਸੇ","25","ਖਰਾਬ","ਨਾਪਾਕ","ਦੁਨੀਆਂ","ਨਤੀਜਾ","ਤਬਾਹ","ਫਿੱਡਚੇਲ","ਮੌਨਸਟਰ","ਯੋਜਨਾ","ਸਰਾਪ","ਵਧਾਉਣ","ਹੈਲਬਾ","24","ਜ਼ਰੀਏ","ਨਵੇਂ","ਨਿਰਣੇ","ਦੋਨਾਂ","ਸੁਗੰਧਿਤ","ਹਾਲਤਾਂ","ਦੱਸਦਾ","23","ਖਤਮ","ਖੁਦ","ਅਹਿਸਾਸ","22","ਫੈਲ","ਸੰਸਾਰ","ਲੱਗਦਾ","ਪਰਤਦੇ","ਟਾਊਨ","ਸ਼ਰਾਰਤੀ","ਮੈਗੁਸ","ਭਟਕਣ","ਫਿਰਦੌਸ","ਜਗ੍ਹਾ","ਮਦਦ","ਸਲੱਮ","ਨੈੱਟ","ਧਿਰ","ਵਿਰੋਧੀ","ਦਿਖਾਈ","ਐਪੀਟੀਫਾ","ਟਕਰਾਓਮ","ਵੇਵ","ਕਵਰਡ","ਇਨਸ","ਸੁਝਾਅ","ਭਟਕ","ਸੰਪਰਕ","ਸੰਖੇਪ","ਲੱਦਣਾਂ","ਦੁਹਰਾਉਂਦੇ","ਮੁਸ਼ਕਲ","ਮਿਲਣ","ਏਆਈ","ਦੱਸਦੀ","ਆਰਾ","ਹਰਾਉਣ","ਸ਼ਕਤੀ","ਸਮਾਨ","ਇਨੀਸ","ਭੇਜਦਾ","19","ਦਿਵਾਉਂਦਾ","ਯਕੀਨ","ਪਰਖਣ","ਦਖਲ","ਹੇਲਾਬਾ","ਫੇਲ","ਐਨਕ੍ਰਿਪਟ","ਟਾਈਟਲਾਈਟ","ਮਿਟਾਉਣ","ਕਾਈਟ","18","ਘੋਸ਼ਿਤ","ਗ਼ੈਰਕਾਨੂੰਨੀ","ਬਰੇਸਲੈੱਟ","ਕਥਾ","ਪ੍ਰਸ਼ਾਸਕ","ਐਨਕੌਨਮੈਂਟ","17","ਬਚਦੇ","ਕਿਊਬਿਆ","ਬਦਲ","ਵੱਡੇ","ਹਰਾਉਂਦੇ","ਸਫੈਥ","ਰੱਖਿਆ।","ਓਰਾਕਾ","ਸਕੈਥ","ਛਾਣਬੀਣ","ਲੀਡਾਂ","ਫੈਸਲਾ","ਸਹਿਯੋਗ","ਕੋਮਾ"]
As you see, there are quite a few English words in there. We'll remove them manually. Also, because Wikipedia is an encyclopedia, there will be frequently used words that actually bear meaning, and thus are not stopwords. We'll remove them manually too.
500 words is a lot; we can maybe make a list that is 200 or 250 words long. Since it's sorted by frequency of use, it can be sliced from the bottom to be less aggressive.
If the result looks not all that bad, I'll start the process of crawling docs this evening.
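As a sketch of the "sorted by frequency, slice from the bottom" idea: the Python below ranks tokens by plain occurrence counts and then keeps only the head of the ranking. Plain counts are an assumption standing in for whatever the stopword-trainer actually computes, and `top_words` is a hypothetical helper name.

```python
from collections import Counter


def top_words(tokens, limit=500):
    """Return up to `limit` distinct tokens, most frequent first."""
    return [word for word, _ in Counter(tokens).most_common(limit)]


# Tiny made-up token stream, only to show the mechanics.
tokens = ["ਦੇ", "ਹੈ", "ਦੇ", "ਪਿੰਡ", "ਹੈ", "ਦੇ"]

ranked = top_words(tokens)
# Slicing from the bottom: keep only the most frequent entries.
shorter = ranked[:2]

print(ranked)
print(shorter)
```

With the real output, `ranked[:200]` or `ranked[:250]` would give the shorter, less aggressive list discussed above.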
Hi @eklem. Yeah, the result looks okay; you can start the process of crawling the docs. I am removing the English words, and the list is below. "।" is the full stop in Punjabi, so I am removing that as well. I guess we can consider it a separate word, which the stopword-trainer has already picked up.
["ਹੈ","ਦਾ","ਦੇ","ਹੈ","ਵਿੱਚ","ਅਤੇ","ਇੱਕ","ਨੂੰ","ਇਹ","ਲਈ","ਤੋਂ","ਦੀ","ਸਾਲ","ਇਸ","ਹਨ।","ਨਾਲ","ਦਿਨ","ਡੋਮੇਨ","ਜੋ","ਜਾਂਦਾ","ਤੇ","ਹਨ","ਹੁੰਦਾ","ਵਾਂ","ਦੁਆਰਾ","ਕਰਨ","ਵੀ","ਲੈੱਵਲ","ਕਿ","ਜਿਸ","ਪਰ","ਮੁਤਾਬਕ","ਕਲੰਡਰ","ਬਾਕੀ","ਕੋਡ","ਰਜਿਸਟਰੀ","।","ਕੀਤਾ","the","ਵਿਚ","ਦੇਸ਼","ਲੀਪ","ਗ੍ਰੈਗਰੀ","ਜਾਂ","ਦੀਆਂ","ਨੇ","ਨਹੀਂ","ਕਰ","ਉਹ","ਗਿਆ","ਇਕ","ਸਕਦਾ","ਖੇਡ","ਸ਼ੁਰੂ","ਬਾਅਦ","ਇੰਟਰਨੈੱਟ","ਟਾੱਪ","ਕੁਝ","ਉਸ","ਵਿਸ਼ਵ","and","level","ਹੋਰ","ਸੇਕੰਡ","ਸਦੀ","ਜਾ","ਕਰਦੇ","ਤੱਕ","ਵਰਤਿਆ","ਹੁੰਦੀ","ਹੋ","ਨਾਂ","ਸੀ","ਜਾਂਦੀ","ਕੋਈ","ਸੀ","ਕੀਤੀ","ਗਏ","ਆਈ","ਲੜੀ","ਅੰਦਰ","ਕਰਦੀ","ਆਪਣੇ","ਅੱਖਰ","ਕੰਪਿਊਟਰ","ਭਾਰਤ","ਜਿੱਥੇ","ਸਾਰੇ","ਵੀਂ","ਹੋਇ","ਚਲਾਇਆ","ਸ਼ਾਮਲ","ਪਹਿਲਾਂ","ਇੰਟਰਨੈਟ","ਜਿਆਦਾ","ਯੂਨਾਈਟਡ","ਜਿਵੇਂ","ਤੌਰ","ਕਿਉਂਕਿ","ਜਿਸਨੂੰ","ਖੇਡਣ","ਰਿਹਾ","ਕਾਰਨ","ਅਮਰੀਕਾ","ਏ","ਹੇਂਠ","ਹੇਠ","ਬਹੁਤ","ਕਰਕੇ","ਸ਼ਬਦ","ਮੂਲ","ਚਾਰ","ਦਹਾਕਾ","ਆਉਂਦਾ","ਪੱਧਰ","ਸਰਕਾਰ","ਕੇ","ਹੋਈ","ਜਰਮਨੀ","ਹੋਣ","ਰਜਿਸਟਰੀਆਂ","ਨਾਮ","ਦੋ","ਤਾਂ","ਅਧਾਰ","ਦ੍ਰਿਸ਼ਟੀਕੋਣ","ਕੰਮ","ਖ਼ਤਮ","ਕੇਂਦਰ","ਦੇਣ","ਗਈਆਂ","ਰੱਖੇ","ਥੱਲੇ","ਐਨ","ਸੰਸਥਾਵਾਂ","ਕੀਤੀਆਂ","ਬਣ","ਖੇਤਰਾਂ","ਵਿਸ਼ੇਸ਼ਤਾਵਾਂ","ਮਿਲਦਾ","ਜਦੋਂ","ਬਣਦਾ","ਦੇਸੀ","ਸ਼ਨੀਵਾਰ","ਦੂਜੀ","ਜਾਣੀ","ਟਰੈਕ","ਸਮਝ","ਜਰਮਨ","ਸਿੱਧੀ","ਲਾਈਨ","ਹਰ","ਅਪਰੈਲ","ਦਿੰਦੇ","ਹੋਏ","ਹੁੰਦੇ","ਸੁਧਾਰ","ਬੇਅੰਤ","ਆਫ","ਦੇਖ","ਕਿੱਤਾ","ਰਾਸ਼ਟਰੀ","ਸਕਦੀ","ਜੀ","ਥਾਂ","ਇੱਥੇ","ਤਰਾਂ","ਕਈ","ਅੰਤਰਰਾਸ਼ਟਰੀ","ਅੰਤ","ਰੂਪ","ਮਹੱਤਵਪੂਰਨ","ਲੱਗਿਆ","ਸਥਾਨ","ਸੈੱਟ","ਜਾਣਕਾਰੀ","ਇਸਨੂੰ","ਵੱਡਾ","ਤੱਤ","ਉਲਟ","ਪਹਿਲੀ","ਆਲੋਚਕਾਂ","ਰੂਸੀ","ਫ਼ਰਵਰੀ","ਕੱਤਕ","ਨਵੰਬਰ","ਮੱਘਰ","ਦਸੰਬਰ","ਜੁਲਾਈ","ਜਨਵਰੀ","ਸ਼ਾ","ਨਾ","ਪੋਹ","ਅਗਸਤ","ਅਕਤੂਬਰ","ਉਗਲੀਆਂ","ਮਨੁੱਖੀ","ਮੰਨਿਆ","ਦਸ਼ਮਲਵ","ਪਹਿਲਾ","ਸੰਖਿਆ","ਜਿਸਤ","ਪ੍ਰਕਿਰਤਿਕ","ਦਸ","ਐਤਵਾਰ","ਗਤੀਵਿਧੀਆਂ","ਲੋਕਪ੍ਰਿਯ","l","ਦੌੜੀ","ਈਵੇਂਟ","ਅਥਲੈਟਿਕਸ","ਦੌੜ","ਮੀਟਰ","ਗੀਤ","ਗਾਇਆ","ਕੌਰ","ਰਣਜੀਤ","ਸਦੀਕ","ਮੁਹੰਮਦ","ਨੋਟ","ਅਸਥਿਰਾਂਕਾਂ","ਅਜੋਕੇ","ਸੋਮਾ","ਗਹਿਰੀ","ਗੁਣਾਤਮਿਕ","ਵਜਾਏ","ਵਿਧੀਆਂ","ਅਨੁਮਾਨਾਂ","ਗਿਣਾਤਮਿਕ","ਹੁਣ","ਬਣਾਉਂਦੀ","ਸ਼ੋਧਾਂ","ਜਿਮੇਵਾਰ","ਤੱਥ","ਅਨੰਤ","ਗਿਣਤੀ","ਕਲਰਾਂ","ਐਕਸਪੈਂਸ਼ਨ","ਸਕੀਮ","ਸੰਖੇਪਤਾ","ਪਛਾਣੀ","ਚੰਗੀ","ਮਕੈਨਿਕਸ","ਸਟੈਟਿਸਟੀਕਲ","ਥਿਊਰੀ","ਫੀਲਡ","ਕੁਆਂਟਮ","ਖੇਡਦਾ","ਬੁਨ੍ਦੇਸਲੀਗ","ਅਧਾਰਤ","ਸਟੇਡੀਅਮ","ਰਾਈਨਐਨਰਜੀ","ਸਥਿੱਤ","ਵਿਖੇ","ਮਸ਼ਹੂਰ","ਕਲਨ","ਕਲੱਬ","ਫੁੱਟਬਾਲ","ਲੇਬਲ","ਰਿਕਾਰਡ","ਕੁਰਬਾਨ","ਬਹਾਲ","ਨੈਟਵਰਕ","ਪਰਵੇਸ਼","ਰੁਕਾਵਟ","ਕੋਰਬੇਨਿਕ","ਆਤਮਾਵਾਂ","ਘੇਰਾ","ਕੋਰਬੇਨੀਕ"
,"ਫਾਈਨਲ","ਬ੍ਰੇਸਲੇਟ","ਦੂਰ","ਬਰੈਸਲੇਟ","ਯਾਦ","ਹਾਰਾਲਡ","ਸ਼ੈਡੋ","ਸੰਕੇਤ","ਸਲਾਹ","ਹਅਰਵਿਕ","ਹਰਲਡ","ਰਹੇ","ਜੀਅ","ਰਾਹੀਂ","ਅਵਤਾਰਾਂ","ਏਰਿਆ","ਅਗਲਾ","ਤਾਰਵੋਸ","ਓਪਰੇਸ਼ਨ","ਕਰਦੀ।","ਬੰਦ","ਤਾਜ਼ਾ","ਬਣਦੀ","ਮਜ਼ਬੂਤ","ਕੁਬਿਆ","ਦੌਰਾਨ","ਬੇਸਬਰੇ","ਮਾਛਾ","ਸ਼ਰਾਪ","ਮੀਆ","ਮਿਆ","ਮੈਂਬਰ","ਤਲ","ਕਾਪੀ","ਨੈਟ","ਸਮੱਸਿਆ","ਅਸਥਿਰ","ਵਧਦੀ","ਸਰਵਰ","ਵਰਤਮਾਨ","ਵੇਖਦਾ","ਪੂਲ","ਸੰਸਾਧਨਾਂ","ਗੋਰਰੇ","ਸੰਚਾਲਨ","ਸਹਿਮਤ","ਸ਼ਰਾਪਾਂ","ਹੋਰਨਾਂ","ਪਿੰਗਲਾ","ਕਿਊਬੀਆ","ਲਿਓਸ","ਮੀਟਿੰਗ","ਖੁਲਾਸੇ","ਖਰਾਬ","ਨਾਪਾਕ","ਦੁਨੀਆਂ","ਨਤੀਜਾ","ਤਬਾਹ","ਫਿੱਡਚੇਲ","ਮੌਨਸਟਰ","ਯੋਜਨਾ","ਸਰਾਪ","ਵਧਾਉਣ","ਹੈਲਬਾ","ਜ਼ਰੀਏ","ਨਵੇਂ","ਨਿਰਣੇ","ਦੋਨਾਂ","ਸੁਗੰਧਿਤ","ਹਾਲਤਾਂ","ਦੱਸਦਾ","ਖਤਮ","ਖੁਦ","ਅਹਿਸਾਸ","ਫੈਲ","ਸੰਸਾਰ","ਲੱਗਦਾ","ਪਰਤਦੇ","ਟਾਊਨ","ਸ਼ਰਾਰਤੀ","ਮੈਗੁਸ","ਭਟਕਣ","ਫਿਰਦੌਸ","ਜਗ੍ਹਾ","ਮਦਦ","ਸਲੱਮ","ਨੈੱਟ","ਧਿਰ","ਵਿਰੋਧੀ","ਦਿਖਾਈ","ਐਪੀਟੀਫਾ","ਟਕਰਾਓਮ","ਵੇਵ","ਕਵਰਡ","ਇਨਸ","ਸੁਝਾਅ","ਭਟਕ","ਸੰਪਰਕ","ਸੰਖੇਪ","ਲੱਦਣਾਂ","ਦੁਹਰਾਉਂਦੇ","ਮੁਸ਼ਕਲ","ਮਿਲਣ","ਏਆਈ","ਦੱਸਦੀ","ਆਰਾ","ਹਰਾਉਣ","ਸ਼ਕਤੀ","ਸਮਾਨ","ਇਨੀਸ","ਭੇਜਦਾ","ਦਿਵਾਉਂਦਾ","ਯਕੀਨ","ਪਰਖਣ","ਦਖਲ","ਹੇਲਾਬਾ","ਫੇਲ","ਐਨਕ੍ਰਿਪਟ","ਟਾਈਟਲਾਈਟ","ਮਿਟਾਉਣ","ਕਾਈਟ","ਘੋਸ਼ਿਤ","ਗ਼ੈਰਕਾਨੂੰਨੀ","ਬਰੇਸਲੈੱਟ","ਕਥਾ","ਪ੍ਰਸ਼ਾਸਕ","ਐਨਕੌਨਮੈਂਟ","ਬਚਦੇ","ਕਿਊਬਿਆ","ਬਦਲ","ਵੱਡੇ","ਹਰਾਉਂਦੇ","ਸਫੈਥ","ਰੱਖਿਆ।","ਓਰਾਕਾ","ਸਕੈਥ","ਛਾਣਬੀਣ","ਲੀਡਾਂ","ਫੈਸਲਾ","ਸਹਿਯੋਗ","ਕੋਮਾ"]
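The manual clean-up described above (dropping English words and the standalone "।") could be partly automated. The following Python sketch is one possible way to do it, not something the thread's authors used; `drop_noise` and its `keep_digits` flag are invented for the example, and words that merely carry meaning still need a native speaker's review.

```python
DANDA = "\u0964"  # "।", the Punjabi full stop


def is_ascii_word(token):
    """True for tokens made only of ASCII letters, e.g. 'the' or 'cctld'."""
    return token.isascii() and token.isalpha()


def drop_noise(words, keep_digits=True):
    """Remove ASCII-letter tokens, the bare danda and, optionally, numbers."""
    kept = []
    for word in words:
        if word == DANDA or is_ascii_word(word):
            continue
        if word.isdigit() and not keep_digits:
            continue
        kept.append(word)
    return kept


sample = ["ਹੈ", "the", "0", "।", "ਦਾ", "cctld"]
print(drop_noise(sample))
print(drop_noise(sample, keep_digits=False))
```

The `keep_digits` switch leaves the question of whether numbers belong in the list as an explicit choice rather than a side effect.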
Cool, here are the stopword-trainer results from 32000 documents:
["ਦੇ","0","ਹੈ।","ਵਿੱਚ","ਦਾ","ਅਤੇ","ਦੀ","ਇੱਕ","ਨੂੰ","ਹੈ","ਤੋਂ","ਇਸ","ਇਹ","ਨੇ","ਤੇ","ਨਾਲ","1","ਲਈ","ਵੀ","ਸੀ।","ਹਨ।","ਸੀ","ਵਿਚ","ਕਿ","ਜੋ","ਉਹ","ਉਸ","ਹਨ","ਜਾਂਦਾ","ਕੀਤਾ","2","ਗਿਆ","ਹੀ","ਕੇ","ਜਾਂ","ਦੀਆਂ","ਜਿਸ","ਕਰਨ","ਹੋ","ਕਰ","ਆਪਣੇ","ਕੀਤੀ","ਤੌਰ","ਬਾਅਦ","ਨਹੀਂ","ਭਾਰਤੀ","ਪਿੰਡ","3","ਸਿੰਘ","ਉੱਤੇ","ਸਾਲ","।","ਪੰਜਾਬ","ਸਭ","ਭਾਰਤ","ਉਨ੍ਹਾਂ","ਹੁੰਦਾ","ਤੱਕ","ਇਕ","ਹੋਇਆ","ਜਨਮ","ਬਹੁਤ","ਪਰ","ਸਨ।","ਦੁਆਰਾ","ਰੂਪ","4","ਹੋਰ","ਕੰਮ","ਆਪਣੀ","ਤਾਂ","ਸਮੇਂ","ਪੰਜਾਬੀ","ਗਈ","ਦਿੱਤਾ","ਦੋ","ਕਿਸੇ","ਕਈ","ਜਾ","ਵਾਲੇ","ਸ਼ੁਰੂ","5","ਉਸਨੇ","ਗਿਆ।","ਕਿਹਾ","ਹੋਣ","ਲੋਕ","ਜਾਂਦੀ","ਵਿੱਚੋਂ","ਨਾਮ","ਕੀਤਾ।","ਜਦੋਂ","ਪਹਿਲਾਂ","ਕਰਦਾ","ਹੁੰਦੀ","ਹੋਏ","ਸਨ","ਵਜੋਂ","ਰਾਜ","ਕੀਤੀ।","ਮੁੱਖ","ਕਰਦੇ","ਕੁਝ","ਸਾਰੇ","ਹੁੰਦੇ","ਸ਼ਹਿਰ","ਭਾਸ਼ਾ","6","ਹੋਈ","ਅਨੁਸਾਰ","ਸਕਦਾ","ਆਮ","ਵੱਖ","ਕੋਈ","ਵਾਰ","ਗਏ","ਖੇਤਰ","ਜੀ","ਕਾਰਨ","ਕਰਕੇ","ਹੋਇਆ।","ਜਿਵੇਂ","ਜ਼ਿਲ੍ਹੇ","ਲੋਕਾਂ","ਚ","ਸਾਹਿਤ","ਸਦੀ","ਬਾਰੇ","ਜਾਂਦੇ","ਵਾਲਾ","ਜਾਣ","ਪਹਿਲੀ","ਪ੍ਰਾਪਤ","ਰਿਹਾ","ਵਾਲੀ","ਨਾਂ","ਦੌਰਾਨ","ਤਰ੍ਹਾਂ","7","ਯੂਨੀਵਰਸਿਟੀ","ਨਾ","ਏ","ਤਿੰਨ","ਇਨ੍ਹਾਂ","ਗੁਰੂ","ਇਸਨੂੰ","ਇਹਨਾਂ","ਪਿਤਾ","ਲਿਆ","ਸ਼ਾਮਲ","ਸ਼ਬਦ","ਅੰਗਰੇਜ਼ੀ","ਉਸਨੂੰ","ਉਹਨਾਂ","8","ਸਥਿਤ","ਫਿਰ","ਜੀਵਨ","ਸਕੂਲ","ਹੁਣ","ਦਿਨ","ਕੀਤੇ","ਆਦਿ","ਵੱਧ","ਲੈ","ਘਰ","ਵੱਲ","ਦੇਸ਼","ਵਲੋਂ","ਬਣ","ਵੀਂ","ਫਿਲਮ","ਉਮਰ","ਬਲਾਕ","ਰਹੇ","10","ਸਾਹਿਬ","ਕਰਦੀ","ਹਰ","ਪੈਦਾ","ਘੱਟ","9","ਲੇਖਕ","ਹਿੱਸਾ","ਫ਼ਿਲਮ","ਮੌਤ","ਜਿੱਥੇ","ਵੱਡਾ","ਵਿਖੇ","ਆਪਣਾ","ਪਹਿਲਾ","ਵਰਤੋਂ","ਗਈ।","ਆਪ","ਕਰਨਾ","ਵਿਆਹ","ਰਹੀ","ਰਾਹੀਂ","ਦਿੱਤੀ","ਉਸਦੇ","ਪਰਿਵਾਰ","ਆ","20","ਦੂਜੇ","ਅਮਰੀਕਾ","ਮੰਨਿਆ","ਇਸਦੇ","ਈ","ਕਾਲਜ","ਸਰਕਾਰ","ਇੱਥੇ","ਪਾਕਿਸਤਾਨ","ਸ਼ਾਮਿਲ","ਵਿਗਿਆਨ","ਉਸਦੀ","ਪੇਸ਼","ਕਿਉਂਕਿ","ਪਹਿਲੇ","ਧਰਮ","ਦਿੱਤਾ।","ਮਸ਼ਹੂਰ","ਅੰਦਰ","12","ਵਿਚੋਂ","ਜਿਨ੍ਹਾਂ","ਜਾਣਿਆ","ਪਾਣੀ","ਇਲਾਵਾ","ਅਰਥ","ਚਾਰ","ਪ੍ਰਸਿੱਧ","ਨਾਵਲ","ਵੱਡੇ","ਵੱਲੋਂ","ਕਹਾਣੀ","ਵਿਸ਼ਵ","ਮੂਲ","ਅਮਰੀਕੀ","ਸਥਾਨ","ਇਤਿਹਾਸ","11","ਕੁੱਝ","ਵਿਕਾਸ","ਉੱਤਰ","ਸਿੱਖਿਆ","ਹਿੰਦੀ","ਪ੍ਰਮੁੱਖ","ਰਚਨਾ","ਗਏ।","ਬਣਾਇਆ","ਵਿਸ਼ੇਸ਼","15","ਡਾ","ਉੱਪਰ","ਪੱਛਮੀ","ਦੇਣ","ਇਸਦਾ","ਸਕਦੇ","ਰੱਖਿਆ","ਕਵੀ","ਦਿੱਲੀ","ਵੱਡੀ","ਭੂਮਿਕਾ","ਸਮਾਜ","ਕਾਵਿ","ਕੀ","ਕੋਲ","ਦ","ਗੱਲ","ਸੰਸਾਰ","ਭਾਗ","ਆਈ","ਦੱਖਣ","ਅੱਜ","ਸਿੱਖ","ਕਹਿੰਦੇ","ਸੰਗੀਤ","ਕਿਲੋਮੀਟਰ","ਜਿਹਨਾਂ","ਸਭਾ","ਜਿਸਦਾ","ਜਨਵਰੀ","13","ਕਵਿਤਾ","ਮੈਂਬਰ","ਲਿਖਿਆ","ਮਾਂ","ਕਲਾ","ਪੰਜ","ਥਾਂ","ਹੇਠ","ਜਿਆਦਾ","ਵਰਤਿਆ","ਮਾਰਚ","ਡੀ","ਅਕਤੂਬਰ",
"14","19","ਤਕ","16","ਨਾਟਕ","ਬੀ","ਖਾਸ","ਇਸੇ","ਆਧੁਨਿਕ","ਅਗਸਤ","ਤਿਆਰ","ਮਾਤਾ","18","ਬਣਾਉਣ","ਨਵੰਬਰ","ਵਿਅਕਤੀ","ਦੱਖਣੀ","ਦਸੰਬਰ","ਆਫ","ਗੀਤ","ਗਿਣਤੀ","ਕਾਲ","ਖੋਜ","ਸਾਲਾਂ","ਪੂਰੀ","ਸਮਾਂ","ਜ਼ਿਆਦਾ","ਇਸਦੀ","ਸਕਦੀ","ਵਿਚਕਾਰ","ਰਾਜਧਾਨੀ","30","ਉਸਦਾ","ਲਿਆ।","ਜੁਲਾਈ","ਹੋਈ।","ਜੂਨ","ਅਧੀਨ","ਸਥਾਪਨਾ","ਸੇਵਾ","ਭਾਵ","ਵਰਗ","ਛੋਟੇ","ਦਿੰਦਾ","ਸਮਾਜਿਕ","ਹੁੰਦੀਆਂ","ਟੀਮ","ਔਰਤਾਂ","ਅਕਸਰ","ਪ੍ਰਕਾਸ਼ਿਤ","17","ਉਰਦੂ","ਰੰਗ","ਪਾਰਟੀ","ਬਣਾ","ਪ੍ਰਭਾਵ","ਸ਼ੁਰੂਆਤ","ਲਗਭਗ","ਮਈ","ਸਿਰਫ","ਨੇੜੇ","ਜਿਸਨੂੰ","ਹਾਲਾਂਕਿ","ਦੂਰ","ਸਤੰਬਰ","ਕਿਤਾਬ","2011","ਕਦੇ","n","ਉੱਤਰੀ","ਪ੍ਰਕਾਰ","ਇਸਨੇ","ਪ੍ਰਦੇਸ਼","ਅੱਗੇ","ਸੰਯੁਕਤ","ਪੜ੍ਹਾਈ","ਵਧੇਰੇ","ਨਾਲ਼","ਮਨੁੱਖ","000","ਬਾਕੀ","ਪ੍ਰਧਾਨ","ਦੂਜੀ","ਕੁੱਲ","ਆਫ਼","ਅਧਿਐਨ","ਰਾਸ਼ਟਰੀ","ਪੁੱਤਰ","ਅੰਤਰਰਾਸ਼ਟਰੀ","ਧਰਤੀ","ਕੇਂਦਰ","ਦੇਸ਼ਾਂ","ਮੱਧ","ਜ਼ਿਲ੍ਹਾ","ਸਾਰੀਆਂ","ਪੱਧਰ","2012","ਹੋਵੇ",ਜੇ","ਭਾਈ","ਰਹਿਣ","ਪੁਰਸਕਾਰ","ਸਭਿਆਚਾਰ","ਪਤਾ","ਪਾਸੇ","ਨਵੇਂ","ਕੰਪਨੀ","ਬਾਹਰ","ਵੇਲੇ","ਸੰਨ","25","ਪੂਰਬੀ","ਵਿਚਾਰ","e","ਕਾਰਜ","ਪੀ","ਮਹੱਤਵਪੂਰਨ","ਦੁਨੀਆਂ","ਧਾਰਮਿਕ","ਮਨੁੱਖੀ","ਸਮੂਹ","ਅਜਿਹੇ","ਲਾਲ","ਦੂਜਾ","ਭਰਾ","ਸ੍ਰੀ","ਅੰਤ","ਜਾਂਦੀਆਂ","i","ਸ਼ਾਹ","ਰਹਿੰਦੇ","ਮਹਾਨ","ਚੀਨ","ਮੀਟਰ","ਵਰਗੇ","ਨਾਲੋਂ","ਹਾਸਲ","ਕਿਸਮ","ਅਜਿਹਾ","ਬਣਿਆ","ਭਰ","ਛੱਡ","ਲੈਣ","ਹਿੱਸੇ","29","ਟੀ","ਲਿਖੇ","ਮਿਲ","ਮੌਜੂਦ","ਦਿੱਤੇ","ਵਾਸਤੇ","ਰਿਹਾ।","ਵਾਲੀਆਂ","ਵਧੀਆ","ਰੂਸੀ","ਜਾਰੀ","ਸਰਕਾਰੀ","ਡਿਗਰੀ","2014","ਪੱਛਮ","ਲੜਾਈ","ਭਾਸ਼ਾਵਾਂ","ਰਾਜਾ","the","ਜਲੰਧਰ","ਹਿੰਦੂ","ਔਰਤ","ਜੰਗ","ਬਾਬਾ","ਬੱਚਿਆਂ","ਮੰਤਰੀ","ਪਟਿਆਲਾ","ਵਾਂਗ","a","ਆਉਣ","ਭਾਵੇਂ","ਕੇਵਲ","21","ਐਸ","ਪ੍ਰਾਚੀਨ","ਰਹਿੰਦਾ","ਬੋਲੀ","ਅਵਾਰਡ","ਨਗਰ","ਖੇਡਾਂ","ਫਿਲਮਾਂ","ਬੱਚੇ","ਕੌਰ","ਤੋ","ਪ੍ਰਤੀ","ਕੁਆਂਟਮ","ਅਬਾਦੀ","ਪੁਸਤਕ","ਐਮ","ਰਾਮ","ਖੇਤਰਾਂ","ਫਰਵਰੀ","ਕ੍ਰਿਕਟ","ਪੈਂਦਾ","ਇਤਿਹਾਸਕ","ਲੱਗ","ਬ੍ਰਿਟਿਸ਼","ਆਇਆ","ਮਿਲਦਾ"]
Also, should we leave the numbers 0-9 in the list?
Sorry, I was out for a day. The output from the stopword-trainer looks good, and we should have 0-9 in the list.
No stress :smile: I'll create a test and add it to the library.
Now it's published as v.0.1.13. I removed the words that had । attached at the end and checked that the words without it were available in the stopword lists. For the list to be even a little better, I could remove all the । at the end of words so the calculation is fully correct.
Also removed some a-z letters. Some were English single-character words, and some were left over from crawling text. Stuff like "new line" etc.
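To show what "remove all the । at the end of words so the calculation is fully correct" could look like, here is a small Python sketch that folds the counts of danda-suffixed forms into their base words before ranking. The counts are invented for illustration, and `merge_danda` is a hypothetical helper, not part of the stopword-trainer.

```python
from collections import Counter

DANDA = "\u0964"  # "।"


def merge_danda(counts):
    """Add the count of each 'word।' form to 'word' and drop the bare danda."""
    merged = Counter()
    for word, count in counts.items():
        base = word.rstrip(DANDA)
        if base:  # an empty base means the token was only a danda
            merged[base] += count
    return merged


# Invented counts, only to show the effect on the ranking.
raw_counts = {"ਦਾ": 12, "ਹੈ": 10, "ਹੈ।": 7, "।": 3}
merged = merge_danda(raw_counts)

print(merged.most_common())
```

In this toy example "ਹੈ" moves ahead of "ਦਾ" once its two spellings are counted together, which is the kind of shift the merged calculation would capture.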
That's great, @eklem
Thanks for your work, @manmeet3591 ! And let me know if you have any issues with the stopword-list.
Could base it on this paper, but I'm not sure what the license situation is: http://ijoes.vidyapublications.com/paper/Vol8/15-Vol8.pdf