datameet-pune / datameet-pune.github.io

Common repo and documentation space for DataMeet Pune chapter
https://sites.google.com/view/datameetpune/home
GNU General Public License v3.0

project: Adding Indian place names to Spellcheck Dictionaries #5

Open answerquest opened 6 years ago

answerquest commented 6 years ago

Background: this idea came up at the Gnunify 2018 event on 17 Feb, in a session on hacking LibreOffice conducted by @geekgod. Session notes: https://etherpad.net/p/LibreOffice-Hackathon-Gnunify

Initial task list:

  1. District Census Handbook page: http://www.censusindia.gov.in/2011census/dchb/DCHB.html
  2. Download the Excel files for each state under the "Town Amenities" and "Village Amenities" headings.
  3. Find the worksheet and column for (a) districts and (b) sub-districts and, if desired, (c) towns and (d) villages.
  4. Extract the data. Take care to exclude headers.
  5. Remove duplicates.
  6. Remove artefacts such as "(MC)", hyphens, asterisks, etc.
  7. Isolate multi-word entries and decide how to handle them. One option is to add each word as a separate entry and then remove the duplicates.
  8. Diff against the existing dictionary to find the place words it does not yet contain.
  9. Push this list upstream to update the LibreOffice dictionary, and possibly dictionaries used elsewhere.
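
Steps 5 to 8 could be sketched roughly as below. This is only a starting point and not a tested pipeline: the function names (`clean_name`, `place_words`, `parse_dic_lines`, `missing_words`) are made up for illustration, the artefact patterns are guesses based on the "(MC)" example above, and it assumes the names are in ASCII English and that the dictionary is a Hunspell-style `.dic` file (first line is a word count, then one word per line with optional `/FLAGS`).

```python
import re

# Assumed artefact patterns; adjust after looking at the real census sheets.
PAREN_TAG = re.compile(r"\([^)]*\)")      # parenthesised tags such as "(MC)"
NON_LETTER = re.compile(r"[^A-Za-z\s]")   # hyphens, asterisks, digits, dots


def clean_name(raw):
    """Step 6: strip parenthesised tags and punctuation, collapse spaces."""
    text = PAREN_TAG.sub(" ", raw)
    text = NON_LETTER.sub(" ", text)
    return " ".join(text.split())


def place_words(raw_names):
    """Steps 5 and 7: split names into single words and de-duplicate."""
    words = set()
    for raw in raw_names:
        for word in clean_name(raw).split():
            if len(word) > 1:  # drop stray single letters left by cleaning
                # Normalising to "Pune"-style capitalisation is an assumption.
                words.add(word.capitalize())
    return words


def parse_dic_lines(lines):
    """Read words from Hunspell-style .dic lines (assumed format)."""
    words = set()
    for index, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue
        if index == 0 and entry.isdigit():
            continue  # first line holds the approximate word count
        # Keep only the word itself, dropping "/FLAGS" and tab-separated data.
        words.add(entry.split("\t")[0].split("/")[0])
    return words


def missing_words(candidates, dictionary_words):
    """Step 8: place words that the dictionary does not contain yet."""
    return sorted(word for word in candidates if word not in dictionary_words)


if __name__ == "__main__":
    sample_names = ["Pune (M Corp.)", "Pimpri-Chinchwad (MC)", "Navi Mumbai *"]
    sample_dic = ["3", "Pune", "Mumbai/M", "hello"]
    candidates = place_words(sample_names)
    print(missing_words(candidates, parse_dic_lines(sample_dic)))
```

For a real run, the sample lists would be replaced by the column values extracted from the Excel files and by the lines of the actual dictionary file. The comparison is case-sensitive, so the capitalisation choice in `place_words` matters and may need revisiting.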