aberba / twi-synonyms

Synonyms of English words in Twi
GNU General Public License v3.0

Write scripts to extract Twi words from files #1

Open aberba opened 4 years ago

aberba commented 4 years ago

The goal is to have scripts that automate extracting Twi words from documents (PDF, HTML, docx, Markdown, etc.) and writing them into a new file, one word per line.
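As a rough illustration of that extraction step, here is a minimal Python sketch for plain-text input. The function names and the definition of a "word" (a run of Unicode word characters, excluding digits and underscores) are assumptions made for this example, not something the repository prescribes. PDF and docx files would first need to be converted to text by other tools.

```python
import re

# Assumption: a "word" is a maximal run of Unicode word characters that are
# not digits or underscores, so letters such as ɔ and ɛ are kept while
# commas, spaces and other punctuation act as separators.
WORD_PATTERN = re.compile(r"[^\W\d_]+")


def extract_words(text):
    """Return the words found in text, in order of appearance."""
    return WORD_PATTERN.findall(text)


def to_lines(words):
    """Format words as file content with one word per line."""
    return "".join(word + "\n" for word in words)


def extract_file(source_path, dest_path):
    """Read source_path as UTF-8 and write its words to dest_path."""
    with open(source_path, encoding="utf-8") as source:
        words = extract_words(source.read())
    with open(dest_path, "w", encoding="utf-8") as dest:
        dest.write(to_lines(words))
```

Calling `extract_file("input.md", "output.md")` would process one file; both file names are placeholders.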

Once we have a large enough database of Twi words, we can then write software to host all that data and enable regular people to suggest English translations. See the README of this repository for what the final synonyms database will look like.

Example:

    kofi, ama, Kɔla, adwuma

to

    kofi
    ama
    Kɔla
    adwuma

otengkwame commented 4 years ago

I like this idea, but PHP is my core language. I know it can be done with it, though. My big issue here is: how do you differentiate a Twi word from an English word in a PDF, HTML or docx file?

aberba commented 4 years ago

You can't. I'm actually writing a separate program with a GUI to do the actual filtering. For now we just need the words, as many as possible. See https://JW.org/tw for some words to copy. Copy everything from each page, paste it all into a file and run the script on it to extract the words. Then submit the words as a pull request, preferably in a Markdown file.
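If a page is saved as an HTML file rather than copied by hand, the visible text could be pulled out before splitting it into words. The sketch below uses only Python's standard `html.parser` module. The class and function names are made up for this example, and skipping only `script` and `style` content is a simplifying assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of script and style tags."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML document, chunks joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The returned string can then be split into words and written out one word per line.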

aberba commented 4 years ago

You can use this D programming language code to do that. PHP can be hard for such a task. Install the D compiler and run this code on any file. See https://github.com/aberba/learn-coding for how to install the compiler.

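    import std;

    void main()
    {
        // Words are written to output.md, one per line.
        auto dest = File("output.md", "w");

        // Read the whole input file, split it on whitespace and write each
        // piece on its own line. Punctuation attached to a word is kept.
        readText("./input.md")
            .splitter
            .each!(word => dest.writeln(word));
    }

otengkwame commented 4 years ago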

@aberba I think you did not get my question. I thought you wanted a script that extracts Twi words from documents, and I wanted to know how to differentiate Twi words from words in other languages.

If it is about splitting text into words, again PHP is the simplest. For the record, I have been developing with PHP for 8 years now.

    // Split the text on single spaces into an array of words.
    $text = "a bunch of words";
    $list_words = explode(" ", $text);

aberba commented 4 years ago

@otengkwame then write them into a new Markdown file, one word per line, just like I demonstrated in the code.

You'll soon see what I mean by PHP being hard for such a task. It's more that it won't be well suited at some point, due to its design. We'll be doing lots of Unicode string handling and the like going forward. I also have a background in PHP.

For now it should do, though.
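To give an idea of the kind of Unicode handling that may come up, here is a small Python sketch that normalizes words before removing duplicates. Treating words that differ only in letter case as the same entry is an assumption made for this example (it would also lowercase names), and the function names are hypothetical.

```python
import unicodedata


def normalize_word(word):
    """Compose characters (NFC) and case-fold so that equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", word).casefold()


def unique_words(words):
    """Return normalized words without duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for word in words:
        key = normalize_word(word)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

NFC matters because a letter with a tone mark can be stored either as one code point or as a base letter followed by a combining mark. Without normalization the two forms would be counted as different words.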