Open aberba opened 4 years ago
I like this idea, but PHP is my core language. I know it can be done with it though. My big issue here is: how do you differentiate a Twi word from an English word in a PDF, HTML, or docx file?
You can't. I'm actually writing a separate program with a GUI to do the actual filtering. For now we just need the words, as many as possible. See https://JW.org/tw for some words to copy. Copy everything from each page, paste it all into a file, and run a script on it to extract the words. Then submit the words as a pull request, preferably in a Markdown file.
You can use this D programming language code to do that. PHP can be hard for such a task. Install the D compiler and run this code on any file. See https://github.com/aberba/learn-coding for how to install the compiler.
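import std;

void main()
{
    // Output file: one word per line.
    auto dest = File("output.md", "w");

    // Read the whole input, split it on whitespace, and write each word out.
    readText("./input.md")
        .splitter
        .each!(word => dest.writeln(word));
}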
@aberba I think you did not get my question. I thought you wanted a script that extracts Twi words from documents, and I wanted to know how to differentiate Twi words from words in other languages.
If it is only about splitting text into words, then again PHP is the simplest. For the record, I have been developing with PHP for 8 years now.
// Split on single spaces; the result is an array of words.
$text = "a bunch of words";
$list_words = explode(" ", $text);
@otengkwame then write them into a new Markdown file, one word per line, just like I demoed in the code.
You'll see soon what I mean by PHP being hard for such a task. More precisely, it will stop being a good fit at some point because of its design. We'll be doing lots of Unicode string handling going forward. I also have a background in PHP.
For now it should do, though.
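To make the Unicode point concrete, here is a minimal sketch of the kind of handling that comes up. It is written in Python purely for illustration (neither the D nor the PHP snippet above does this), and the function name `extract_words` and the regex are assumptions, not code from this repository. It normalizes text to NFC, treats letters such as ɛ and ɔ as ordinary word characters, keeps combining marks attached to the letter before them, and drops duplicates.

```python
import re
import unicodedata

# A word is a run of letters, each optionally followed by combining marks
# (U+0300 to U+036F). Digits, underscores and punctuation end a word.
WORD_RE = re.compile(r"(?:[^\W\d_][\u0300-\u036f]*)+")


def extract_words(text):
    """Return lowercase, NFC-normalized words in first-seen order, without duplicates."""
    # NFC merges a base letter and a combining mark into one code point where
    # a precomposed form exists, so the same word is not stored twice.
    normalized = unicodedata.normalize("NFC", text)
    seen = {}
    for match in WORD_RE.finditer(normalized):
        seen.setdefault(match.group().lower(), None)
    return list(seen)


if __name__ == "__main__":
    for word in extract_words("Ɛyɛ me dɛ, ɛyɛ!"):
        print(word)
```

Splitting on a single space, as `explode(" ", $text)` does, would leave the comma and the exclamation mark attached to the words and would treat `Ɛyɛ` and `ɛyɛ` as different entries.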
The goal is to have scripts available to automate extraction of Twi words from documents (PDF, HTML, docx, Markdown, etc.) and write them into a new file, one word per line.
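As a rough sketch of what one such script could look like for the HTML case, the Python below uses only the standard library. The names `TextExtractor`, `html_to_text`, `to_word_list` and `write_word_list` are hypothetical and chosen for illustration. PDF and docx need third-party libraries and are not covered, and nothing here tells Twi words apart from words in other languages.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from an HTML page, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def to_word_list(text):
    """Return unique lowercase words, one per line, in first-seen order."""
    words = dict.fromkeys(w.lower() for w in re.findall(r"[^\W\d_]+", text))
    return "".join(word + "\n" for word in words)


def write_word_list(html, path):
    """Extract words from an HTML string and write them to path (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as dest:
        dest.write(to_word_list(html_to_text(html)))
```

Calling `write_word_list(page_source, "output.md")` on a saved page would produce the same kind of one-word-per-line file as the D snippet, with markup, scripts and duplicate words removed.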
Once we have a large enough database of Twi words, we can then write software to host all that data and enable regular people to suggest English translations. See the README of this repository for what the final synonyms database will look like.
Example:
to