Closed ans-4175 closed 2 years ago
Hey! I wish to attempt this issue. Here's my rough idea so far: puppeteer + cheerio. The first goes to the page, the second parses the HTML. I'm going to fetch the nouns from https://kbbi.kata.web.id/kelas-kata/kata-benda/ and the adjectives from https://kbbi.kata.web.id/kelas-kata/kata-sifat. Since the site has a lot of pages (~1k+ for nouns and ~300 for adjectives), I will probably take random page numbers and read the text content of `article > dl > dt > a`. The scraped result would be shaped like:

```ts
interface WordsDictionary {
  adjectives: string[];
  nouns: string[];
}
```
I have yet to test (1), the most important part of this idea, though. I will do so in a bit.
WDYT @ans-4175?
The first uncertainty we would tackle: how big would that data be? That's assuming the scraping can be done at all.

Regarding the time-to-interactivity of the page: what if we have a vast amount of data? The client app could "download" certain lists and store them in localStorage, so we would read them from localStorage afterwards. That would help a lot with network transfer on the first load. Users could also ask for more words to be downloaded if needed.
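That read-through-localStorage flow could be sketched roughly like this (a minimal sketch; `loadWordList` and the storage shim are hypothetical names, not existing code in the app — in the browser the real `window.localStorage` would be used directly):

```js
// Minimal in-memory shim so the sketch also runs under Node;
// in the browser, the real window.localStorage is used instead.
const storage =
  typeof localStorage !== 'undefined'
    ? localStorage
    : (() => {
        const map = new Map();
        return {
          getItem: (k) => (map.has(k) ? map.get(k) : null),
          setItem: (k, v) => map.set(k, String(v)),
        };
      })();

// Read-through cache: return the cached list if present,
// otherwise fetch once and persist it for later visits.
async function loadWordList(key, fetchWords) {
  const cached = storage.getItem(key);
  if (cached !== null) return JSON.parse(cached);
  const words = await fetchWords();
  storage.setItem(key, JSON.stringify(words));
  return words;
}
```

On a second call with the same key, `fetchWords` is never invoked, which is exactly the network saving described above.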
At first I wanted to make this app plain dead simple, but I'm thinking of moving it to some kind of fullstack framework, so that this app would have FE+BE in one code structure. If we are going to do that, we need another issue that tackles moving the codebase from CRA to a new one. I could do this, with Razzle as my first choice for creating a fullstack app :)
But to make a PoC on this one, you could "scrape" a certain list and put it directly in password-ga-kata.js, then check password strength and randomness. For me, localStorage seems a good choice if needed.
Also, we could put the JSON in some cloud storage (like JSONBin) and then fetch it gradually, bucket by bucket, so that it would not affect the page's first render and we would not need to change our codebase structure.
> what if we have a vast amount of data, then client app would "download" certain lists and store it in localstorage? So basically we could called it from localstorage, that would help lot on network transfer at first. User also can ask for more words to be download if needed.
Yeah, that seems like a good option. By saving to local storage, users don't have to re-fetch the words on subsequent visits. One constraint I can think of right now is that the max size of the entire local storage is 5MB: https://developer.mozilla.org/en-US/docs/Web/API/Web_Storage_API -- in case it even gets that big.
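To sanity-check that 5MB constraint: localStorage stores UTF-16 strings, so the serialized list's size is easy to estimate (a back-of-envelope sketch; `estimateStorageBytes` is a hypothetical helper, and the word counts in the comment are rough assumptions derived from the ~1k+ page count above, not measurements):

```js
// localStorage stores strings as UTF-16, so roughly 2 bytes per
// code unit of the serialized JSON.
function estimateStorageBytes(words) {
  return JSON.stringify(words).length * 2;
}

// Back-of-envelope (assumptions, not measurements): ~1,000 noun pages
// at ~10 words per page is ~10,000 words; at ~12 JSON characters per
// word, that is about 10000 * 12 * 2 = ~240 KB -- well under the 5 MB cap.
```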
> But I think to make PoC on this one, you could "scrape" certain list and put it in directly in password-ga-kata.js then check password strength and randomness. For me, localstorage seems good to go choices if needed.
Alright!
> At first I want to make this app plain dead simple, but thinking of moving it to some kind fullstack framework. So that this app would have FE+BE in one code structure. If we are going to make this, we need to have another issues that tackle moving codebase from CRA to new one. I could do this, with Razzle as my first choice of creating fullstack app :)
Interesting, I just looked at the Razzle examples. At the moment my go-to is Next.js, but comparing it with Razzle, I take it that Razzle is more "composable" in some ways? Like this file:
```js
import express from 'express';
// (the import of renderApp is omitted in the original snippet)

const server = express();
server
  .disable('x-powered-by')
  .use(express.static(process.env.RAZZLE_PUBLIC_DIR))
  .get('/*', (req, res) => {
    const { html } = renderApp(req, res);
    res.send(html);
  });

export default server;
```
That part is most interesting to me. That level of specificity is usually abstracted in other frameworks.
@ans-4175 It seems to work:
```js
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://kbbi.kata.web.id/kelas-kata/kata-benda');

  // Grab the fully rendered HTML and hand it to cheerio.
  const data = await page.evaluate(() => document.querySelector('*').outerHTML);
  const $ = cheerio.load(data);

  const words = [];
  $('dl > dt > a').each((_idx, element) => {
    const word = $(element).text();
    // Clear parentheses (if any).
    words.push(word.replace(/[()]+/g, ''));
  });

  console.log(words);
  // [
  //   'kerobohan', 'ketikan',
  //   'aba', 'abad',
  //   'abadiah', 'abaimana',
  //   'abaka', 'abakus',
  //   'aban', 'abangan'
  // ]

  await browser.close();
})();
```
The same can be done for other pages:
```js
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://kbbi.kata.web.id/kelas-kata/kata-benda/page/705');

  // Same flow as before, just a different page number.
  const data = await page.evaluate(() => document.querySelector('*').outerHTML);
  const $ = cheerio.load(data);

  const words = [];
  $('dl > dt > a').each((_idx, element) => {
    const word = $(element).text();
    // Clear parentheses (if any).
    words.push(word.replace(/[()]+/g, ''));
  });

  console.log(words);
  // [
  //   'konservasi',
  //   'konservasionis',
  //   'konservatisme',
  //   'konservator',
  //   'konservatorium',
  //   'konsiderasi',
  //   'konsinyasi',
  //   'konsistensi',
  //   'konsistori',
  //   'konsolasi'
  // ]

  await browser.close();
})();
```
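Since the two snippets differ only in the page number, the random-page idea from earlier could be factored into small pure helpers (a sketch; `pickRandomPages` and `cleanWord` are hypothetical names, and ~1000 is the rough noun page count mentioned above):

```js
// Pick `count` distinct random page numbers in the range [1, maxPage].
function pickRandomPages(maxPage, count) {
  const pages = new Set();
  while (pages.size < count) {
    pages.add(1 + Math.floor(Math.random() * maxPage));
  }
  return [...pages];
}

// Same parentheses cleanup as in the scrapers above.
function cleanWord(word) {
  return word.replace(/[()]+/g, '');
}

// These would slot into the scraper loop roughly like:
//   for (const n of pickRandomPages(1000, 5)) {
//     await page.goto(`https://kbbi.kata.web.id/kelas-kata/kata-benda/page/${n}`);
//     // ...load into cheerio, push cleanWord($(el).text())...
//   }
```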
But while traversing the pages, I found some words that may be rude, sensitive, or improper, and I'm not sure what to do with them :sweat_smile: do we want to create a blocklist or something?
I think we could ignore the "rude" stuff for a while; first check how big the word data is, then try to chunk it in a manageable way.

At first, I think this issue should be scoped to: move the "hardcoded" list into a fetch from some endpoint, store it in localStorage, and read the list from localStorage.

In that sense, we could break the rest down into separate issues: create more endpoints, etc.

We could improve the list of kata first by adding some notable/syllable/known words to it, store it at an endpoint or somewhere else, then fetch it gradually. That way we could create a "word packs" mechanism for further improvement.
Alright, I'm going to rephrase to see if I understand the direction. So, what we are going to do:
- Now: implement the scraper, then scrape 500-1000 words for now
- After this issue: scrape all words, analyze the file size, then chunk them
- After the words are chunked: store them in storage of some sort so we can fetch them lazily
Is my understanding correct?
- Now: scrape a handful of data (500-ish each, or ~100KB), put it in a direct file in this app, and call it from password-ga-kata to check its randomness, etc.
- After this issue: add more words (optional), then store them somewhere else (an endpoint); make this app fetch it and store it in localStorage; password-ga-kata should query from localStorage (changing from direct access/flat files to localStorage)
- After that issue: scrape all data, discuss how many ~100KB (TBD) chunks that would be, and create a mechanism for the app to download word packs and store them in localStorage
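The ~100KB word-pack step above could be sketched as a byte-budgeted chunker (a hypothetical helper, assuming the packs are plain JSON arrays of words):

```js
// Split a word list into "word packs" whose serialized JSON stays
// under a byte budget (~100 KB in the roadmap above).
function chunkIntoPacks(words, maxBytes) {
  const packs = [];
  let pack = [];
  let bytes = 2; // the "[]" brackets
  for (const word of words) {
    const extra = JSON.stringify(word).length + 1; // quotes + comma
    if (pack.length > 0 && bytes + extra > maxBytes) {
      packs.push(pack);
      pack = [];
      bytes = 2;
    }
    pack.push(word);
    bytes += extra;
  }
  if (pack.length > 0) packs.push(pack);
  return packs;
}
```

Each pack could then be uploaded or fetched independently, which is what lets the app download word packs lazily and store them in localStorage one by one.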
Why localStorage, though? Because I think we could evolve this into a PWA as a next improvement, so that the app could run standalone (only fetching new word packs/chunks).
Sounds good, yes, I agree with the local storage approach. This will prevent the user from re-fetching the same chunk of words over and over.
Closed by #13 and continued to #14
This is a beta version of password by kata benda-sifat: https://github.com/ans-4175/password-ga/blob/main/src/libs/password-ga-kata.js#L1. It could be improved with a vaster array of words or a random API to get words.