Closed ans-4175 closed 2 years ago
Hey! I wish to attempt this issue. Here's my rough idea so far: puppeteer + cheerio. The first goes to the page, the second parses the HTML. I'm going to fetch the nouns from https://kbbi.kata.web.id/kelas-kata/kata-benda/ and the adjectives from https://kbbi.kata.web.id/kelas-kata/kata-sifat. Since the site has a lot of pages (~1k+ for nouns and ~300 for adjectives), I will probably take random page numbers and read the text content of `article > dl > dt > a`. The scraped result would be shaped like:

```ts
interface WordsDictionary {
  adjectives: string[];
  nouns: string[];
}
```
I have yet to test (1), the most important part of this idea, though. I will do so in a bit.
WDYT @ans-4175?
The first uncertainty we would tackle: how big would that data be? That's assuming the scraping can be done at all.

Regarding the time-to-interactivity of the page: what if we have a vast amount of data? The client app could "download" certain lists and store them in localStorage, so we would read them from localStorage afterwards. That would help a lot with network transfer on the first load. Users could also ask for more words to be downloaded if needed.
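That read-through-localStorage flow could be sketched roughly like this (a minimal sketch; `loadWordList` and the storage shim are hypothetical names, not existing code in the app — in the browser the real `window.localStorage` would be used directly):

```js
// Minimal in-memory shim so the sketch also runs under Node;
// in the browser, the real window.localStorage is used instead.
const storage =
  typeof localStorage !== 'undefined'
    ? localStorage
    : (() => {
        const map = new Map();
        return {
          getItem: (k) => (map.has(k) ? map.get(k) : null),
          setItem: (k, v) => map.set(k, String(v)),
        };
      })();

// Read-through cache: return the cached list if present,
// otherwise fetch once and persist it for later visits.
async function loadWordList(key, fetchWords) {
  const cached = storage.getItem(key);
  if (cached !== null) return JSON.parse(cached);
  const words = await fetchWords();
  storage.setItem(key, JSON.stringify(words));
  return words;
}
```

On a second call with the same key, `fetchWords` is never invoked, which is exactly the network saving described above.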
At first I wanted to make this app plain dead simple, but I'm thinking of moving it to some kind of fullstack framework, so that this app would have FE+BE in one code structure. If we are going to do that, we need another issue that tackles moving the codebase from CRA to a new one. I could do this, with Razzle as my first choice for creating a fullstack app :)
But to make a PoC on this one, you could "scrape" a certain list and put it directly in password-ga-kata.js, then check password strength and randomness. For me, localStorage seems a good choice if needed.
Also, we could put the JSON in some cloud storage (like JSONBin) and then fetch it gradually, bucket by bucket, so that it would not affect the page's first render and we would not need to change our codebase structure.
> what if we have a vast amount of data, then client app would "download" certain lists and store it in localstorage? So basically we could called it from localstorage, that would help lot on network transfer at first. User also can ask for more words to be download if needed.
Yeah, that seems like a good option. By saving to local storage, users don't have to re-fetch the words on subsequent visits. One constraint I can think of right now is that the max size of the entire local storage is 5MB: https://developer.mozilla.org/en-US/docs/Web/API/Web_Storage_API -- in case it even gets that big.
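To sanity-check that 5MB constraint: localStorage stores UTF-16 strings, so the serialized list's size is easy to estimate (a back-of-envelope sketch; `estimateStorageBytes` is a hypothetical helper, and the word counts in the comment are rough assumptions derived from the ~1k+ page count above, not measurements):

```js
// localStorage stores strings as UTF-16, so roughly 2 bytes per
// code unit of the serialized JSON.
function estimateStorageBytes(words) {
  return JSON.stringify(words).length * 2;
}

// Back-of-envelope (assumptions, not measurements): ~1,000 noun pages
// at ~10 words per page is ~10,000 words; at ~12 JSON characters per
// word, that is about 10000 * 12 * 2 = ~240 KB -- well under the 5 MB cap.
```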
> But I think to make PoC on this one, you could "scrape" certain list and put it in directly in password-ga-kata.js then check password strength and randomness. For me, localstorage seems good to go choices if needed.
Alright!
> At first I want to make this app plain dead simple, but thinking of moving it to some kind fullstack framework. So that this app would have FE+BE in one code structure. If we are going to make this, we need to have another issues that tackle moving codebase from CRA to new one. I could do this, with Razzle as my first choice of creating fullstack app :)
Interesting, I just looked at the Razzle examples. At the moment my go-to is Next.js, but comparing it with Razzle, I take it that Razzle is more "composable" in some ways? Like this file:
```js
import express from 'express';
// (the import of renderApp is omitted in the original snippet)

const server = express();
server
  .disable('x-powered-by')
  .use(express.static(process.env.RAZZLE_PUBLIC_DIR))
  .get('/*', (req, res) => {
    const { html } = renderApp(req, res);
    res.send(html);
  });

export default server;
```
That part is most interesting to me. That level of specificity is usually abstracted in other frameworks.
@ans-4175 It seems to work:
```js
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://kbbi.kata.web.id/kelas-kata/kata-benda');

  // Grab the fully rendered HTML and hand it to cheerio.
  const data = await page.evaluate(() => document.querySelector('*').outerHTML);
  const $ = cheerio.load(data);

  const words = [];
  $('dl > dt > a').each((_idx, element) => {
    const word = $(element).text();
    // Clear parentheses (if any).
    words.push(word.replace(/[()]+/g, ''));
  });

  console.log(words);
  // [
  //   'kerobohan', 'ketikan',
  //   'aba', 'abad',
  //   'abadiah', 'abaimana',
  //   'abaka', 'abakus',
  //   'aban', 'abangan'
  // ]

  await browser.close();
})();
```
The same can be done for other pages:
```js
const puppeteer = require('puppeteer');
const cheerio = require('cheerio');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://kbbi.kata.web.id/kelas-kata/kata-benda/page/705');

  // Same flow as before, just a different page number.
  const data = await page.evaluate(() => document.querySelector('*').outerHTML);
  const $ = cheerio.load(data);

  const words = [];
  $('dl > dt > a').each((_idx, element) => {
    const word = $(element).text();
    // Clear parentheses (if any).
    words.push(word.replace(/[()]+/g, ''));
  });

  console.log(words);
  // [
  //   'konservasi',
  //   'konservasionis',
  //   'konservatisme',
  //   'konservator',
  //   'konservatorium',
  //   'konsiderasi',
  //   'konsinyasi',
  //   'konsistensi',
  //   'konsistori',
  //   'konsolasi'
  // ]

  await browser.close();
})();
```
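Since the two snippets differ only in the page number, the random-page idea from earlier could be factored into small pure helpers (a sketch; `pickRandomPages` and `cleanWord` are hypothetical names, and ~1000 is the rough noun page count mentioned above):

```js
// Pick `count` distinct random page numbers in the range [1, maxPage].
function pickRandomPages(maxPage, count) {
  const pages = new Set();
  while (pages.size < count) {
    pages.add(1 + Math.floor(Math.random() * maxPage));
  }
  return [...pages];
}

// Same parentheses cleanup as in the scrapers above.
function cleanWord(word) {
  return word.replace(/[()]+/g, '');
}

// These would slot into the scraper loop roughly like:
//   for (const n of pickRandomPages(1000, 5)) {
//     await page.goto(`https://kbbi.kata.web.id/kelas-kata/kata-benda/page/${n}`);
//     // ...load into cheerio, push cleanWord($(el).text())...
//   }
```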
But while traversing the pages, I found some words that may be rude, sensitive, or improper, and I'm not sure what to do with them :sweat_smile: do we want to create a blocklist or something?
I think we could ignore the "rude" stuff for a while; first check how big the word data is, then try to chunk it in a manageable way.

At first, I think this issue should be scoped to: move the "hardcoded" list into a fetch from some endpoint, store it in localStorage, and read the list from localStorage.

In that sense, we could break the rest down into separate issues: create more endpoints, etc.

We could improve the list of kata first by adding some notable/syllable/known words to it, store it at an endpoint or somewhere else, then fetch it gradually. That way we could create a "word packs" mechanism for further improvement.
Alright, I'm going to rephrase to see if I understand the direction. So, what we are going to do:
- Now: implement the scraper, then scrape 500-1000 words for now
- After this issue: scrape all words, analyze the file size, then chunk them
- After the words are chunked: store them in storage of some sort so we can fetch them lazily
Is my understanding correct?
- Now: scrape a handful of data (500-ish each, or ~100KB), put it in a direct file in this app, and call it from password-ga-kata to check its randomness, etc.
- After this issue: add more words (optional), then store them somewhere else (an endpoint); make this app fetch it and store it in localStorage; password-ga-kata should query from localStorage (changing from direct access/flat files to localStorage)
- After that issue: scrape all data, discuss how many ~100KB (TBD) chunks that would be, and create a mechanism for the app to download word packs and store them in localStorage
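The ~100KB word-pack step above could be sketched as a byte-budgeted chunker (a hypothetical helper, assuming the packs are plain JSON arrays of words):

```js
// Split a word list into "word packs" whose serialized JSON stays
// under a byte budget (~100 KB in the roadmap above).
function chunkIntoPacks(words, maxBytes) {
  const packs = [];
  let pack = [];
  let bytes = 2; // the "[]" brackets
  for (const word of words) {
    const extra = JSON.stringify(word).length + 1; // quotes + comma
    if (pack.length > 0 && bytes + extra > maxBytes) {
      packs.push(pack);
      pack = [];
      bytes = 2;
    }
    pack.push(word);
    bytes += extra;
  }
  if (pack.length > 0) packs.push(pack);
  return packs;
}
```

Each pack could then be uploaded or fetched independently, which is what lets the app download word packs lazily and store them in localStorage one by one.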
Why localStorage, though? Because I think we could evolve this into a PWA as a next improvement, so that the app could run standalone (only fetching new word packs/chunks).
Sounds good, yes, I agree with the local storage approach. This will prevent the user from re-fetching the same chunk of words over and over.
Closed by #13 and continued to #14
This is a beta version of password by kata benda-sifat: https://github.com/ans-4175/password-ga/blob/main/src/libs/password-ga-kata.js#L1. It could be improved with a vaster array of words or a random API to get words.