fossasia / loklak_scraper_js

Scrapers for loklak in javascript
GNU Lesser General Public License v2.1

Process date fetched to standard format in TimeAndDate scraper #49

Open vibhcool opened 7 years ago

vibhcool commented 7 years ago

The TimeAndDate scraper does not convert the date it fetches into a standard format that can be used directly, such as: Thu Apr 06 15:14:32 IST 2017 or 2017-04-06T09:44:32.000Z

This scraper needs to post-process the fetched data.
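A minimal sketch of the kind of post-processing meant here, assuming the scraper can already extract numeric date and time fields plus the location's UTC offset. The helper name `toIsoUtc` and the shape of its arguments are hypothetical and not part of loklak_scraper_js.

```javascript
// Hypothetical helper, not part of loklak_scraper_js.
// Converts local wall-clock fields plus a UTC offset (in minutes)
// into an ISO 8601 string in UTC.
function toIsoUtc(parts, utcOffsetMinutes) {
  // Date.UTC interprets the fields as if they were UTC, so the offset
  // is subtracted afterwards to recover the actual instant.
  const localAsUtcMs = Date.UTC(
    parts.year, parts.month - 1, parts.day,
    parts.hour, parts.minute, parts.second
  );
  return new Date(localAsUtcMs - utcOffsetMinutes * 60 * 1000).toISOString();
}

// 15:14:32 on 6 April 2017 in IST (UTC+05:30, i.e. 330 minutes)
console.log(toIsoUtc(
  { year: 2017, month: 4, day: 6, hour: 15, minute: 14, second: 32 },
  330
));
// prints "2017-04-06T09:44:32.000Z"
```

This reproduces the two example values above: 15:14:32 IST and 09:44:32 UTC are the same instant.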

vibhcool commented 7 years ago

@brainstormm is working on this issue.

brainstormm commented 7 years ago

Analysis so far:

Suppose I query for London: https://www.timeanddate.com/worldclock/results.html?query=London

Analysis:
1. The day and month names come back in Hindi. For example, "Wed" becomes "बुधवार" and "August" becomes "अगस्त".
2. The time format changes as well. For example, "12:21" becomes "12.21" (note the colon replaced by a dot).
3. Inspecting the page source directly in the browser ("view source") shows no language change.
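One possible workaround for points 1 and 2 is to normalize the scraped string locally instead of translating it. The sketch below is only an illustration: the function name, the lookup tables, and the sample input string are hypothetical. Only "बुधवार" and "अगस्त" were actually observed above; the other entries are the usual Hindi names, and the spellings the site returns may differ.

```javascript
// Hypothetical lookup tables. Only "बुधवार" and "अगस्त" were observed in
// this thread; the remaining spellings are assumptions to be verified.
const HINDI_DAYS = {
  'सोमवार': 'Mon', 'मंगलवार': 'Tue', 'बुधवार': 'Wed', 'गुरुवार': 'Thu',
  'शुक्रवार': 'Fri', 'शनिवार': 'Sat', 'रविवार': 'Sun'
};
const HINDI_MONTHS = {
  'जनवरी': 'January', 'फरवरी': 'February', 'मार्च': 'March',
  'अप्रैल': 'April', 'मई': 'May', 'जून': 'June', 'जुलाई': 'July',
  'अगस्त': 'August', 'सितंबर': 'September', 'अक्टूबर': 'October',
  'नवंबर': 'November', 'दिसंबर': 'December'
};

// Replaces known Hindi day/month names with English ones and turns a
// dotted time such as "12.21" back into "12:21".
// Assumption: the only "digits.digits" pair in the text is the time.
function normalizeScrapedDate(text) {
  const table = Object.assign({}, HINDI_DAYS, HINDI_MONTHS);
  let out = text;
  for (const [hindi, english] of Object.entries(table)) {
    out = out.split(hindi).join(english);
  }
  return out.replace(/\b(\d{1,2})\.(\d{2})\b/g, '$1:$2');
}

// The input format here is made up for illustration.
console.log(normalizeScrapedDate('बुधवार, 16 अगस्त 2017, 12.21'));
// prints "Wed, 16 August 2017, 12:21"
```

A fixed table only covers Hindi. If the site serves other languages to other users, this approach would need one table per language.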

Doubts/Queries:
1. Is the output always converted to Hindi? If a user is in Russia or some other country, will they see Hindi or some other language?
2. Why is the output converted to Hindi when this is just simple scraping? Is there a bug in the request library?
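One thing that might be worth checking for query 2 is whether the response language depends on what the client sends. The sketch below builds options for the request library with an explicit Accept-Language header. It is an unverified assumption that the site uses this header (rather than, say, the client's location) to choose the language, and the helper name `buildRequestOptions` is hypothetical.

```javascript
// Hypothetical helper: builds an options object for the `request` library
// that asks the server for English content.
// Assumption (not confirmed in this thread): the site honours Accept-Language.
function buildRequestOptions(query) {
  return {
    url: 'https://www.timeanddate.com/worldclock/results.html',
    qs: { query: query },                            // appended as ?query=...
    headers: { 'Accept-Language': 'en-US,en;q=0.9' } // prefer English
  };
}

// Usage with the request library (not executed here):
// const request = require('request');
// request(buildRequestOptions('London'), (err, res, body) => { /* parse body */ });
console.log(buildRequestOptions('London'));
```

If the day and month names still come back in Hindi with this header set, the header can be ruled out as the cause.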

Options tried:
1. Tried different UTF-8 encodings: none worked.
2. Google Translate API: it is not free, so I dropped the idea.
3. Translating the Hindi text to English through a Google Translate URL query (https://translate.google.com/#auto/en/बुधवार) and then scraping the result: Google is smarter than us, and the keyword "Wednesday" cannot be found in the scraped result, so I dropped this idea too.