ana-kuznetsova / Popular-Science-Texts-Compling-research

An M.A. educational project on computational linguistics.

Review Polit.ru lectures & Indicator #4

Closed ana-kuznetsova closed 6 years ago

ana-kuznetsova commented 6 years ago

Review and analyse http://www.polit.ru/lectures/publ_lect/ and https://indicator.ru/ according to the checklist:

First-person speech; Set of rubrics & headings; Layout of expert's opinion in html code (if there is one).

JuliaKolomenskaya commented 6 years ago

Please note that this is a general overview; a thorough, detailed analysis will follow once all the data has been collected.

Polit.ru/lectures offers a remarkable variety of public lectures given by the representatives of various scientific fields.

Site structure: not HTML5, so there is no semantic markup, which makes the site harder to crawl.

Rubric structure: the articles can be filtered by author, by topic or chronologically (the best option is to crawl the pages chronologically, back to the earliest publication, dated 20.12.2004).

Article structure:

  1. The page template is the same for all the lecture articles (which is a great plus for further analysis)
  2. The main heading is encoded as 'h1.title' (an <h1> element with the class title); there is only one h1 per article, which is SEO best practice and also a huge plus. This heading states the main topic of the article.
  3. The lead paragraph can be found in a meta tag or by the css selector 'div.content div p em'. The lead paragraph is very important, since some articles contain only a video lecture or a lecture announcement. The lead paragraph of an article of interest contains 'мы публикуем стенограмму/расшифровку лекции' ("we are publishing the transcript of the lecture"), usually in the first sentence. It also contains the lecturer's name in bold (<strong></strong>).
  4. The interview structure is part of the text body. The participants' names are marked in bold at the beginning of each paragraph of speech.
  5. The discussion section is split off from the rest of the article by the keyword 'обсуждение' ("discussion").
  6. The user comments are disabled.
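
The two text-level checks from items 3 and 5 can be sketched in Python as follows. This is only a minimal illustration, not a tested crawler: the function names are hypothetical, and the assumption that the discussion starts at a line consisting only of the word 'обсуждение' has not been verified against the whole archive.

```python
import re
from typing import Tuple

# Marker phrase quoted in item 3; the pattern covers both the
# "стенограмму" and the "расшифровку" variants.
TRANSCRIPT_RE = re.compile(
    r"мы публикуем (стенограмму|расшифровку) лекции", re.IGNORECASE
)

# Assumption: the discussion starts at a line that contains only the
# word "обсуждение" (optionally followed by punctuation).
DISCUSSION_RE = re.compile(r"^\s*обсуждение\W*$", re.IGNORECASE | re.MULTILINE)


def is_transcript(lead: str) -> bool:
    """Return True if the lead paragraph announces a lecture transcript."""
    return TRANSCRIPT_RE.search(lead) is not None


def split_discussion(body: str) -> Tuple[str, str]:
    """Split the article text into (lecture, discussion).

    The discussion part is an empty string when no marker line is found.
    """
    match = DISCUSSION_RE.search(body)
    if match is None:
        return body.strip(), ""
    return body[:match.start()].strip(), body[match.end():].strip()
```

Pages whose lead paragraph fails `is_transcript` (video-only lectures, announcements) could then be skipped before any further parsing.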

JuliaKolomenskaya commented 6 years ago

Indicator by the Rambler&Co media holding is an information portal about science. It publishes the latest news from the Russian and world scientific community on a daily basis. This resource also carries polemical articles about the Russian scientific system and the relations between science and business (these can be of great interest for our research). The crucial content for us in this resource is lectures and interviews by famous scientists (BUT! Translated interviews with foreign scientists are also included. Should we consider them?).

Site structure: not HTML5, so there is no semantic markup, which makes the site harder to crawl.

Rubric structure: every piece of content is tagged with 1 of 10 topics (Astronomy, Biology, Humanities etc.). Apart from that, every publication is subcategorized as News, Discoveries of Russian Scientists or Discussion Club. We can also trace a further division of the articles into smaller rubrics that cannot be filtered on the site by default (their names are given either in the title or in the lead paragraph, e.g. "...рассказывает сегодняшний выпуск рубрики «История науки»", i.e. "...as told in today's issue of the «История науки» (History of Science) rubric"). For now the most plausible strategy seems to be crawling the topic pages, since all the genres will be found there anyway.

Article structure:

  1. The page template looks the same for almost all the articles (there are some exceptions).
  2. The main heading is encoded as div.headline__title; only one per page (again, this is a plus), but it can be semantically tricky (wordplay, puns etc.). The subtitle that follows gives more specific info, also only one per page (<h1></h1>). The subtitle can be omitted.
  3. The lead paragraph can be found by the css selector 'div.typo p em' (some other elements are encoded the same way, which is not very convenient). It gives the gist of the article and the info about the author (his/her name, field of research, profession etc.). In case the author is a staff writer, the name can be found by the css selector 'div.article__author' (one per page).
  4. The interview structure is part of the text body. The interviewer's words are given in bold.
  5. In big review articles the opinions and comments of the experts are given in the text body as direct speech without specific encoding. Sometimes expert comments are visually and graphically marked (css selector 'div.widget-opiniontext' for the comment itself and 'div.widget-opinionauthor' for the author name). Note: quotations from fiction and references are encoded in the same way.
  6. The news articles have no subtitle, no lead paragraph and no author, but can sometimes contain an expert's comment given as direct speech. Should we consider them relevant for our research?
  7. The user comments are enabled but seem to be of little interest, so we'd better disregard them.
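
As a rough illustration of how the class-based selectors above could be used, here is a minimal sketch built only on Python's standard `html.parser`. The class names are the ones listed in the overview; the `IndicatorFields` class, the `extract_fields` function and the field names are hypothetical, and the sketch assumes that every marked element is a `<div>` with well-nested closing tags.

```python
from html.parser import HTMLParser

# Class names from the overview above mapped to hypothetical field names.
TARGET_CLASSES = {
    "headline__title": "title",
    "article__author": "author",
    "widget-opiniontext": "opinion_text",
    "widget-opinionauthor": "opinion_author",
}


class IndicatorFields(HTMLParser):
    """Collect the text of every <div> that carries one of the target classes."""

    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in TARGET_CLASSES.values()}
        self._stack = []  # one entry per open <div>: a field name or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        field = next((TARGET_CLASSES[c] for c in classes if c in TARGET_CLASSES), None)
        if field is not None:
            self.fields[field].append("")  # start a new text chunk
        self._stack.append(field)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Add the text to the latest chunk of the innermost enclosing target <div>.
        for field in reversed(self._stack):
            if field is not None:
                self.fields[field][-1] += data
                break


def extract_fields(html):
    """Return {field name: list of whitespace-normalized text chunks}."""
    parser = IndicatorFields()
    parser.feed(html)
    parser.close()
    return {k: [" ".join(t.split()) for t in v] for k, v in parser.fields.items()}
```

Because quotations from fiction and references share the opinion markup, the `opinion_text` chunks would still need a separate filtering step before being treated as expert comments.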