cantino / ruby-readability

Port of arc90's readability project to Ruby
Apache License 2.0

Readability's title #64

Open tybenz opened 10 years ago

tybenz commented 10 years ago

Readability pulls its article title from the title tag, right? More often than not, the title tag contains a lot of other information besides the title of the article. It usually includes the name of the site itself and sometimes a category.
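To make the problem concrete, here is a minimal sketch of one workaround: splitting the title tag text on common separators and keeping the longest segment. The helper name `guess_article_title`, the constant `TITLE_SEPARATORS`, and the "longest segment wins" rule are all assumptions for illustration; none of them are part of ruby-readability.

```ruby
# Hypothetical helper, not part of ruby-readability.
# Separators often found between article title, category, and site name.
# Whitespace is required on both sides so hyphenated words are left alone.
TITLE_SEPARATORS = /\s+(?:\||-|::)\s+/

def guess_article_title(title_tag_text)
  parts = title_tag_text.strip
                        .split(TITLE_SEPARATORS)
                        .map(&:strip)
                        .reject(&:empty?)
  # Assumption: the article title is the longest segment.
  parts.max_by(&:length) || ""
end

puts guess_article_title("Article title goes here | News | Example Site")
```

This heuristic fails whenever the site name is longer than the article title, which is one reason reading the title from the markup itself is attractive.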

I know the original readability script just grabbed the title, but I'm wondering if this version of the script can be modified to grab the actual title of the article from the markup. It seems as though the scoring system is set up to exclude the header tag that contains the article title.

Example:

<article>
  <div class="article-title">
    <h1>Article title</h1>
  </div>
  <div class="article-content">
    <p>
      Claritatem insitam; est usus legentis in iis qui facit eorum claritatem.
      Investigationes demonstraverunt lectores legere me lius quod ii legunt
      saepius. Claritas est etiam processus dynamicus, qui sequitur mutationem
      consuetudium lectorum. Mirum est notare quam littera gothica, quam nunc
      putamus parum claram, anteposuerit litterarum formas humanitatis per seacula
      quarta decima et quinta decima. Eodem modo typi, qui nunc nobis videntur
      parum clari, fiant sollemnes in futurum.
    </p>
    <p>
      Nunc varius risus quis nulla. Vivamus vel magna. Ut rutrum. Aenean
      dignissim, leo quis faucibus semper, massa est faucibus massa, sit amet
      pharetra arcu nunc et sem. Aliquam tempor. Nam lobortis sem non urna.
      Pellentesque et urna sit amet leo accumsan volutpat. Nam molestie lobortis
      lorem. Quisque eu nulla. Donec id orci in ligula dapibus egestas. Donec sed
      velit ac lectus mattis sagittis.
    </p>
  </div>
</article>

In the above example, readability will always grab the content from .article-content and not the <article> tag itself. What can I do to modify the script to grab the whole article, title and all?

cantino commented 10 years ago

Hey @tybenz! Interesting idea. Do you want to work on a pull request for that?

tybenz commented 10 years ago

Yeah, I'd love to. I don't know enough about the scoring algorithm, though. Do you have any ideas on what a good starting point might be?

cantino commented 10 years ago

No problem. I'd try to write a failing spec, then I'd take a look at score_node, class_weight, and REGEXES and see if something similar could be written to estimate which node is the title.
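As a rough illustration of that suggestion, the sketch below scores heading candidates in the spirit of class_weight and REGEXES. Everything in it is hypothetical: the `Heading` struct, `TITLE_REGEXES`, `TAG_WEIGHTS`, `title_class_weight`, `title_score`, and `best_title` do not exist in ruby-readability, the weights are arbitrary, and plain structs stand in for real Nokogiri nodes so the example runs on its own. It assumes `css_class` holds the class names of the heading and its parent joined together.

```ruby
# Hypothetical sketch: none of these names exist in ruby-readability.
Heading = Struct.new(:tag, :css_class, :text)

TITLE_REGEXES = {
  positive: /title|headline|heading/i,
  negative: /comment|sidebar|related|footer|widget/i
}.freeze

# Arbitrary weights: higher-level headings are more likely to be the title.
TAG_WEIGHTS = { "h1" => 30, "h2" => 15, "h3" => 5 }.freeze

# Adjust the score up or down based on class names.
def title_class_weight(css_class)
  weight = 0
  weight += 25 if css_class =~ TITLE_REGEXES[:positive]
  weight -= 25 if css_class =~ TITLE_REGEXES[:negative]
  weight
end

def title_score(heading, title_tag_text)
  text = heading.text.to_s.strip
  score = TAG_WEIGHTS.fetch(heading.tag, 0)
  score += title_class_weight(heading.css_class.to_s)
  # A heading whose text also appears inside <title> is a strong signal.
  score += 40 if !text.empty? && title_tag_text.downcase.include?(text.downcase)
  score
end

def best_title(headings, title_tag_text)
  headings.max_by { |h| title_score(h, title_tag_text) }
end

headings = [
  Heading.new("h2", "sidebar related", "Related posts"),
  Heading.new("h1", "article-title", "Article title")
]
puts best_title(headings, "Article title | Example Site").text
```

A failing spec could then assert that the h1 inside .article-title is chosen for the example markup above, before any real implementation is written against Nokogiri nodes.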

tybenz commented 10 years ago

Also, I want to get something straight. Is it true that you only ever score p tags, td tags, and their parents and grandparents?

https://github.com/cantino/ruby-readability/blob/master/lib/readability.rb#L270-L271

Am I missing something?

tuzz commented 11 months ago

Sorry to necro this issue. Yes, that's right @tybenz, it only scores <p>, <td> and their parents and grandparents.

Today I opened a pull request to allow you to specify other nodes to score, such as <h1> elements that might be nested inside a <header> element which would not be included in the list of candidates. See https://github.com/cantino/ruby-readability/pull/93
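To show why the title gets dropped in the original example, here is a minimal sketch of candidate selection. It assumes, as described in this thread, that only <p> and <td> elements feed the scoring and that their parents and grandparents become the candidates. The `Node` class, `candidate_labels`, and the `tags_to_score` parameter are hypothetical stand-ins for illustration; they are not the library's code or the option added in the pull request.

```ruby
# Hypothetical tree model; the real library works on Nokogiri nodes.
class Node
  attr_reader :name, :css_class, :children
  attr_accessor :parent

  def initialize(name, css_class = nil, children = [])
    @name = name
    @css_class = css_class
    @children = children
    @parent = nil
    children.each { |child| child.parent = self }
  end

  # All nodes below this one, in document order.
  def descendants
    children.flat_map { |child| [child] + child.descendants }
  end

  def label
    css_class ? "#{name}.#{css_class}" : name
  end
end

# tags_to_score is a hypothetical parameter, not the library's option name.
def candidate_labels(root, tags_to_score = %w[p td])
  root.descendants
      .select { |node| tags_to_score.include?(node.name) }
      .flat_map { |node| [node.parent, node.parent&.parent] }
      .compact
      .uniq
      .map(&:label)
end

article = Node.new("article", nil, [
  Node.new("div", "article-title", [Node.new("h1")]),
  Node.new("div", "article-content", [Node.new("p"), Node.new("p")])
])

p candidate_labels(article)
p candidate_labels(article, %w[p td h1])
```

With the default tags, div.article-title never becomes a candidate because it contains no <p> or <td>. Adding h1 to the scored tags brings it into the candidate list.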