Closed (Weyaaron closed this 1 year ago)
I think points 1 and 3 are not controversial at all, so I will start working on them. The others may require some more discussion.
All of these points have been addressed, and the contribution they are based upon is done (#305), so this can be closed.
[x] Sometimes the text says we'll crawl five articles but the code only crawls two
[x] In Tutorial 2 about the Article class, maybe state more clearly your definitions of title, summary, section, subheadline and so on. Makes it easier to follow and to try out the different parts. Also: What is DOM?
[x] Tutorial 3 for filtering the articles, e.g. by author: As much as I like Donald Duck, maybe one could choose a different, more realistic example so users actually get good results instead of waiting, maybe wondering if the system is still running or what 😅 Maybe filtering for a specific publishing time range, language or topic?
[x] More info or examples on how to find the specific URLs for Sitemap, NewsMap and RSSFeed, as well as what their differences and purposes are, and how/why they need to be differentiated. At least for me, this was new and I found myself a bit puzzled at first.
[x] Most importantly, I would have liked more help with the basics of how to use the CSSSelect and XPath selectors, which is obviously the most work when adding a parser. I think that here, a small list of examples with HTML and some fitting example selectors, together with explanations of what they do, would be very helpful!
Originally posted by @susannaruecker in https://github.com/flairNLP/fundus/issues/296#issuecomment-1645568297
Edit: I cut the original text down to bullet points.
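To illustrate the kind of selector example the last bullet asks for, here is a minimal sketch. It is not taken from the fundus tutorials: the markup, the class names and the `extract` helper are all invented for illustration. It uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset and no CSS selectors, so the CSS equivalents appear in comments only.

```python
import xml.etree.ElementTree as ET

# Hypothetical article markup, invented for this illustration only.
# ElementTree requires well-formed XML; real-world HTML usually needs
# a more forgiving parser.
HTML = """
<html>
  <body>
    <article>
      <h1 class="headline">Example title</h1>
      <p class="summary">Short teaser.</p>
      <div class="author">Jane Doe</div>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </article>
  </body>
</html>
"""


def extract(html: str) -> dict:
    """Pull a few fields out of the sample markup (hypothetical helper)."""
    root = ET.fromstring(html)

    # XPath: any <h1> whose class attribute is exactly "headline".
    # Rough CSS equivalent: h1.headline
    title = root.find(".//h1[@class='headline']").text

    # XPath: any <div> whose class attribute is exactly "author".
    # Rough CSS equivalent: div.author
    author = root.find(".//div[@class='author']").text

    # XPath: every <p> that is a direct child of an <article>.
    # Rough CSS equivalent: article > p
    paragraphs = [p.text for p in root.findall(".//article/p")]

    return {"title": title, "author": author, "paragraphs": paragraphs}


result = extract(HTML)
print(result["title"], "|", result["author"], "|", len(result["paragraphs"]))
```

Note that `[@class='headline']` compares the whole attribute value, whereas the CSS form `.headline` also matches elements carrying several classes, so the two are only roughly equivalent.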