Use metadata derived from the sitemap

divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web


Use metadata derived from the sitemap #9

Open GokulNC opened 5 years ago

GokulNC commented 5 years ago

Is this recommended? If so, should we record the time of article publication or the time of crawl?

The former might be obtained using SitemapNewsStory.publish_date.

The above API also provides fields such as genres and keywords. Would those be useful to us?
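To make the question concrete, here is a minimal sketch of pulling these fields straight out of a news sitemap using only the standard library. It assumes the source publishes a sitemap in the Google News sitemap format; the sample XML, the `example.com` URL, and the helper names (`parse_news_sitemap`, `split_list`) are made up for illustration and are not part of webcorpus. It keeps both the publication time and the crawl time, so the choice can be made later.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# XML namespaces used by the sitemap protocol and the Google News extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Hypothetical sample; a real sitemap would be downloaded from the news source.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/sports/match-report</loc>
    <news:news>
      <news:publication_date>2020-01-15T08:30:00+05:30</news:publication_date>
      <news:title>Match report</news:title>
      <news:keywords>cricket, sports</news:keywords>
    </news:news>
  </url>
</urlset>"""


def split_list(text):
    """Split a comma-separated field into a clean list (empty if missing)."""
    return [item.strip() for item in (text or "").split(",") if item.strip()]


def parse_news_sitemap(xml_text):
    """Return one metadata record per <url> entry that carries news metadata."""
    crawl_time = datetime.now(timezone.utc)
    records = []
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        news = url.find("news:news", NS)
        if news is None:
            continue  # plain sitemap entry without news metadata
        published = news.findtext("news:publication_date", namespaces=NS)
        records.append({
            "url": url.findtext("sm:loc", default="", namespaces=NS).strip(),
            # Note: fromisoformat rejects some valid W3C variants on older
            # Python versions (for example a trailing "Z" before 3.11).
            "publish_date": datetime.fromisoformat(published.strip()) if published else None,
            "crawl_time": crawl_time,
            "keywords": split_list(news.findtext("news:keywords", namespaces=NS)),
            "genres": split_list(news.findtext("news:genres", namespaces=NS)),
        })
    return records


records = parse_news_sitemap(SAMPLE_SITEMAP)
```

Entries without a `news:news` element are skipped here; whether to keep them with empty metadata instead is a design choice.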

divkakwani commented 4 years ago

The time of article publication might not be useful right now, but keeping it won't hurt. We can keep all of it (genres, keywords, whatever timestamps are available). That said, I think most news sources won't have a properly maintained sitemap. A few ways this could be useful:

Create more fine-grained classification datasets. For example, we currently have categories like sports. If we can get keywords like cricket, football, and tennis, we can make the task more fine-grained and, consequently, more difficult.
Article publication time could be used for trend analysis, which would help people who want to do that kind of work.

We are also not currently crawling the sitemaps of most sources, but we can change our code to supplement our dataset with the metadata derived from the sitemap.
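One possible shape for that supplementing step, as a rough sketch rather than the actual webcorpus code: join already-crawled articles with sitemap entries on the URL, and leave the metadata empty when a source has no usable sitemap. The record layout (`url`, `text`, `publish_date`, `keywords`, `genres`) and the function name `supplement_with_sitemap_metadata` are assumptions made for this example.

```python
def supplement_with_sitemap_metadata(articles, sitemap_entries):
    """Attach sitemap metadata to crawled articles, matching on URL.

    Articles with no matching sitemap entry are kept, with empty metadata.
    """
    by_url = {entry["url"]: entry for entry in sitemap_entries}
    enriched = []
    for article in articles:
        meta = by_url.get(article["url"], {})
        merged = dict(article)  # copy so the input records stay untouched
        merged["publish_date"] = meta.get("publish_date")
        merged["keywords"] = meta.get("keywords", [])
        merged["genres"] = meta.get("genres", [])
        enriched.append(merged)
    return enriched


# Hypothetical data: one article covered by the sitemap, one that is not.
articles = [
    {"url": "https://example.com/a", "text": "First article"},
    {"url": "https://example.com/b", "text": "Second article"},
]
sitemap_entries = [
    {"url": "https://example.com/a", "publish_date": "2020-01-15",
     "keywords": ["cricket"], "genres": ["Blog"]},
]
enriched = supplement_with_sitemap_metadata(articles, sitemap_entries)
```

Matching on the exact URL string is the simplest option; in practice the URLs may need normalizing (trailing slashes, tracking parameters) before the lookup.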