divkakwani / webcorpus

Generate large textual corpora for almost any language by crawling the web

Use metadata derived from the sitemap #9

Open GokulNC opened 5 years ago

GokulNC commented 5 years ago

Is this recommended? If so, should we record the time of article publication or the time of crawl?

The former might be obtained using SitemapNewsStory.publish_date.

The above API also provides fields such as genres, keywords, etc. Will those be useful to us?
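To make the question concrete, here is a minimal sketch of what such metadata looks like when read directly from a Google News style sitemap. It uses only the Python standard library rather than the library API mentioned above, and the sample XML, the `parse_news_sitemap` function, and the record field names are all assumptions made for illustration. It records both the publication time and the crawl time, so either can be dropped later.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespaces used by the sitemap protocol and the Google News sitemap extension.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Hypothetical sample input; a real sitemap would be fetched from the news source.
SAMPLE = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>https://example.com/story-1</loc>
    <news:news>
      <news:publication_date>2020-01-15T08:30:00+05:30</news:publication_date>
      <news:title>Sample headline</news:title>
      <news:keywords>politics, election</news:keywords>
      <news:genres>Blog, Opinion</news:genres>
    </news:news>
  </url>
</urlset>
""".strip()


def _split_csv(text):
    """Split a comma-separated field into a list of trimmed, non-empty values."""
    return [part.strip() for part in (text or "").split(",") if part.strip()]


def parse_news_sitemap(xml_text, crawl_time=None):
    """Return one metadata record per <url> entry in a news sitemap."""
    crawl_time = crawl_time or datetime.now(timezone.utc)
    root = ET.fromstring(xml_text)
    records = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        raw_date = url.findtext("news:news/news:publication_date", namespaces=NS)
        try:
            publish_date = datetime.fromisoformat(raw_date.strip()) if raw_date else None
        except ValueError:
            # Some date formats are not accepted by fromisoformat on older Pythons.
            publish_date = None
        records.append({
            "url": loc,
            "title": url.findtext("news:news/news:title", namespaces=NS),
            "publish_date": publish_date,  # time of article publication, if given
            "crawl_time": crawl_time,      # time this sitemap was processed
            "keywords": _split_csv(url.findtext("news:news/news:keywords", namespaces=NS)),
            "genres": _split_csv(url.findtext("news:news/news:genres", namespaces=NS)),
        })
    return records


records = parse_news_sitemap(SAMPLE)
```

Entries without a `news:news` element simply get `None` or empty lists for the news fields, which matters if many sources do not maintain their sitemaps well.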

divkakwani commented 4 years ago

The time of article publication might not be useful for now, but keeping it won't hurt. We can keep everything (genres, keywords, any timestamps). However, I think most news sources won't have a properly maintained sitemap. A few ways I think this can be useful:

We are also not currently crawling the sitemaps of most sources, but we can change our code to supplement our dataset with the metadata derived from the sitemap.
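One possible shape for that supplementing step, as a rough sketch only: join already-crawled articles with sitemap records by URL, and leave the metadata empty when a source has no usable sitemap entry. The function name `attach_sitemap_metadata`, the `sitemap_meta` key, and the dictionary layouts are hypothetical and not taken from the webcorpus code.

```python
def attach_sitemap_metadata(articles, sitemap_records):
    """Return copies of the articles with sitemap metadata attached where a URL matches."""
    # Index sitemap records by URL for constant-time lookup.
    by_url = {rec["url"]: rec for rec in sitemap_records}
    enriched = []
    for article in articles:
        merged = dict(article)  # shallow copy so the input list is not mutated
        meta = by_url.get(article["url"])
        # Keep everything except the URL itself; None marks "no sitemap entry found".
        merged["sitemap_meta"] = (
            {key: value for key, value in meta.items() if key != "url"} if meta else None
        )
        enriched.append(merged)
    return enriched


# Hypothetical data for illustration.
articles = [
    {"url": "https://example.com/story-1", "body": "text of story 1"},
    {"url": "https://example.com/story-2", "body": "text of story 2"},
]
sitemap_records = [
    {"url": "https://example.com/story-1", "keywords": ["politics"], "genres": ["Blog"]},
]

enriched = attach_sitemap_metadata(articles, sitemap_records)
```

Keeping the metadata under a separate key means articles from sources without a sitemap stay valid records, and the extra fields can be ignored by consumers that do not need them.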