Open GokulNC opened 5 years ago
The time of article publication might not be useful for now, but keeping it won't hurt. We can keep all of it (genres, keywords, whatever timestamps are available). But I think most of the news sources won't have a properly maintained sitemap. A few ways I think this can be useful:
We are also not crawling the sitemaps of most of the sources currently. But we can change our code to supplement our dataset with the metadata derived from the sitemap.
Is it recommended? If so, do we record the time of article publication or the time of crawl? The former might be obtained using `SitemapNewsStory.publish_date`. The above API also provides things like `genres`, `keywords`, etc. Will that be useful to us?
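One possible way to record both timestamps, sketched under some assumptions: the attribute names `publish_date`, `genres`, and `keywords` are the ones mentioned in this thread, while `ArticleMetadata` and `metadata_from_story` are hypothetical names made up for illustration and are not part of any existing code here. The helper only reads attributes off whatever story object it is given, so it also tolerates sources that have no sitemap news entry at all.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, List, Optional


@dataclass
class ArticleMetadata:
    """Hypothetical record stored next to each crawled article."""
    url: str
    crawled_at: datetime                      # always known: set when we crawl
    published_at: Optional[datetime] = None   # only if the sitemap provides it
    genres: List[str] = field(default_factory=list)
    keywords: List[str] = field(default_factory=list)


def metadata_from_story(url: str, story: Any,
                        crawled_at: Optional[datetime] = None) -> ArticleMetadata:
    """Build a metadata record from a sitemap news-story object.

    `story` is expected to expose `publish_date`, `genres` and `keywords`
    (the fields discussed above). Missing attributes, or `story=None`,
    fall back to empty values instead of raising.
    """
    if crawled_at is None:
        crawled_at = datetime.now(timezone.utc)
    return ArticleMetadata(
        url=url,
        crawled_at=crawled_at,
        published_at=getattr(story, "publish_date", None),
        genres=list(getattr(story, "genres", None) or []),
        keywords=list(getattr(story, "keywords", None) or []),
    )
```

Keeping the crawl time as a mandatory field and the publication time as optional means articles from sources without a usable sitemap still get a consistent record.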