DistrictDataLabs / baleen

An automated ingestion service for blogs to construct a corpus for NLP research.
MIT License

Add load from csv #90

Open janetriley opened 7 years ago

janetriley commented 7 years ago

console/commands/load can handle OPML files.

I don't have OPML, and couldn't easily find an OPML editor. CSV is easy to compose, however.

Add support for loading feeds from CSV.

bbengfort commented 7 years ago

Great idea! As for an OPML editor, we actually used Feedly, which has an export-to-OPML feature. Still, CSV support would be a great feature to add!

janetriley commented 7 years ago

It looks like the required fields to create a Feed are link and category, with an optional title.

Is that right?

Here's my understanding of a Feed:

from baleen.models:

class Feed(me.DynamicDocument):
    # my (optional) title for this feed
    title = me.StringField(max_length=256)

    # the link to get the RSS feed. FeedParser may update it during sync if it sees a different href.
    link = me.URLField(required=True, unique=True)

    # A dict of xmlURL, which is the link above, and an htmlURL, which is ...? the human-friendly version of the site?
    urls = me.DictField()

    # my name for the collection of documents - like a corpus name. One category per feed.
    category = me.StringField(required=True)

    # for Baleen - guessing the Job ignores inactive feeds
    active = me.BooleanField(default=True)

    # fields that the FeedParser package modifies
    version = me.StringField(choices=FEEDTYPES)
    etag = me.StringField()
    modified = me.StringField()
    fetched = me.DateTimeField(default=None)
    signature = me.StringField(max_length=64, min_length=64, unique=False)

    created = me.DateTimeField(default=datetime.now, required=True)
    updated = me.DateTimeField(default=datetime.now, required=True)

Am I heading in the right direction? This is simpler than I was expecting.

bbengfort commented 7 years ago

Yep, that's pretty much correct - the OPML file doesn't contain much information - title and link are by far the most important, with category and active being of secondary importance.
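
For illustration, here is a minimal sketch of what reading feeds from CSV might look like, based on the fields discussed above (link and category required, title optional). The function name `read_feeds_csv`, the `default_category` parameter, and the column names `link`, `category`, and `title` are all assumptions made for this example; they are not part of baleen. The sketch only parses and validates rows with the standard library, so it runs without MongoDB.

```python
import csv
import io


def read_feeds_csv(handle, default_category=None):
    """Yield one dict per CSV row, keyed like the Feed fields discussed above.

    Hypothetical helper: the column names (link, category, title) are an
    assumed CSV layout, not something baleen defines.
    """
    reader = csv.DictReader(handle)
    for row in reader:
        # Normalize header case and strip whitespace; DictReader puts
        # surplus cells under the key None, so skip that key.
        row = {
            key.strip().lower(): (value or "").strip()
            for key, value in row.items()
            if key is not None
        }

        link = row.get("link", "")
        category = row.get("category", "") or (default_category or "")

        # link and category are the required Feed fields
        if not link or not category:
            raise ValueError(
                "line %d: both link and category are required" % reader.line_num
            )

        feed = {"link": link, "category": category}
        if row.get("title"):
            feed["title"] = row["title"]  # title is optional
        yield feed


if __name__ == "__main__":
    sample = io.StringIO(
        "link,category,title\n"
        "https://example.com/feed.xml,news,Example Blog\n"
        "https://example.org/rss,tech,\n"
    )
    for feed in read_feeds_csv(sample):
        print(feed)
```

Each yielded dict could then presumably be passed to the model as keyword arguments, along the lines of `Feed(**feed).save()`, though how the existing load command would actually wire this in is not shown in this thread.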