Preserve case of tags - Githubissues

Fixes Issues #40 and #34 Significant re-implementation of tag-related code, starting from keeping tags extracted from feeds into 3 separate lists (title, content, category), with no other processing of these tags from the feed parsers. A filter_tags function was added to GenericClient that builds a list of tags to be filtered starting from user-specified tags (tags option, formerly keywords), then using new tag_filter_opts option string (specified via comma-separated values) to include the tags from the feed unless options specified to ignore (title tags included unless ignore_title filter option was specified, content tags are included unless ignore_content filter option was specified, and category tags are included unless ignore_category filter option was specified). Once the filter list is defined, its entries are further filtered to remove any duplicates (perhaps using the filter option case-sensitive in this criterion), and limited by the max_tags option for the final tag list size. Finally, all clients now call both the GenericClient's set_common_opts function to initialize common options, and use filter_tags to define the set of tags they use in constructing their postings. Consequently, all clients now have all fundamental tag-related functionality implemented uniformly, although not all clients behave identically yet. An additional tag_opts test suite was added to verify the new tag filtering/usage options across all clients, as well as a content_tags test to further verify correct extraction of tags within feed contents. There were some problematic assertions discovered within the client_test.py module, and unfortunately a resolution was not able to be found. These assertions are commented out using #!# (along with a comment of the outstanding issue) in the hopes that further examination will provide a workable solution.

Thanks. I read quickly the changes you made, and they look nice. The need to comment out two asserts is a bit weird. A few comments:

list() is not necessary in client_test.py:47-50: the dict value type is already list.
source (same lines) is a poor variable name (conflicts with pdb's own source command).
The first problematic assert (client_test.py:50) works well for the first call. The second call doesn't work as the Mort numérique entry is expected to contain the keywords animals, vegetable and mineral. These keywords belong to the Atom-Powered Robots Run Amok entry. Same issue as...
... the second problematic assert (client_test.py:215)

The conversion of the entry_generator to a list(client_test.py:36) is causing the problem. Adding a breakpoint right after it, and typing: [id(x.tags) for x in entries] returns several times the same id. This shouldn't happen, as all entries should have their own tags object.

This issue comes from the declaration of the tags attribute of FeedSporaEntry (lines 38-41). The dict is instantiated when the class is parsed, and is consequently shared between all its instances. Instead, the tags attribute should be set to None in the class declaration, and should be initialized with a brand new dict by adding: fse.tags = dict() before being accessed (line 311).

Can you test that this modification of FeedSporaEntry works? For some crazy reason the tests are now working on my machine: I can't reproduce the bug, and cannot check this approach can fix it :-/

Note: The test can be also be modified by skipping the list conversion. Since a FeedSporaEntry is created upon request and only one of them live in memory at the same time, this would probably work:

def check(client, entry_generator, expected, check_entry):

    count = 0

    for entry, expect in zip(entry_generator, expected[::-1]):
        count += 1
        # Check that items in the feed read from disk and the expected
        # datastructure are the same
        src = expect['source']
        assert entry.title == src['title']
        assert entry.link == src['link']
        # Important to test these in a list context, as ordering matters
        assert entry.tags['title'] == src['tags']['title']
        assert entry.tags['content'] == src['tags']['content']
        assert entry.tags['category'] == src['tags']['category']
        client_key = str(type(client)).split('.')[-1][:-2]
        # Check that expected values are defined for 'title' and 'link'
        expect_key = client_key if client_key in expect else 'source'
        assert expect[expect_key]['title'] != ''
        assert expect[expect_key]['link'] != ''
        # Test the client post return
        returned = client.post(entry)
        assert returned is not None
        check_entry(returned, expect[expect_key])

    assert count == len(expected)

I managed to make the tests fail again, the following patch make it work :-)

diff --git a/src/feedspora/feedspora_runner.py b/src/feedspora/feedspora_runner.py
index 9d4f59c..9989094 100644
--- a/src/feedspora/feedspora_runner.py
+++ b/src/feedspora/feedspora_runner.py
@@ -20,9 +20,10 @@ import os
 import re
 import sqlite3

+import lxml.html
 import requests
 from bs4 import BeautifulSoup
-import lxml.html
+

 # pylint: disable=too-few-public-methods
 class FeedSporaEntry:
@@ -35,11 +36,10 @@ class FeedSporaEntry:
     link = ''
     published_date = None
     content = ''
-    tags = {'title': [],
-            'content': [],
-            'category': [],
-            }
+    tags = None
     media_url = None
+
+
 # pylint: enable=too-few-public-methods

@@ -241,6 +241,7 @@ class FeedSpora:
         '''
         title_tags = []
         # Add tags from title
+
         for word in title.split():
             if word.startswith('#') and word[1:] not in title_tags:
                 title_tags.append(word[1:])
@@ -248,6 +249,7 @@ class FeedSpora:
         # Add tags from end of content (removing from content in the
         # process of gathering tags)
         content_tags = []
+
         if content:
             # Remove tags to improve processing
             content = lxml.html.fromstring(content).text_content().strip()
@@ -264,9 +266,11 @@ class FeedSpora:

             tag_pattern = r'^\s*#([\w]+)$'
             match_result = re.search(tag_pattern, content)
+
             if match_result:
                 # Left with a single tag!
                 tag = match_result.group(1)
+
                 if tag not in content_tags:
                     content_tags.insert(0, tag)
                 content = ''
@@ -307,14 +311,17 @@ class FeedSpora:
             if fse.content is None:
                 fse.content = ''

+            fse.tags = dict()
             # Tags from title and content, each in their own list
             fse.tags['title'], fse.tags['content'] = self.get_tag_lists(
                 fse.title, fse.content)

             # Add tags from category
             fse.tags['category'] = []
+
             for tag in entry.find_all('category'):
                 new_tag = tag['term'].replace(' ', '_').strip()
+
                 if new_tag not in fse.tags['category']:
                     fse.tags['category'].append(new_tag)

@@ -405,12 +412,14 @@ class FeedSpora:
             # PubDate
             fse.published_date = entry.find('pubdate').text

+            fse.tags = dict()
             # tags from title and content
             fse.tags['title'], fse.tags['content'] = self.get_tag_lists(
                 fse.title, fse.content)

             # Add tags from category
             fse.tags['category'] = []
+
             for tag in entry.find_all('category'):
                 new_tag = tag.text.replace(' ', '_').strip()

diff --git a/tests/cache.sqlite b/tests/cache.sqlite
index 3c68bbe..31585f8 100644
Binary files a/tests/cache.sqlite and b/tests/cache.sqlite differ
diff --git a/tests/client_test.py b/tests/client_test.py
index e692937..4134b55 100644
--- a/tests/client_test.py
+++ b/tests/client_test.py
@@ -46,8 +46,7 @@ def check(client, entry_generator, expected, check_entry):
         # Important to test these in a list context, as ordering matters
         assert list(entry.tags['title']) == list(source['tags']['title'])
         assert list(entry.tags['content']) == list(source['tags']['content'])
-        #!# Why is the right side of this assertion always returning []?
-        #!#assert list(entry.tags['category']) == list(source['tags']['category'])
+        assert list(entry.tags['category']) == list(source['tags']['category'])
         client_key = str(type(client)).split('.')[-1][:-2]
         # Check that expected values are defined for 'title' and 'link'
         expect_key = client_key if client_key in expect else 'source'
@@ -212,12 +211,9 @@ def test_ShaarpyClient(entry_generator, expected):
     def check_entry(returned, expected):
         assert returned['title'].index(expected['title']) > -1
         assert returned['link'].index(expected['link']) > -1
-        #!# This fails on the second entry posting, apparently
-        #!# because returned['tags'] still has the value from the previous
-        #!# entry; somehow not getting re-initialized correctly
-        #!#assert returned['tags'] == expected['tags']['title'] + \
-        #!#                           expected['tags']['content'] + \
-        #!#                           expected['tags']['category']
+        assert returned['tags'] == expected['tags']['title'] + \
+                                   expected['tags']['content'] + \
+                                   expected['tags']['category']

     old_init = ShaarpyClient.__init__
     ShaarpyClient.__init__ = new_init

Great, thank you for tracking that down. I knew it was most likely an initialization issue of some sort. I've implemented all the functional changes you suggested, and all the tests are now passing, including the restored assertions.

aurelg / feedspora

Preserve case of tags #46