Closed wilddeej closed 5 years ago
Thanks. I read quickly the changes you made, and they look nice. The need to comment out two asserts is a bit weird. A few comments:
list()
is not necessary in client_test.py:47-50: the dict value type is already list
.source
(same lines) is a poor variable name (conflicts with pdb
's own source
command).Mort numérique
entry is expected to contain the keywords animals
, vegetable
and mineral
. These keywords belong to the Atom-Powered Robots Run Amok
entry. Same issue as...The conversion of the entry_generator
to a list
(client_test.py:36) is causing the problem. Adding a breakpoint right after it, and typing: [id(x.tags) for x in entries]
returns several times the same id. This shouldn't happen, as all entries should have their own tags
object.
This issue comes from the declaration of the tags
attribute of FeedSporaEntry
(lines 38-41). The dict is instantiated when the class is parsed, and is consequently shared between all its instances. Instead, the tags
attribute should be set to None
in the class declaration, and should be initialized with a brand new dict by adding: fse.tags = dict()
before being accessed (line 311).
Can you test that this modification of FeedSporaEntry
works? For some crazy reason the tests are now working on my machine: I can't reproduce the bug, and cannot check this approach can fix it :-/
Note: The test can be also be modified by skipping the list conversion. Since a FeedSporaEntry
is created upon request and only one of them live in memory at the same time, this would probably work:
def check(client, entry_generator, expected, check_entry):
count = 0
for entry, expect in zip(entry_generator, expected[::-1]):
count += 1
# Check that items in the feed read from disk and the expected
# datastructure are the same
src = expect['source']
assert entry.title == src['title']
assert entry.link == src['link']
# Important to test these in a list context, as ordering matters
assert entry.tags['title'] == src['tags']['title']
assert entry.tags['content'] == src['tags']['content']
assert entry.tags['category'] == src['tags']['category']
client_key = str(type(client)).split('.')[-1][:-2]
# Check that expected values are defined for 'title' and 'link'
expect_key = client_key if client_key in expect else 'source'
assert expect[expect_key]['title'] != ''
assert expect[expect_key]['link'] != ''
# Test the client post return
returned = client.post(entry)
assert returned is not None
check_entry(returned, expect[expect_key])
assert count == len(expected)
I managed to make the tests fail again, the following patch make it work :-)
diff --git a/src/feedspora/feedspora_runner.py b/src/feedspora/feedspora_runner.py
index 9d4f59c..9989094 100644
--- a/src/feedspora/feedspora_runner.py
+++ b/src/feedspora/feedspora_runner.py
@@ -20,9 +20,10 @@ import os
import re
import sqlite3
+import lxml.html
import requests
from bs4 import BeautifulSoup
-import lxml.html
+
# pylint: disable=too-few-public-methods
class FeedSporaEntry:
@@ -35,11 +36,10 @@ class FeedSporaEntry:
link = ''
published_date = None
content = ''
- tags = {'title': [],
- 'content': [],
- 'category': [],
- }
+ tags = None
media_url = None
+
+
# pylint: enable=too-few-public-methods
@@ -241,6 +241,7 @@ class FeedSpora:
'''
title_tags = []
# Add tags from title
+
for word in title.split():
if word.startswith('#') and word[1:] not in title_tags:
title_tags.append(word[1:])
@@ -248,6 +249,7 @@ class FeedSpora:
# Add tags from end of content (removing from content in the
# process of gathering tags)
content_tags = []
+
if content:
# Remove tags to improve processing
content = lxml.html.fromstring(content).text_content().strip()
@@ -264,9 +266,11 @@ class FeedSpora:
tag_pattern = r'^\s*#([\w]+)$'
match_result = re.search(tag_pattern, content)
+
if match_result:
# Left with a single tag!
tag = match_result.group(1)
+
if tag not in content_tags:
content_tags.insert(0, tag)
content = ''
@@ -307,14 +311,17 @@ class FeedSpora:
if fse.content is None:
fse.content = ''
+ fse.tags = dict()
# Tags from title and content, each in their own list
fse.tags['title'], fse.tags['content'] = self.get_tag_lists(
fse.title, fse.content)
# Add tags from category
fse.tags['category'] = []
+
for tag in entry.find_all('category'):
new_tag = tag['term'].replace(' ', '_').strip()
+
if new_tag not in fse.tags['category']:
fse.tags['category'].append(new_tag)
@@ -405,12 +412,14 @@ class FeedSpora:
# PubDate
fse.published_date = entry.find('pubdate').text
+ fse.tags = dict()
# tags from title and content
fse.tags['title'], fse.tags['content'] = self.get_tag_lists(
fse.title, fse.content)
# Add tags from category
fse.tags['category'] = []
+
for tag in entry.find_all('category'):
new_tag = tag.text.replace(' ', '_').strip()
diff --git a/tests/cache.sqlite b/tests/cache.sqlite
index 3c68bbe..31585f8 100644
Binary files a/tests/cache.sqlite and b/tests/cache.sqlite differ
diff --git a/tests/client_test.py b/tests/client_test.py
index e692937..4134b55 100644
--- a/tests/client_test.py
+++ b/tests/client_test.py
@@ -46,8 +46,7 @@ def check(client, entry_generator, expected, check_entry):
# Important to test these in a list context, as ordering matters
assert list(entry.tags['title']) == list(source['tags']['title'])
assert list(entry.tags['content']) == list(source['tags']['content'])
- #!# Why is the right side of this assertion always returning []?
- #!#assert list(entry.tags['category']) == list(source['tags']['category'])
+ assert list(entry.tags['category']) == list(source['tags']['category'])
client_key = str(type(client)).split('.')[-1][:-2]
# Check that expected values are defined for 'title' and 'link'
expect_key = client_key if client_key in expect else 'source'
@@ -212,12 +211,9 @@ def test_ShaarpyClient(entry_generator, expected):
def check_entry(returned, expected):
assert returned['title'].index(expected['title']) > -1
assert returned['link'].index(expected['link']) > -1
- #!# This fails on the second entry posting, apparently
- #!# because returned['tags'] still has the value from the previous
- #!# entry; somehow not getting re-initialized correctly
- #!#assert returned['tags'] == expected['tags']['title'] + \
- #!# expected['tags']['content'] + \
- #!# expected['tags']['category']
+ assert returned['tags'] == expected['tags']['title'] + \
+ expected['tags']['content'] + \
+ expected['tags']['category']
old_init = ShaarpyClient.__init__
ShaarpyClient.__init__ = new_init
Great, thank you for tracking that down. I knew it was most likely an initialization issue of some sort. I've implemented all the functional changes you suggested, and all the tests are now passing, including the restored assertions.
Fixes Issues #40 and #34 Significant re-implementation of tag-related code, starting from keeping tags extracted from feeds into 3 separate lists (title, content, category), with no other processing of these tags from the feed parsers. A filter_tags function was added to GenericClient that builds a list of tags to be filtered starting from user-specified tags (
tags
option, formerlykeywords
), then using newtag_filter_opts
option string (specified via comma-separated values) to include the tags from the feed unless options specified to ignore (title tags included unlessignore_title
filter option was specified, content tags are included unlessignore_content
filter option was specified, and category tags are included unlessignore_category
filter option was specified). Once the filter list is defined, its entries are further filtered to remove any duplicates (perhaps using the filter optioncase-sensitive
in this criterion), and limited by themax_tags
option for the final tag list size. Finally, all clients now call both the GenericClient's set_common_opts function to initialize common options, and use filter_tags to define the set of tags they use in constructing their postings. Consequently, all clients now have all fundamental tag-related functionality implemented uniformly, although not all clients behave identically yet. An additionaltag_opts
test suite was added to verify the new tag filtering/usage options across all clients, as well as acontent_tags
test to further verify correct extraction of tags within feed contents. There were some problematic assertions discovered within theclient_test.py
module, and unfortunately a resolution was not able to be found. These assertions are commented out using#!#
(along with a comment of the outstanding issue) in the hopes that further examination will provide a workable solution.