**Mandatory**

* [x] I read the documentation (readme and wiki).
* [x] I searched other issues (including closed issues) and could not find any related ones. If you find related issues post them below or directly add your issue to the most related one.

Related issues:

**Describe the bug**
Adding a pipeline for PostgreSQL in config.cfg produces the error "psycopg2.ProgrammingError: no results to fetch" when the crawler runs. sitelist.hjson has remained unchanged. Running SELECT statements on both tables in psql shows that no data has been added to the database.

config.cfg


# IMPORTANT
# All variables get parsed to the correct python-types (if not other declared)!
# So bools have to be True or False (uppercase-first),
# Floats need dots . (not comma)
# Ints are just normal ints
# dicts need to be like this { key: value }
# arrays need to be like this [ value1, value2, value3 ]
# All values in dicts and arrays will also be parsed.
# Everything that does not match any of the above criteria will be parsed as string.
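
The parsing rules in the comment block above can be illustrated with a small sketch. It assumes the values are read with `configparser` and converted with `ast.literal_eval`, which is only one plausible way to get this behaviour and not necessarily how news-please does it; `SAMPLE` and `parse_value` are hypothetical names.

```python
import ast
import configparser

# Hypothetical excerpt using option names that appear in this config.cfg.
SAMPLE = """
[Crawler]
default = SitemapCrawler
number_of_parallel_crawlers = 5
sitemap_allow_subdomains = True
ignore_regex = ""

[Postgresql]
host = localhost
port = 5432
database = 'news-please'
"""


def parse_value(raw):
    """Return a Python literal if the text is one, otherwise the raw string."""
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a literal (e.g. SitemapCrawler or localhost): keep it as a string.
        return raw


parser = configparser.RawConfigParser()  # Raw: '%' in values is not interpolated
parser.optionxform = str                 # keep option names case-sensitive
parser.read_string(SAMPLE)

config = {
    section: {key: parse_value(value) for key, value in parser.items(section)}
    for section in parser.sections()
}
```

Under these assumptions `port` becomes an int, `sitemap_allow_subdomains` a bool, quoted values become plain strings, and unquoted words such as `localhost` stay strings.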

[Crawler]

GENERAL

-------

Crawling heuristics

Default Crawlers:

Possibilities: RecursiveCrawler, RecursiveSitemapCrawler, RssCrawler, SitemapCrawler, Download (./newsplease/crawler/spiders/-dir)

default: SitemapCrawler

default = SitemapCrawler

default:

fallbacks = {

"RssCrawler": None,

"RecursiveSitemapCrawler": "RecursiveCrawler",

"SitemapCrawler": "RecursiveCrawler",

"RecursiveCrawler": None,

"Download": None

}

fallbacks = { "RssCrawler": None, "RecursiveSitemapCrawler": "RecursiveCrawler", "SitemapCrawler": "RecursiveCrawler", "RecursiveCrawler": None, "Download": None }

Determines how many hours need to pass since the last download of a webpage

to be downloaded again by the RssCrawler

default: 6

hours_to_pass_for_redownload_by_rss_crawler = 6

PROCESSES

---------

Number of crawlers, that should crawl parallel

not counting in daemonized crawlers

default: 5

number_of_parallel_crawlers = 5

Number of daemons, will be added to daemons.

default: 10

number_of_parallel_daemons = 10

SPECIAL CASES

-------------

urls which end on any of the following file extensions are ignored for recursive crawling

default: "(pdf)|(docx?)|(xlsx?)|(pptx?)|(epub)|(jpe?g)|(png)|(bmp)|(gif)|(tiff)|(webp)|(avi)|(mpe?g)|(mov)|(qt)|(webm)|(ogg)|(midi)|(mid)|(mp3)|(wav)|(zip)|(rar)|(exe)|(apk)|(css)"

ignore_file_extensions = "(pdf)|(docx?)|(xlsx?)|(pptx?)|(epub)|(jpe?g)|(png)|(bmp)|(gif)|(tiff)|(webp)|(avi)|(mpe?g)|(mov)|(qt)|(webm)|(ogg)|(midi)|(mid)|(mp3)|(wav)|(zip)|(rar)|(exe)|(apk)|(css)"

urls which match the following regex are ignored for recursive crawling

default: ""

ignore_regex = ""

Crawl the sitemaps of subdomains (if sitemap is enabled)

If True, any SitemapCrawler will try to crawl on the sitemap of the given domain including subdomains instead of a domain's main sitemap.

e.g. if True, a SitemapCrawler to be started on https://blog.zeit.de will try to crawl on the sitemap listed in http://blog.zeit.de/robots.txt. If not found, it will fall back to the False setting.

if False, a SitemapCrawler to be started on https://blog.zeit.de will try to crawl on the sitemap listed in http://zeit.de/robots.txt

default: True

sitemap_allow_subdomains = True

[Heuristics]

Enabled heuristics,

Currently:

- og_type

- linked_headlines

- self_linked_headlines

- is_not_from_subdomain (with this setting enabled, it can be assured that only pages that aren't from a subdomain are downloaded)

- meta_contains_article_keyword

- crawler_contains_only_article_alikes

(maybe not up-to-date, see ./newsplease/helper_classes/heuristics.py:

Every method not starting with __ should be a heuristic, except is_article)

These heuristics can be overwritten by sitelist.json for each site

default: {"og_type": True, "linked_headlines": "<=0.65", "self_linked_headlines": "<=0.56"}

enabled_heuristics = {"og_type": True, "linked_headlines": "<=0.65", "self_linked_headlines": "<=0.56"}

Heuristics can be combined with others

The heuristics need to have the same name as in enabled_heuristics

Possible condition-characters / literals are: (, ), not, and, or

All heuristics used here need to be enabled in enabled_heuristics as well!

Examples:

"og_type and (self_linked_headlines or linked_headlines)"

"og_type"

default: "og_type and (linked_headlines or self_linked_headlines)"

pass_heuristics_condition = "og_type and (linked_headlines or self_linked_headlines)"
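
To make the condition syntax concrete, here is a minimal sketch of how such a string could be evaluated once each enabled heuristic has produced a boolean result. The function `passes` and the `results` dict are hypothetical and are not taken from the news-please code base.

```python
import re

ALLOWED_WORDS = {"(", ")", "and", "or", "not"}


def passes(condition, results):
    """Evaluate a heuristics condition against per-heuristic boolean results."""
    tokens = re.findall(r"\(|\)|\w+", condition)
    translated = []
    for token in tokens:
        if token in ALLOWED_WORDS:
            translated.append(token)
        elif token in results:
            # Replace the heuristic name with its boolean outcome.
            translated.append(str(bool(results[token])))
        else:
            raise ValueError("heuristic not enabled: " + token)
    # Only whitelisted tokens reach eval, and builtins are disabled.
    return eval(" ".join(translated), {"__builtins__": {}}, {})


results = {"og_type": True, "linked_headlines": False, "self_linked_headlines": True}
default_condition = "og_type and (linked_headlines or self_linked_headlines)"
outcome = passes(default_condition, results)
```

With the sample results, the default condition passes because `og_type` and one of the two headline heuristics are true. A name that is missing from `results` raises an error, mirroring the rule that every heuristic used in the condition must also be enabled.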

The maximum ratio of headlines divided by linked_headlines in a file

The minimum number of headlines in a file to check for the ratio

If fewer than this number are in the file, the file will pass the test.

default: 5

min_headlines_for_linked_test = 5

[Files]

GENERAL:

-------

Paths:

toggles relative paths to be relative to the start_processes.py script (True) or relative to this config file (False)

This does not work for this config's 'Scrapy' section which is always relative to the dir the start_processes.py script is called from

Default: True

relative_to_start_processes_file = True

INPUT:

-----

Here you can specify the input JSON-Filename

default: sitelist.hjson

url_input_file_name = sitelist.hjson

OUTPUT:

------

Toggles whether leading './' or '.\' from above local_data_directory should be removed when saving the path into the Database

True: ./data would become data

default: True

working_path = ~/news-please-repo/

Following Strings in the local_data_directory will be replaced: (md5 hashes have a standard length of 32 chars)

%working_path = the path specified in OUTPUT["working_path"]

`%time_download(<code>)` = current time at download; will be replaced with `strftime(<code>)` where `<code>` is a string, explained further here: http://strftime.org/

`%time_execution(<code>)` = current time at execution; will be replaced with `strftime(<code>)` where `<code>` is a string, explained further here: http://strftime.org/

`%timestamp_download` = current time at download; unix-timestamp
`%timestamp_execution` = current time at execution; unix-timestamp
`%domain(<size>)` = first `<size>` chars of the domain of the crawled file (e.g. zeit.de)
`%appendmd5_domain(<size>)` = appends the md5 to `%domain(<<size> - 32 (md5 length) - 1 (_ as separator)>)` if domain is longer than `<size>`
`%md5_domain(<size>)` = first `<size>` chars of md5 hash of %domain
`%full_domain(<size>)` = first `<size>` chars of the domain including subdomains (e.g. panamapapers.sueddeutsche.de)
`%appendmd5_full_domain(<size>)` = appends the md5 to `%full_domain(<<size> - 32 (md5 length) - 1 (_ as separator)>)` if full_domain is longer than `<size>`
`%md5_full_domain(<size>)` = first `<size>` chars of md5 hash of %full_domain
`%subdomains(<size>)` = first `<size>` chars of the domain's subdomains
`%appendmd5_subdomains(<size>)` = appends the md5 to `%subdomains(<<size> - 32 (md5 length) - 1 (_ as separator)>)` if subdomains is longer than `<size>`
`%md5_subdomains(<size>)` = first `<size>` chars of md5 hash of %subdomains
`%url_directory_string(<size>)` = first `<size>` chars of the directories on the server (e.g. http://panamapapers.sueddeutsche.de/articles/56f2c00da1bb8d3c3495aa0a/ would evaluate to articles_56f2c00da1bb8d3c3495aa0a), no filename
`%appendmd5_url_directory_string(<size>)` = appends the md5 to `%url_directory_string(<<size> - 32 (md5 length) - 1 (_ as separator)>)` if url_directory_string is longer than `<size>`
`%md5_url_directory_string(<size>)` = first `<size>` chars of md5 hash of `%url_directory_string(<size>)`
`%url_file_name(<size>)` = first `<size>` chars of the file name (without type) on the server (e.g. http://www.spiegel.de/wirtschaft/soziales/ttip-dokumente-leak-koennte-ende-der-geheimhaltung-markieren-a-1090466.html would evaluate to ttip-dokumente-leak-koennte-ende-der-geheimhaltung-markieren-a-1090466). No filenames (indexes) will evaluate to index
`%md5_url_file_name(<size>)` = first `<size>` chars of md5 hash of %url_file_name
`%max_url_file_name` = first x chars of %url_file_name, so the entire savepath has a length of the max possible length for a windows file system (260 characters - 1 )
`%appendmd5_max_url_filename` = appends the md5 to the first x - 32 (md5 length) - 1 (_ as separator) chars of %url_file_name if the entire savepath has a length longer than the max possible length for a windows file system (260 characters - 1 )
#
This path can be relative or absolute, though to be able to easily merge multiple data sets, it should be kept relative and consistent on all datasets.
To be able to use cleanup commands, it should also start with a static folder name like 'data'.
#
default: %working_path/data/%time_execution(%Y)/%time_execution(%m)/%time_execution(%d)/%appendmd5_full_domain(32)/%appendmd5_url_directorystring(60)%appendmd5_max_url_filename%timestamp_download.html
local_data_directory = %working_path/data/%time_execution(%Y)/%time_execution(%m)/%time_execution(%d)/%appendmd5_full_domain(32)/%appendmd5_url_directorystring(60)%appendmd5_max_url_filename%timestamp_download.html
Toggles whether leading './' or '.\' from above local_data_directory should be removed when saving the path into the Database
True: ./data would become data
default: True
format_relative_path = True
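
As an illustration of the time placeholders used in `local_data_directory`, the sketch below expands `%time_execution(...)` and `%time_download(...)` by passing the text in parentheses to `strftime`. The function name `expand_time_placeholders` and the shortened template are hypothetical; the other placeholders (domain, md5, file name) are not handled here.

```python
import re
from datetime import datetime


def expand_time_placeholders(template, executed_at, downloaded_at):
    """Replace %time_execution(fmt) and %time_download(fmt) via strftime."""
    expanded = re.sub(
        r"%time_execution\(([^)]*)\)",
        lambda match: executed_at.strftime(match.group(1)),
        template,
    )
    expanded = re.sub(
        r"%time_download\(([^)]*)\)",
        lambda match: downloaded_at.strftime(match.group(1)),
        expanded,
    )
    return expanded


# Shortened, hypothetical template; the timestamp is taken from the log in this issue.
moment = datetime(2020, 12, 13, 12, 25, 47)
path = expand_time_placeholders(
    "data/%time_execution(%Y)/%time_execution(%m)/%time_execution(%d)/page.html",
    executed_at=moment,
    downloaded_at=moment,
)
```

For the timestamp above this yields `data/2020/12/13/page.html`, which matches the date-based directory layout seen in the saved file paths in the log.
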
[MySQL]
MySQL-Connection required for saving meta-informations
host = localhost
port = 3306
db = ''
username = ''
password = ''
[Postgresql]
Postgresql-Connection required for saving meta-informations
host = localhost
port = 5432
database = 'news-please'
user = 'news-please'
password = 'password'
[Elasticsearch]
Elasticsearch-Connection required for saving detailed meta-information
host = localhost
port = 9200
index_current = 'news-please'
index_archive = 'news-please-archive'
Elasticsearch supports user authentication by CA certificates. If your database is protected by certificate
fill in the following parameters, otherwise you can ignore them.
use_ca_certificates = False
ca_cert_path = /path/to/cacert.pem
client_cert_path = /path/to/client_cert.pem
client_key_path = /path/to/client_key.pem
username = 'root'
secret = 'password'
Properties of the document type used for storage.
mapping = {"properties": {
"url": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"source_domain": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"title_page": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"title_rss": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"localpath": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"filename": {"type": "keyword"},
"ancestor": {"type": "keyword"},
"descendant": {"type": "keyword"},
"version": {"type": "long"},
"date_download": {"type": "date", "format":"yyyy-MM-dd HH:mm:ss"},
"date_modify": {"type": "date", "format":"yyyy-MM-dd HH:mm:ss"},
"date_publish": {"type": "date", "format":"yyyy-MM-dd HH:mm:ss"},
"title": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"description":  {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"text": {"type": "text"},
"authors": {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"image_url":  {"type": "text","fields":{"keyword":{"type":"keyword"}}},
"language": {"type": "keyword"}
}}
[ArticleMasterExtractor]
Choose which extractors you want to use.
#
The Default is ['newspaper_extractor', 'readability_extractor', 'date_extractor', 'lang_detect_extractor'],
which are all integrated extractors right now.
Possible extractors are 'newspaper_extractor', 'readability_extractor', 'date_extractor' and 'lang_detect_extractor'
Examples: - Only Newspaper and date_extractor: extractors = ['newspaper', 'date_extractor']
- Only Newspaper: extractors = ['newspaper']
extractors = ['newspaper_extractor', 'readability_extractor', 'date_extractor', 'lang_detect_extractor']
[DateFilter]
If added to the pipeline, this module provides the means to filter the extracted articles based on the publishing date.
Therefore this module has to be placed after the KM4 article extractor to access the publishing dates.
#
All articles, with a publishing date outside of the given time interval are dropped. The dates used to specify the
time interval are included and should follow this format: 'yyyy-mm-dd hh:mm:ss'.
#
It is also possible to only define one date, assigning the other variable the value 'None' to create an half-bounded
interval.
start_date = '1999-01-01 00:00:00'
end_date = '2999-12-31 00:00:00'
If 'True' articles without a publishing date are dropped.
strict_mode = False
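
The filter rules described in this section can be sketched as follows. This is an illustrative reimplementation under the stated rules (inclusive bounds, `None` for an open end, `strict_mode` for missing dates), not the actual DateFilter code; `keep_article` is a hypothetical name.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # 'yyyy-mm-dd hh:mm:ss'


def keep_article(publish_date, start_date=None, end_date=None, strict_mode=False):
    """Return True if the article should stay in the pipeline."""
    if publish_date is None:
        # Articles without a date are dropped only in strict mode.
        return not strict_mode
    published = datetime.strptime(publish_date, DATE_FORMAT)
    if start_date is not None and published < datetime.strptime(start_date, DATE_FORMAT):
        return False
    if end_date is not None and published > datetime.strptime(end_date, DATE_FORMAT):
        return False
    return True


# Values taken from this config and from the second item in the log.
kept = keep_article("2017-01-13 17:35:31", "1999-01-01 00:00:00", "2999-12-31 00:00:00")
undated_kept = keep_article(None, "1999-01-01 00:00:00", "2999-12-31 00:00:00")
```

With the interval configured here and `strict_mode = False`, both the dated and the undated article from the log would be kept, so the DateFilter settings alone would not explain the missing rows.
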
[Scrapy]
Possible levels (must be UC-only): CRITICAL, ERROR, WARNING, INFO, DEBUG
default: WARNING
LOG_LEVEL = INFO
logformat, see https://docs.python.org/2/library/logging.html#logrecord-attributes
default: [%(name)s:%(lineno)d|%(levelname)s] %(message)s
LOG_FORMAT = [%(name)s:%(lineno)d|%(levelname)s] %(message)s
Can be a filename or None
default: None
LOG_FILE = None
LOG_DATEFORMAT = %Y-%m-%d %H:%M:%S
LOG_STDOUT = False
LOG_ENCODING = utf-8
BOT_NAME = 'news-please'
SPIDER_MODULES = ['newsplease.crawler.spiders']
NEWSPIDER_MODULE = 'newsplease.crawler.spiders'
Resume/Pause functionality activation
default: .resume_jobdir
JOBDIRNAME = .resume_jobdir
Respect robots.txt activation
default: True
ROBOTSTXT_OBEY=True
Maximum number of concurrent requests across all domains
default: 16
IMPORTANT: This setting does not work since each crawler has its own scrapy instance, but it might limit the concurrent_requests_per_domain if said setting has a higher number set than this one.
CONCURRENT_REQUESTS=16
Maximum number of active requests per domain
default:
CONCURRENT_REQUESTS_PER_DOMAIN=4
User-agent activation
default: 'news-please (+http://www.example.com/)'
USER_AGENT = 'news-please (+http://www.example.com/)'
Pipeline activation
Syntax: '.': <Order of execution from 0-1000>
default: {'newsplease.pipeline.pipelines.ArticleMasterExtractor':100, 'newsplease.crawler.pipeline.HtmlFileStorage':200, 'newsplease.pipeline.pipelines.JsonFileStorage': 300}
Further options: 'newsplease.pipeline.pipelines.ElasticsearchStorage': 350
ITEM_PIPELINES = {'newsplease.pipeline.pipelines.ArticleMasterExtractor':100,
'newsplease.pipeline.pipelines.HtmlFileStorage':200,
'newsplease.pipeline.pipelines.JsonFileStorage':300,
'newsplease.pipeline.pipelines.PostgresqlStorage':400
}
ITEM_CLASS = 'newsplease.crawler.items.NewscrawlerItem'
[Pandas]
file_name = "PandasStorage"

* Traceback

[scrapy.core.scraper:249|ERROR] Error processing {'abs_local_path': '/home/chris/news-please-repo/data/2020/12/13/stellenmarkt.faz.net/karriere-lounge1607862347.html',
'article_author': [],
'article_description': 'Mit unserem Karriere Ratgeber lernen Sie wertvolles '
'Wissen für Ihre Karriereplanung kennen und '
'profitieren von Selbstanalyse Inhalten.',
'article_image': 'https://stellenmarkt.faz.net/wp-content/themes/fazstm2-0/img/og-image.jpg',
'article_language': 'de',
'article_publish_date': None,
'article_text': 'Es ist endlich geschafft. Sie haben den ersten '
'Schulabschluss in der Tasche, die Euphorie ist groß und das '
'darf Sie auch sein. Nach so langer Zeit auf der Schulbank '
'darf man sich die Freude über das Ende dieses '
'Lebensabschnitts ruhig erlauben.\n'
'Irgendwann in unserem Leben kommen wir alle an den Punkt, an '
'dem wir uns entscheiden müssen, was wir mit unserer Zukunft '
'tun.',
'article_title': 'Karriere Ratgeber: Die Selbstanalyse',
'download_date': '2020-12-13 12:25:47',
'filename': 'karriere-lounge1607862347.html',
'html_title': b'Karriere Ratgeber: Die Selbstanalyse |F.A.Z. Stellenmarkt',
'local_path': '/home/chris/news-please-repo//data/2020/12/13/stellenmarkt.faz.net/karriere-lounge1607862347.html',
'modified_date': '2020-12-13 12:25:47',
'rss_title': 'NULL',
'source_domain': b'faz.net',
'spider_response': <200 https://stellenmarkt.faz.net/karriere-lounge/selbstanalyse/>,
'url': 'https://stellenmarkt.faz.net/karriere-lounge/selbstanalyse/'}
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.8/dist-packages/scrapy/utils/defer.py", line 150, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/usr/local/lib/python3.8/dist-packages/newsplease/pipeline/pipelines.py", line 425, in process_item
old_version = self.cursor.fetchone()
psycopg2.ProgrammingError: no results to fetch
[scrapy.spidermiddlewares.httperror:53|INFO] Ignoring response <404 https://stellenmarkt.faz.net/tag/work-life-balance>: HTTP status code is not handled or not allowed
[scrapy.spidermiddlewares.httperror:53|INFO] Ignoring response <404 https://stellenmarkt.faz.net/tag/coach>: HTTP status code is not handled or not allowed
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://stellenmarkt.faz.net/karriere-lounge/fuehrung/krisengebiete-als-karriereturbo/
[newsplease.pipeline.pipelines:524|INFO] Saving HTML to /home/chris/news-please-repo/data/2020/12/13/stellenmarkt.faz.net/karriere-lounge_fuehrung__1607862348.html
[newsplease.pipeline.pipelines:550|INFO] Saving JSON to /home/chris/news-please-repo/data/2020/12/13/stellenmarkt.faz.net/karriere-lounge_fuehrung1607862348.html.json
[newsplease.pipeline.pipelines:421|ERROR] Something went wrong in query: current transaction is aborted, commands ignored until end of transaction block
[scrapy.core.scraper:249|ERROR] Error processing {'abs_local_path': '/home/chris/news-please-repo/data/2020/12/13/stellenmarkt.faz.net/karriere-lounge_fuehrung1607862348.html',
'article_author': [],
'article_description': '',
'article_image': 'https://stellenmarkt.faz.net/wp-content/themes/fazstm2-0/img/og-image.jpg',
'article_language': 'de',
'article_publish_date': '2017-01-13 17:35:31',
'article_text': 'Das kann in einem rechtlichen Chaos enden. Wie findet man '
'sich dennoch zurecht?\n'
'Quelle: LAIF\n'
'Von Hendrik Wieduwilt\n'
'Für den Aufstieg in die Chefetage sind Einsätze im Ausland '
'recht förderlich. Wer sich in Missachtung jeder Reisewarnung '
'des Auswärtigen Amtes sogar in einem politisch riskanten '
'Schwellenland beruflich bewährt hat, kann womöglich auch die '
'unternehmerischen Risiken besser schultern, lautet die '
'gängige Vermutung. Bei Personalmanagern stehen „Expats“, wie '
'die erfahrenen Kandidaten genannt werden, daher hoch im '
'Kurs. Dabei kann es bei den Einsätzen im Ausland durchaus '
'riskant werden. Einer Umfrage der gemeinnützigen '
'Organisation International SOS zufolge senden 88 Prozent der '
'deutschen Unternehmen ihre Mitarbeiter regelmäßig in '
'medizinisch hochriskante Gebiete. Zwei Drittel der befragten '
'Arbeitgeber geben dabei an, dass solche Extratouren durchaus '
'für längere Zeit geplant sind.\n'
'Entsendungen mit einer durchschnittlichen Aufenthaltsdauer '
'von rund zwei Jahren liegen gegenwärtig im Trend, heißt es '
'in einer aktuellen Studie des Personalberatungsunternehmens '
'Mercer. Die Entwicklung begründet Beraterin Christa Zihlmann '
'mit dem Umstand, dass vor allem Klein- und Mittelbetriebe in '
'Deutschland international tätig sind und sich mit eigenen '
'Mitarbeitern in die Wachstumsmärkte vortasten.\n'
'Je nachdem, ob der jeweilige Absatzmarkt aus politischer '
'oder umwelttechnischer Sicht als Krisenregion eingestuft '
'wird, erwachsen dadurch unterschiedliche Risiken für den '
'Arbeitgeber. Angesichts der juristischen Folgen hat der '
'Spitzenverband der gesetzlichen Unfallversicherung zusammen '
'mit International SOS und der Großkanzlei Dentons eine '
'Übersicht über die Rechte und Pflichten deutscher '
'Unternehmen erstellt, die dieser Zeitung exklusiv vorliegt.\n'
'„Die Sicherheit und Gesundheit von Entsendeten in Ländern '
'wie Libyen, Nigeria, Pakistan oder im Irak zu gewährleisten '
'ist viel umfangreicher als in der Europäischen Union“, warnt '
'darin Stefan Eßer von International SOS, Mitautor des '
'Leitfadens. Viele Unternehmen würden verkennen, dass für '
'Auslandsstandorte bisweilen eine eigene medizinische '
'Infrastruktur geschaffen werden muss. Der Allgemeinmediziner '
'weist dabei auf die Unterschiede zwischen Stadt und Land in '
'vielen Staaten hin. Während beispielsweise in einer '
'Metropole wie Bangkok eine intakte medizinische Versorgung '
'gewährleistet sei, sehe das in ländlichen '
'Produktionsstandorten Thailands womöglich ganz anders aus. '
'Ähnliches gelte für die aufstrebenden Wachstumsmärkte in '
'Brasilien, China, Russland oder Indien.\n'
'Dagegen sind innerhalb Europas die juristischen wie '
'wirtschaftlichen Risiken überschaubar. Bei rechtlichen '
'Belangen helfen mehrere EU-Verordnungen weiter. Danach gilt '
'das jeweilige Sozialversicherungsrecht eines '
'Mitgliedstaates, für Expats in Deutschland also meist das '
'deutsche. Damit haftet nicht der beitragszahlende '
'Arbeitgeber, sondern die nationale Sozialversicherung. '
'Außerhalb der EU können internationale Abkommen '
'weiterhelfen, mit ähnlichen Rechtsfolgen, wie sie auch im '
'europäischen Ausland gelten.\n'
'Fehlen solche Verordnungen oder Abkommen, verheddern sich '
'die Rechtsordnungen der beteiligten Staaten meist wie '
'Kopfhörerkabel: Ein doppelter Versicherungsschutz kann etwa '
'die Folge sein, heißt es in dem Leitfaden. Für den größten '
'Teil der befristeten Entsendungen hat jedoch das deutsche '
'Sozialversicherungsrecht seine Gültigkeit.\n'
'Der Arbeitgeber muss sich auch im praktischen Arbeitsalltag '
'um seine Mitarbeiter kümmern, oder, wie es juristisch heißt, '
'seine „Fürsorge- und Treuepflicht erfüllen“. Verletzt er '
'diese Pflichten, kann der Arbeitnehmer auf Schadensersatz '
'klagen und seine Arbeit unter Umständen vorübergehend '
'einstellen. Juristen sprechen dann von einem '
'„Zurückbehaltungsrecht“. Die Autoren weisen dabei auf einen '
'wichtigen Zusammenhang hin: Je gefährlicher das Zielland '
'eingestuft werde, desto stärker müsse sich der Arbeitgeber '
'um seine Mitarbeiter vor Ort kümmern. Die Fürsorge sollte '
'sich auf die Zeit vor, während und nach dem '
'Auslandsaufenthalt beziehen.\n'
'Danach sollten die Unternehmen ihre Statthalter vor Ort '
'möglichst vorab über Gesundheitsrisiken und '
'Impfschutzmaßnahmen informieren und auf die sozial- und '
'steuerrechtlichen Risiken hinweisen. Das gilt vor allem '
'dann, wenn der Arbeitgeber ohnehin über erhebliche '
'Auslandserfahrung verfügt. Die Pflichten vor Ort schwanken '
'je nach Umfang des Auslandseinsatzes oder dem Risikoprofil '
'des Zielortes. Sie reichen vom Abschluss einer zusätzlichen '
'Krankenversicherung innerhalb Europas (Im Gespräch: „Expats '
'brauchen kundige Partner“) bis hin zum Personenschutz von '
'Mitarbeitern, wie er in einem Krisenland wie Nigeria seit '
'Jahren üblich ist.\n'
'Die Autoren des Leitfadens warnen Unternehmen nicht nur vor '
'den Haftungsansprüchen, die von Mitarbeitern im Falle einer '
'Pflichtverletzung gestellt werden. Patzt der Arbeitgeber '
'zudem regelmäßig beim Umgang seiner Manager im Ausland, '
'nimmt auch sein öffentliches Ansehen auf Dauer Schaden: Denn '
'wer möchte schon für ein Unternehmen arbeiten, das seine '
'Mitarbeiter in heiklen Situationen im Stich lässt, lautet '
'ein warnender Hinweis im Leitfaden.\n'
'Eine „Treuepflicht“ hat der Arbeitnehmer im Ausland freilich '
'auch gegenüber dem Arbeitgeber. So gelten die Statthalter '
'vor Ort als die natürlichen Repräsentanten eines '
'Unternehmens, die die jeweiligen Sitten und Gebräuche ihres '
'Gastlandes beachten sollten. Dabei bezieht sich ein solcher '
'Verhaltenskodex nicht nur auf das berufliche Umfeld, sondern '
'auch auf Teilbereiche der privaten Sphäre. In dem Leitfaden '
'für Expats ist daher von einem respektvollen und (der '
'jeweiligen Kultur) angemessenen Umgang mit Kunden die Rede. '
'Dort ist aber auch der Hinweis zu finden, dass sich für '
'ausländische Manager der übermäßige Alkoholkonsum in der '
'Öffentlichkeit oder bei privaten Festen verbietet, wenn sich '
'der Arbeitsplatz in einem islamischen Land befindet.\n'
'Die wichtigsten Adressaten des Leitfadens sind Juristen, die '
'auf Arbeitsrecht spezialisiert und mit Personalfragen in '
'Unternehmen betraut sind. Für Führungskräfte ohne '
'arbeitsrechtliche Kenntnisse führt die Lektüre des '
'Leitfadens wohl eher zum festen Vorsatz, einen einschlägigen '
'Fachmann für das komplexe Vorhaben zu verpflichten.\n'
'Auch das Ende eines befristeten Auslandseinsatzes will gut '
'vorbereitet sein. Viele Rückkehrer müssen rechtzeitig mit '
'neuen Projekten bedacht oder aber behutsam in eine Zentrale '
'integriert werden. Im Arbeitsalltag vieler Unternehmen führt '
'nicht der Auslandsaufenthalt selbst, sondern vielmehr der '
'Umstand, wie die Rückkehr organisiert ist, zu internen '
'Reibungsverlusten, räumen erfahrene Personalmanager ein.\n'
'Das Arbeitsrecht verpflichtet Unternehmen zwar, die Rückkehr '
'ins normale Berufsleben zu ermöglichen. Dabei sollte der '
'Arbeitnehmer im Blick haben, dass das Abenteuer im Ausland '
'meist Spuren in der Persönlichkeit des Mitarbeiters '
'hinterlässt: „Es kommt nicht der gleiche Mensch zurück“, '
'berichtet Steffen Görres aus der Kanzlei Bryan Cave in '
'Hamburg. Um böse Überraschungen für alle Beteiligten zu '
'vermeiden, empfiehlt der Anwalt aus Hamburg den '
'Personalmanagern, das Gespräch mit den Expats über ihre '
'künftigen Aufgaben frühzeitig aufzunehmen. Auch im '
'Entsendungsvertrag sollte die Rückkehr so konkret wie '
'möglich geregelt werden. Bleibt die Vorsorge aus, seien '
'weitere Konflikte programmiert, warnt Görres.',
'article_title': 'Krisengebiete als Karriereturbo',
'download_date': '2020-12-13 12:25:48',
'filename': 'karriere-lounge_fuehrung1607862348.html',
'html_title': b'Krisengebiete als Karriereturbo - F.A.Z. Stellenmarkt',
'local_path': '/home/chris/news-please-repo//data/2020/12/13/stellenmarkt.faz.net/karriere-lounge_fuehrung1607862348.html',
'modified_date': '2020-12-13 12:25:48',
'rss_title': 'NULL',
'source_domain': b'faz.net',
'spider_response': <200 https://stellenmarkt.faz.net/karriere-lounge/fuehrung/krisengebiete-als-karriereturbo/>,
'url': 'https://stellenmarkt.faz.net/karriere-lounge/fuehrung/krisengebiete-als-karriereturbo/'}
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/twisted/internet/defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/usr/local/lib/python3.8/dist-packages/scrapy/utils/defer.py", line 150, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "/usr/local/lib/python3.8/dist-packages/newsplease/pipeline/pipelines.py", line 425, in process_item
old_version = self.cursor.fetchone()
psycopg2.ProgrammingError: no results to fetch
[scrapy.spidermiddlewares.httperror:53|INFO] Ignoring response <404 https://gutscheine.faz.net/lufthansa-com>: HTTP status code is not handled or not allowed
[newsplease.helper_classes.sub_classes.heuristics_manager:49|INFO] Checking site: https://stellenmarkt.faz.net/karriere-lounge/management/teure-rufschaedigung-durch-frustrierte-fruehere-kollegen/
[newsplease.pipeline.pipelines:524|INFO] Saving HTML to /home/chris/news-please-repo/data/2020/12/13/stellenmarkt.faz.net/karriere-lounge_management1607862348.html
[newsplease.pipeline.pipelines:550|INFO] Saving JSON to /home/chris/news-please-repo/data/2020/12/13/stellenmarkt.faz.net/karriere-lounge_management__1607862348.html.json
[newsplease.pipeline.pipelines:421|ERROR] Something went wrong in query: current transaction is aborted, commands ignored until end of transaction block
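
One possible reading of the log above: line 421 reports that a query failed ("current transaction is aborted"), and line 425 then calls `fetchone()` on a cursor that holds no result set, which is when psycopg2 raises "no results to fetch". The sketch below shows a defensive pattern for that situation. It is not the news-please code: `fetch_one_or_none` is a hypothetical helper, and `sqlite3` is used purely as a stand-in so the example runs without a PostgreSQL server. It relies only on DB-API 2.0 (PEP 249) behaviour that psycopg2 also implements: `connection.rollback()` and `cursor.description` being `None` when a statement returned no rows.

```python
import sqlite3  # stand-in for psycopg2 so the sketch runs without a server


def fetch_one_or_none(connection, cursor, query, params=()):
    """Run a query and return one row, or None if nothing can be fetched."""
    try:
        cursor.execute(query, params)
    except Exception:
        # In PostgreSQL a failed statement aborts the transaction; rolling
        # back makes the connection usable for the next item.
        connection.rollback()
        return None
    # description is None when the statement produced no result set, the
    # case in which psycopg2's fetchone() raises "no results to fetch".
    if cursor.description is None:
        return None
    return cursor.fetchone()


# Hypothetical table and data, for demonstration only.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE articles (url TEXT, version INTEGER)")
cur.execute("INSERT INTO articles VALUES (?, ?)", ("https://example.com/a", 1))
conn.commit()

lookup = "SELECT version FROM articles WHERE url = ?"
found = fetch_one_or_none(conn, cur, lookup, ("https://example.com/a",))
missing = fetch_one_or_none(conn, cur, lookup, ("https://example.com/b",))
broken = fetch_one_or_none(conn, cur, "SELECT version FROM no_such_table")
after_error = fetch_one_or_none(conn, cur, "SELECT COUNT(*) FROM articles")
```

The point of the pattern is that a failed query no longer poisons every later item: the rollback clears the aborted transaction, and the `description` check avoids fetching when there is nothing to fetch. Whether the underlying query fails because of the schema, the credentials, or something else still has to be found from the first error that PostgreSQL reports.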


**To Reproduce**
- Install news-please via pip3 (pip3 install news-please)
- Install psycopg2 and uninstall psycopg2-binary (pip3 uninstall psycopg2-binary and pip3 install psycopg2)
- Edit the config.cfg file: add the relevant PostgreSQL credentials and add the PostgreSQL pipeline
- From within psql, create the database, then run the provided setup script to create the tables and columns for that database
- Run news-please via bash

**Expected behavior**
The error should not occur, and news-please should start filling PostgreSQL with data on news articles.

**Log**
n/a

**Versions (please complete the following information):**
 - OS: Debian 10
 - Python Version 3.8.6
 - news-please Version 1.5.13

**Intent (optional; we'll use this info to prioritize upcoming tasks to work on)**
* [x] personal
* [ ] academic
* [ ] business
* [ ] other
* Some information on your project: 
Attempting to develop an osint tool to search for news articles

fhamborg / news-please

Adding Postgresql pipeline in config.cfg gives error "psycopg2.ProgrammingError: no results to fetch error" when running crawler #187
