datalad-datasets / ratholeradio-archive

git-annex archive of all episodes from http://ratholeradio.org with .cue sheets

Crawler pipeline broken #2

Open mih opened 6 years ago

mih commented 6 years ago

The pipeline doesn't work anymore:

/tmp/ratholeradio-archive (git)-[master] % datalad crawl
[INFO   ] Loading pipeline definition from ./.datalad/crawl/pipelines/pipeline.py 
[ERROR  ] Failed to import pipeline from ./.datalad/crawl/pipelines/pipeline.py: No module named 'datalad.crawler' [pipeline.py:<module>:39] [pipeline.py:load_pipeline_from_module:403] (RuntimeError) 

I tried a cheap fix:

diff --git a/.datalad/crawl/pipelines/pipeline.py b/.datalad/crawl/pipelines/pipeline.py
index 5e0618a..f14bcef 100644
--- a/.datalad/crawl/pipelines/pipeline.py
+++ b/.datalad/crawl/pipelines/pipeline.py
@@ -36,10 +36,10 @@ import re
 from os.path import join as opj, basename

 from datalad.utils import updated
-from datalad.crawler.nodes.annex import Annexificator
-from datalad.crawler.nodes.crawl_url import crawl_url
-from datalad.crawler.nodes.misc import sub
-from datalad.crawler.nodes.matches import a_href_match, css_match
+from datalad_crawler.nodes.annex import Annexificator
+from datalad_crawler.nodes.crawl_url import crawl_url
+from datalad_crawler.nodes.misc import sub
+from datalad_crawler.nodes.matches import a_href_match, css_match

 from logging import getLogger
 lgr = getLogger('datalad.custom.ratholeradio')

but that only leads to

% datalad crawl
[INFO   ] Loading pipeline definition from ./.datalad/crawl/pipelines/pipeline.py 
[INFO   ] Creating a pipeline for the ratholeradio.org podcasts 
[INFO   ] Running pipeline [[<datalad_crawler.nodes.crawl_url.crawl_url object at 0x7f8831b96f60>, a_href_match(query=<<'.*/(?P<year>2[0-9]{3}...>>), <datalad_crawler.nodes.crawl_url.crawl_url object at 0x7f8831b96f98>, [sub(ok_missing=False, subs=<<{'response': {'</?stro...>>), css_match(query='div#page .entry'), css_match(query='div#page .entry'), <function process_episode at 0x7f883d05b620>, <datalad_crawler.nodes.annex.Annexificator object at 0x7f884044ee10>]], <bound method Annexificator.finalize of <datalad_crawler.nodes.annex.Annexificator object at 0x7f884044ee10>>] 
[INFO   ] Fetching 'http://ratholeradio.org' 
[WARNING] Failed to open cookies DB /home/mih/.config/datalad/cookies: db type could not be determined [__init__.py:open:88] 
[WARNING] Failed to check for having a cookie for http://ratholeradio.org: argument of type 'NoneType' is not iterable [cookies.py:__contains__:85] 
[ERROR  ] 'function' object is not iterable [pipeline.py:xrun_pipeline_steps:270] (TypeError) 
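
If the pipeline has to load under both package layouts, the import change in the diff could be made tolerant of either name. This is only a sketch: `import_first_available` is a hypothetical helper, not part of datalad or datalad_crawler, and the runnable part uses standard-library module names as stand-ins so it does not need either package installed.

```python
import importlib


def import_first_available(*module_names):
    """Return the first module in module_names that can be imported.

    Hypothetical helper for illustration; not part of any datalad package.
    """
    errors = []
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            errors.append(f"{name}: {exc}")
    raise ImportError("no candidate could be imported: " + "; ".join(errors))


# In pipeline.py the candidates would be the two names from the diff, e.g.
#   annex = import_first_available("datalad_crawler.nodes.annex",
#                                  "datalad.crawler.nodes.annex")
#   Annexificator = annex.Annexificator
# Stand-in demonstration with standard-library names:
json_mod = import_first_available("no_such_module_xyz", "json")
print(json_mod.__name__)
```

This only addresses the import error; it does nothing about the later TypeError.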

yarikoptic commented 6 years ago

Yeah, there has been some code breakage since then, and the cookies DB not being usable across Python releases is a known issue. But I don't think there is any new episode yet.
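
The "db type could not be determined" text matches the message Python's standard `dbm.open` raises when `dbm.whichdb` cannot identify an existing file. Assuming the cookies DB is a dbm-backed file (an assumption, not confirmed in this thread), a quick check of what the running interpreter makes of a given file could look like the sketch below. `describe_db` is a hypothetical helper name, and the demonstration uses a throwaway temporary file rather than the real cookies path.

```python
import dbm
import os
import tempfile


def describe_db(path):
    """Report how the current Python's dbm module classifies path.

    Hypothetical helper for illustration only.
    """
    kind = dbm.whichdb(path)
    if kind is None:
        return "missing"       # file does not exist or is unreadable
    if kind == "":
        return "unrecognized"  # exists, but the format cannot be guessed
    return kind                # e.g. "dbm.gnu", "dbm.ndbm", "dbm.dumb"


with tempfile.TemporaryDirectory() as tmp:
    bogus = os.path.join(tmp, "cookies")
    with open(bogus, "wb") as f:
        f.write(b"this is not a dbm file, just bytes")
    print(describe_db(bogus))
    print(describe_db(os.path.join(tmp, "absent")))
```

Running `describe_db` on the path from the warning under each Python release would show whether the file is simply unreadable by the newer one.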