joeyzhou98 opened this issue 4 years ago
Thanks for reporting.
I suspect `--is-pipeline` doesn't get much use, and it doesn't look to be covered by tests. `load_pipeline_from_module()` has a bit that's supposed to handle these relative imports:
```python
dirname_ = dirname(module)
assert(module.endswith('.py'))
try:
    sys.path.insert(0, dirname_)
    modname = basename(module)[:-3]
    # to allow for relative imports within "stock" pipelines
    if dirname_ == opj(dirname(__file__), 'pipelines'):
        mod = __import__('datalad_crawler.pipelines.%s' % modname,
                         fromlist=['datalad_crawler.pipelines'])
    else:
        mod = __import__(modname, level=0)
```
The problem is that we never take the if-arm, because the condition assumes `__file__` is a relative path, which isn't necessarily the case. As a quick and dirty fix, we can work around this with
```diff
diff --git a/datalad_crawler/pipeline.py b/datalad_crawler/pipeline.py
index a23c70e..4f117c3 100644
--- a/datalad_crawler/pipeline.py
+++ b/datalad_crawler/pipeline.py
@@ -50,6 +50,7 @@
 import sys
 from glob import glob
+from os.path import abspath
 from os.path import dirname, join as opj, isabs, exists, curdir, basename
 from os import makedirs
@@ -391,7 +392,7 @@ def load_pipeline_from_module(module, func=None, args=None, kwargs=None, return_
         sys.path.insert(0, dirname_)
         modname = basename(module)[:-3]
         # to allow for relative imports within "stock" pipelines
-        if dirname_ == opj(dirname(__file__), 'pipelines'):
+        if abspath(dirname_) == opj(abspath(dirname(__file__)), 'pipelines'):
             mod = __import__('datalad_crawler.pipelines.%s' % modname,
                              fromlist=['datalad_crawler.pipelines'])
         else:
```
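To illustrate why the unpatched comparison fails, here is a minimal sketch with hypothetical paths (the module path typically arrives from the command line as a relative path, while `dirname(__file__)` of an installed package is absolute):

```python
from os.path import abspath, join as opj

dirname_ = "datalad_crawler/pipelines"   # relative, user-supplied on the CLI
pkg_dir = abspath("datalad_crawler")     # absolute, like dirname(__file__)

# Unpatched condition: comparing a relative string to an absolute one is
# False even when both name the same directory.
print(dirname_ == opj(pkg_dir, "pipelines"))                    # False

# Patched condition: normalizing both sides makes them comparable.
print(abspath(dirname_) == opj(abspath(pkg_dir), "pipelines"))  # True
```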
But that just gets us to another failure:
```
$ datalad crawl --is-pipeline datalad_crawler/pipelines/nda.py
[INFO ] Loading pipeline definition from datalad_crawler/pipelines/nda.py
[ERROR ] Failed to import pipeline from datalad_crawler/pipelines/nda.py: pipeline() missing 1 required positional argument: 'collection' [pipeline.py:load_pipeline_from_module:402] [pipeline.py:load_pipeline_from_module:404] (RuntimeError)
```
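This second failure is plain Python: the module now imports, but the loader ends up calling `pipeline()` with no arguments, while nda.py's `pipeline()` requires one. A minimal sketch with a hypothetical stand-in definition:

```python
# Hypothetical stand-in for a stock pipeline definition whose pipeline()
# has a required argument, as the error above reports for nda.py:
def pipeline(collection):
    return [["crawl", collection]]

# Calling it with no arguments, as the loader does, raises exactly the
# TypeError seen in the output:
try:
    pipeline()
except TypeError as e:
    print(e)  # pipeline() missing 1 required positional argument: 'collection'
```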
So `--is-pipeline` needs some attention.
@kyleam thanks for the quick reply!

How would you test new or existing pipelines, i.e. what are the commands to execute them? As I am trying to write a new crawler for Zenodo, I was trying to find a way to test and execute an existing pipeline to observe the expected behavior. The problem is that executing

```
datalad crawl --is-pipeline datalad_crawler/pipelines/<pipeline>.py
```

gives a relative import error. So my question is: how do we successfully test crawling existing pipelines with the `--is-pipeline` flag? I tested multiple different paths and all gave me the same error:

```
[ERROR ] Failed to import pipeline from datalad_crawler/pipelines/nda.py: attempted relative import with no known parent package [nda.py:<module>:13] [pipeline.py:load_pipeline_from_module:403] (RuntimeError)
```

I chose nda.py randomly as a pipeline for testing.

Edit: It would be great if the documentation could be more in-depth.
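For anyone hitting the same wall: the "attempted relative import with no known parent package" error is not specific to datalad; it occurs whenever a file that uses relative imports (as the stock pipelines do) is imported as a top-level module rather than as part of its package. A minimal reproduction with hypothetical file names:

```python
import os
import sys
import tempfile

# Create a throwaway module that uses a relative import, as nda.py does.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "mypipeline.py"), "w") as f:
    f.write("from . import utils\n")

# Importing it as a top-level module (no parent package) fails the same way.
sys.path.insert(0, tmpdir)
try:
    import mypipeline
except ImportError as e:
    print(e)  # attempted relative import with no known parent package
```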