datalad / datalad-crawler

DataLad extension for tracking web resources as datasets
http://datalad.org

--is-pipeline relative import error #52

Open joeyzhou98 opened 4 years ago

joeyzhou98 commented 4 years ago

While trying to write a new crawler for Zenodo, I was looking for a way to test and execute an existing pipeline to observe the expected behavior. The problem is that when executing

    datalad crawl --is-pipeline datalad_crawler/pipelines/<pipeline>.py

there seems to be a relative import error. So my question is: how do we successfully test crawling with existing pipelines via the --is-pipeline flag? I tried multiple different paths and all gave me the same error:

    [ERROR ] Failed to import pipeline from datalad_crawler/pipelines/nda.py: attempted relative import with no known parent package [nda.py:<module>:13] [pipeline.py:load_pipeline_from_module:403] (RuntimeError)

(I picked nda.py at random as the pipeline to test with.)


Edit: It would be great if the documentation could be more in-depth on this.

kyleam commented 4 years ago

Thanks for reporting.

I suspect --is-pipeline doesn't get much use, and it doesn't appear to be covered by the tests. load_pipeline_from_module() has a bit that's supposed to handle these relative imports:

    dirname_ = dirname(module)
    assert(module.endswith('.py'))
    try:
        sys.path.insert(0, dirname_)
        modname = basename(module)[:-3]
        # to allow for relative imports within "stock" pipelines
        if dirname_ == opj(dirname(__file__), 'pipelines'):
            mod = __import__('datalad_crawler.pipelines.%s' % modname,
                             fromlist=['datalad_crawler.pipelines'])
        else:
            mod = __import__(modname, level=0)

The problem is that we don't go down the if-arm because the condition assumes __file__ will be a relative path, which isn't necessarily the case: the two sides can name the same directory yet still compare unequal.
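Here's a minimal sketch of the failing comparison, using hypothetical paths (a source checkout at /home/me/datalad-crawler, with the command run from that directory):

    # hypothetical paths illustrating the comparison in load_pipeline_from_module
    from os.path import abspath, dirname, join as opj

    # pipeline module as typed on the command line, relative to the CWD
    module = "datalad_crawler/pipelines/nda.py"
    dirname_ = dirname(module)   # 'datalad_crawler/pipelines'

    # __file__ of pipeline.py once Python has resolved it to an absolute path
    file_ = "/home/me/datalad-crawler/datalad_crawler/pipeline.py"

    # relative vs. absolute: unequal even though both name the same directory
    print(dirname_ == opj(dirname(file_), 'pipelines'))   # False

    # normalizing both sides makes them comparable
    # (True when the CWD is /home/me/datalad-crawler)
    print(abspath(dirname_) == opj(abspath(dirname(file_)), 'pipelines'))

As a quick and dirty fix, we can work around this by normalizing both sides: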

diff --git a/datalad_crawler/pipeline.py b/datalad_crawler/pipeline.py
index a23c70e..4f117c3 100644
--- a/datalad_crawler/pipeline.py
+++ b/datalad_crawler/pipeline.py
@@ -50,6 +50,7 @@

 import sys
 from glob import glob
+from os.path import abspath
 from os.path import dirname, join as opj, isabs, exists, curdir, basename
 from os import makedirs

@@ -391,7 +392,7 @@ def load_pipeline_from_module(module, func=None, args=None, kwargs=None, return_
         sys.path.insert(0, dirname_)
         modname = basename(module)[:-3]
         # to allow for relative imports within "stock" pipelines
-        if dirname_ == opj(dirname(__file__), 'pipelines'):
+        if abspath(dirname_) == opj(abspath(dirname(__file__)), 'pipelines'):
             mod = __import__('datalad_crawler.pipelines.%s' % modname,
                              fromlist=['datalad_crawler.pipelines'])
         else:

But that just gets us to another failure:

$ datalad crawl --is-pipeline datalad_crawler/pipelines/nda.py
[INFO   ] Loading pipeline definition from datalad_crawler/pipelines/nda.py 
[ERROR  ] Failed to import pipeline from datalad_crawler/pipelines/nda.py: pipeline() missing 1 required positional argument: 'collection' [pipeline.py:load_pipeline_from_module:402] [pipeline.py:load_pipeline_from_module:404] (RuntimeError) 

So --is-pipeline needs some attention.
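In the meantime, one way to poke at a stock pipeline is to load it from Python and supply the missing argument yourself. A sketch, assuming load_pipeline_from_module() forwards kwargs to the module's pipeline() function ('NDARCOL000' is a made-up placeholder, not a real NDA collection):

    # load the nda pipeline directly, supplying the 'collection' argument
    # that --is-pipeline currently has no way to pass through
    from datalad_crawler.pipeline import load_pipeline_from_module

    pipeline = load_pipeline_from_module(
        'datalad_crawler/pipelines/nda.py',
        kwargs={'collection': 'NDARCOL000'},  # placeholder collection id
    )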

joeyzhou98 commented 4 years ago

@kyleam thanks for the quick reply!

How would you test new or existing pipelines? That is, what are the commands to execute them?

kyleam commented 4 years ago

> what are the commands to execute them?

Have you tried following the demo here?
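For reference, the flow it walks through doesn't involve --is-pipeline at all; from memory it looks roughly like this (treat the template name and its url parameter as illustrative, and check the demo for the real invocation):

    # create a dataset, configure crawling from a stock template, then crawl
    datalad create mydataset
    cd mydataset
    datalad crawl-init --save --template=simple_with_archives url=<URL>
    datalad crawl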