datavolo-io / hatch-datavolo-nar

Hatch plugin for building Apache NiFi NAR bundles
Apache License 2.0

help with ModuleNotFoundError? #4

Status: Closed (snowch closed 1 month ago)

snowch commented 1 month ago

I'll try to create a minimal reproducible example that doesn't have the vastdb dependency, but in the meantime I'm raising this here in case anyone else has come across the same issue.

The processor works as a Python script dropped into python/extensions, but I get the error below when I try to deploy it as a NAR.

The issue I'm encountering is very likely user error: this is the first time I've used Hatch.

My NiFi logs show the following:

2024-09-01 16:55:44,073 ERROR [Initialize Python Processor ae835b2d-0191-1000-ffff-fffffeaa9acd (PutVastDB)] o.a.n.py4j.StandardPythonProcessorBridge Failed to load code for Python Processor ae835b2d-0191-1000-ffff-fffffeaa9acd (PutVastDB). Will try again in 16000 millis
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/stackable/nifi-2.0.0-M4/python/framework/py4j/java_gateway.py", line 2466, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/stackable/nifi-2.0.0-M4/./python/framework/Controller.py", line 70, in createProcessor
    processorClass = self.extensionManager.getProcessorClass(processorType, version, work_dir)
  File "/stackable/nifi-2.0.0-M4/python/framework/ExtensionManager.py", line 105, in getProcessorClass
    processor_class = self.__load_extension_module(module_file, details.local_dependencies)
  File "/stackable/nifi-2.0.0-M4/python/framework/ExtensionManager.py", line 372, in __load_extension_module
    module_spec.loader.exec_module(module)
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/stackable/nifi-2.0.0-M4/./work/nar/extensions/PutVastDB-0.1.0.nar-unpacked/PutVastDB.py", line 21, in <module>
    import pyarrow as pa
  File "/stackable/nifi-2.0.0-M4/./work/nar/extensions/PutVastDB-0.1.0.nar-unpacked/NAR-INF/bundled-dependencies/pyarrow/__init__.py", line 65, in <module>
    import pyarrow.lib as _lib
ModuleNotFoundError: No module named 'pyarrow.lib'

        at py4j.Protocol.getReturnValue(Protocol.java:476)
        at org.apache.nifi.py4j.client.PythonProxyInvocationHandler.invoke(PythonProxyInvocationHandler.java:87)
        at jdk.proxy6/jdk.proxy6.$Proxy59.createProcessor(Unknown Source)
        at org.apache.nifi.py4j.PythonProcess$1.createProcessor(PythonProcess.java:468)
        at org.apache.nifi.py4j.StandardPythonProcessorBridge.initializePythonSide(StandardPythonProcessorBridge.java:145)
        at org.apache.nifi.py4j.StandardPythonProcessorBridge.lambda$initialize$0(StandardPythonProcessorBridge.java:78)
        at java.base/java.lang.VirtualThread.run(VirtualThread.java:329)
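For anyone reading the traceback above: a compiled module such as pyarrow.lib is only found if its file name is the module name plus one of the suffixes the running interpreter accepts. The following is a minimal sketch of that check, assuming CPython's standard file-based import and the list in importlib.machinery.EXTENSION_SUFFIXES; the helper names candidate_filenames and would_import are hypothetical and not part of NiFi or hatch-datavolo-nar.

```python
import importlib.machinery


def candidate_filenames(module_name):
    """File names the running interpreter would accept for a compiled module."""
    return [module_name + suffix for suffix in importlib.machinery.EXTENSION_SUFFIXES]


def would_import(module_name, filename):
    """True if `filename` is one of the names the running interpreter looks for."""
    return filename in candidate_filenames(module_name)


# Run this inside the interpreter NiFi uses. A Python 3.9 process reports False
# for both bundled files from this thread, since neither name carries its tag.
for name in ("lib.cpython-311-darwin.so", "lib.cpython-312-x86_64-linux-gnu.so"):
    print(name, would_import("lib", name))
```

If both lines print False, the file is present in the bundle but invisible to that interpreter, which would produce the same ModuleNotFoundError.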

My pyproject.toml:

[build-system]
requires = [
    "hatchling",  
    "hatch-datavolo-nar"
]
build-backend = "hatchling.build"

[project]
name = "PutVastDB"
dynamic = ["version"]
requires-python = ">=3.11"
description = "Publishes JSON data to a Vast DB."
authors = [
    { name="Chris Snow", email="" }
]
dependencies = [
    "vastdb",
    "pyarrow",
]

[tool.hatch.version]
path = "__init__.py"

[tool.hatch.build.targets.nar]
packages = ["."]

My processor imports:

from nifiapi.properties import PropertyDescriptor, StandardValidators
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
import io
import vastdb
import logging
import pyarrow as pa
from pyarrow import json as pa_json
import json

class PutVastDB(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']
...

A trimmed listing of the files in the NAR:

Archive:  dist/PutVastDB-0.1.0.nar
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  01-01-1980 00:00   META-INF/
      153  09-01-2024 11:36   META-INF/MANIFEST.MF
       30  09-01-2024 02:31   .env
     8270  08-25-2024 21:26   PutVastDB.py
       21  09-01-2024 01:55   __init__.py
      449  09-01-2024 02:25   pyproject.toml
        0  01-01-1980 00:00   .vscode/
      571  08-23-2024 09:49   .vscode/launch.json
        0  01-01-1980 00:00   NAR-INF/
        0  01-01-1980 00:00   NAR-INF/bundled-dependencies/
    19094  09-01-2024 11:37   NAR-INF/bundled-dependencies/xmltodict.py
    34549  09-01-2024 11:37   NAR-INF/bundled-dependencies/six.py
   134451  09-01-2024 11:37   NAR-INF/bundled-dependencies/typing_extensions.py

...

  3858168  09-01-2024 11:37   NAR-INF/bundled-dependencies/pyarrow/lib.cpython-311-darwin.so
...
---------                     -------
270777417                     13283 files
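With over 13,000 files in the archive, it can be easier to pull out only the compiled modules and their build tags. Below is a rough sketch that does this, assuming the NAR is a regular zip archive (as the listing above suggests) and that dependencies live under NAR-INF/bundled-dependencies/. The function name native_modules is hypothetical.

```python
import zipfile

BUNDLED_PREFIX = "NAR-INF/bundled-dependencies/"


def native_modules(nar_path):
    """Return (path, tag) pairs for compiled modules bundled in the NAR."""
    found = []
    with zipfile.ZipFile(nar_path) as nar:
        for name in nar.namelist():
            if not name.startswith(BUNDLED_PREFIX):
                continue
            if not name.endswith((".so", ".pyd")):
                continue
            # "lib.cpython-311-darwin.so" -> ["lib", "cpython-311-darwin", "so"]
            parts = name.rsplit("/", 1)[-1].split(".")
            tag = parts[-2] if len(parts) >= 3 else ""
            found.append((name, tag))
    return found


# Example, using the path from the listing above:
# for path, tag in native_modules("dist/PutVastDB-0.1.0.nar"):
#     print(tag, path)
```

Versioned shared libraries whose names do not end in .so or .pyd are skipped by this sketch. A tag such as cpython-311-darwin shows both the Python version and the platform the bundle was built on.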

snowch commented 1 month ago

I switched the build to a Linux environment, but I get the same error. The newly bundled pyarrow.lib is:

  4585648  2024-09-02 15:06   NAR-INF/bundled-dependencies/pyarrow/lib.cpython-312-x86_64-linux-gnu.so

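The build is running on Python 3.12, and I think that is now the issue: the bundled extension is tagged cpython-312, so an interpreter with a different CPython minor version will not import it.

Unfortunately, my NiFi runtime is Python 3.9, so I think I'll need to figure out how to build the NAR manually on Python 3.9.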
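A quick way to confirm a mismatch like this is to compare the version encoded in the file's tag with the runtime's version. This is only a sketch: it assumes the usual cpython-XY tag layout seen on Linux and macOS builds (a single-digit major version followed by the minor version), and it does not handle Windows-style tags. The helper names tag_python_version and matches_runtime are hypothetical.

```python
import re
import sys


def tag_python_version(tag):
    """Parse (major, minor) from a tag such as 'cpython-312-x86_64-linux-gnu'."""
    match = re.match(r"cpython-(\d)(\d+)", tag)
    return (int(match.group(1)), int(match.group(2))) if match else None


def matches_runtime(tag, runtime=None):
    """Compare the tag's version with `runtime`, or with this interpreter if omitted."""
    runtime = tuple(sys.version_info[:2]) if runtime is None else tuple(runtime)
    return tag_python_version(tag) == runtime


# The Linux build from this thread against a Python 3.9 runtime:
print(matches_runtime("cpython-312-x86_64-linux-gnu", runtime=(3, 9)))
```

Passing the runtime explicitly lets the check run on the build machine without access to the NiFi host.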

snowch commented 1 month ago

I hacked the hatch-datavolo-nar project to support Python 3.9 and built the NAR in a Linux environment, and the processor now works.