SDM-TIB / SDM-RDFizer

An Efficient RML-Compliant Engine for Knowledge Graph Construction
https://doi.org/10.5281/zenodo.3872103
Apache License 2.0

Improve packaging of the rdfizer, and enable the docker container to be used as CLI #72

Closed vemonet closed 2 years ago

vemonet commented 2 years ago

Hi @dachafra and @eiglesias34, I made some improvements to the Python package and the Docker deployment. Let me know if you are interested in integrating them.

If you are interested, I'll add more details to the README.md about how to run it with Docker and/or the pip package (and we could replace the current Dockerfile with this new one to avoid confusion).

Changes made

Motivations

The official README recommends installing the pip package and using it as a module with python -m:

python3 -m pip install rdfizer
python3 -m rdfizer -c /path/to/config/file

Separately, the GitHub repo Wiki documents how to deploy it with Docker (which is often necessary to avoid dependency conflicts, since RDFizer requires rdflib 4, which is quite old).

Moreover, the workflow used by the Docker deployment is more complex than it needs to be:

What it needs to do: call the semantify() function with the config file path.

What it currently does: the Docker image is built without using the actual rdfizer package (even though that is the recommended way to use it according to the docs) and uses an app.py script to start an API. When triggered with a curl call, the API makes a system call to run another Python script, run_rdfizer.py, which finally runs the semantify() function:

@app.route('/graph_creation/<path:config_file>', methods=['GET','POST'])
def rdfgraph(config_file):
    os.system("python3 /app/rdfizer/run_rdfizer.py /" + config_file)
    return "The file has been semantified " + str(config_file) + "\n"

Note that the current Dockerfile uses python:3.5, which is not supported anymore and also contradicts the requirements of the package in setup.py (so the package cannot be installed in the current Dockerfile):

python_requires='>=3.6',

I used python:3.8 in the new Dockerfile and it seems to work fine.

The current Docker workflow adds a lot of complexity without improving the reproducibility of the software, and it also creates 2 different deployment methods to maintain.

I improved the rdfizer package to make it a CLI, so you don't need to call it with python3 -m every time after installing it locally (this just requires adding an entrypoint to setup.py).

I added a new Dockerfile.cli image that builds from the Python package and can be run as a CLI command.
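To illustrate the CLI change, here is a minimal sketch of what a console entry point could look like. It is not the code from this PR: the main() function, the injectable run parameter (present only so the sketch can be exercised without the package installed), and the module path in the setup.py comment are assumptions, and it presumes the installed package exposes semantify() at the top level.

```python
import argparse
from typing import Callable, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the same -c option as `python3 -m rdfizer`."""
    parser = argparse.ArgumentParser(prog="rdfizer")
    parser.add_argument(
        "-c",
        dest="config_file",
        required=True,
        help="path to the configuration file",
    )
    return parser


def main(
    argv: Optional[Sequence[str]] = None,
    run: Optional[Callable[[str], object]] = None,
) -> int:
    """Parse the command line and pass the config path to semantify()."""
    args = build_parser().parse_args(argv)
    if run is None:
        # Assumption: the installed package exposes semantify() at top level.
        from rdfizer import semantify
        run = semantify
    run(args.config_file)
    return 0


# Hypothetical setup.py fragment registering the command (module path assumed):
#   entry_points={"console_scripts": ["rdfizer=rdfizer.__main__:main"]}
```

With an entry point like this registered, pip generates an rdfizer executable on install, which is what lets the Docker image use the same command as its entrypoint.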

Usage

Build:

docker build -t ghcr.io/vemonet/rdfizer:latest -f Dockerfile.cli .

Run (replace $(pwd) with ${PWD} on Windows to use the current working folder):

docker run -it --rm -v $(pwd):/data ghcr.io/vemonet/rdfizer:latest -c config.ini
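If the command is launched from a script rather than typed in a shell, the $(pwd) versus ${PWD} difference can be sidestepped by resolving the folder in Python. The helper below is only a sketch: docker_run_command is a hypothetical name, and the image tag and /data mount point are copied from the command above.

```python
from pathlib import Path
from typing import List, Optional


def docker_run_command(
    config: str,
    image: str = "ghcr.io/vemonet/rdfizer:latest",
    workdir: Optional[Path] = None,
) -> List[str]:
    """Return the argument list for the `docker run` call shown above."""
    # Resolve the host folder in Python instead of relying on shell expansion.
    host_dir = (workdir if workdir is not None else Path.cwd()).resolve()
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{host_dir}:/data",  # mount the host folder at /data
        image,
        "-c", config,  # forwarded to the rdfizer entrypoint
    ]
```

The returned list can be passed to subprocess.run(..., check=True) from a terminal session; since it is a list rather than a shell string, no quoting of the folder path is needed.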

The rdfizer can still be run as python3 -m rdfizer -c config.ini, but it can also be run as rdfizer -c config.ini, or directly with docker run.

You can also start the API from my Docker image:

docker run -it --rm -v $(pwd)/example:/data --entrypoint python ghcr.io/vemonet/rdfizer:latest /app/app.py

Let me know if you are interested in these kinds of changes, and if there is anything you would like to see done differently.