datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0

Unable to run data cleaner image on docker #1824

Closed isethy closed 5 years ago

isethy commented 5 years ago

I have DataCleaner 5.1.5 installed on my Mac. I am now trying to run the DataCleaner image on Docker, but it is failing. I am following the instructions at this link:

https://github.com/kaspersorensen/datacleaner-docker

Error Message:

hostname:~ userxyz$ docker run --rm kaspersor/datacleaner --usage
Unable to find image 'kaspersor/datacleaner:latest' locally
latest: Pulling from kaspersor/datacleaner
88286f41530e: Pull complete
009f6e766a1b: Pull complete
86ed68184682: Pull complete
81f6e59b92a9: Pull complete
6a25db359c8f: Pull complete
Digest: sha256:243ca722931a1b9489798b3fdf6e2a220ef801080195f9c00109dee8d2ffcdae
Status: Downloaded newer image for kaspersor/datacleaner:latest
INFO 16:47:07 DataCleanerHome - Initializing DATACLEANER_HOME
INFO 16:47:07 DataCleanerHome - Running in standard mode.
Failed to load DataCleaner version from manifest: null
Failed to load DataCleaner version from manifest: null
INFO 16:47:07 DataCleanerHome - Attempting to build DATACLEANER_HOME in user.home: /root/.datacleaner/5.2.2 -> file:///root/.datacleaner/5.2.2
INFO 16:47:07 DataCleanerHome - Folder file:///root/.datacleaner/5.2.2 does not exist. Trying to create it.
INFO 16:47:07 DataCleanerHome - Folder file:///root/.datacleaner/5.2.2 created successfully. Attempting to build DATACLEANER_HOME here.
INFO 16:47:07 DataCleanerHomeUpgrader - Did not find a suitable upgrade candidate
Using default log configuration: jar:file:/opt/DataCleaner/DataCleaner.jar!/org/datacleaner/log4j-default.xml
-conf (-configuration, --configuration-file) PATH : Path to an XML file describing the configuration of DataCleaner
-ds (-datastore, --datastore-name) VAL : Name of datastore when printing a list of schemas, tables or columns. Overrides datastore used when used with -job
-job (--job-file) PATH : Path to an analysis job XML file to execute
-list [ANALYZERS | TRANSFORMERS | FILTERS | DATASTORES | SCHEMAS | TABLES | COLUMNS] : Used to print a list of various elements available in the configuration
-of (--output-file) PATH : Path to file in which to save the result of the job
-ot (--output-type) [TEXT | HTML | SERIALIZED] : How to represent the result of the job
-properties (--properties-file) PATH : Path to a custom properties file
-runtype (--runtype) [LOCAL | SPARK] : How/where to run the job
-s (-schema, --schema-name) VAL : Name of schema when printing a list of tables or columns
-t (-table, --table-name) VAL : Name of table when printing a list of columns
hostname:~ userxyz$ docker imges
docker: 'imges' is not a docker command.
See 'docker --help'

Is this issue because I have 5.1? Do I need to install 5.2 to get this image running? Please help.

LosD commented 5 years ago

This looks correct to me? You're asking for usage, and DataCleaner is showing you how it's run (along with quite a bit of noise. Not entirely sure why that is printed to the console by default).

Afterwards, you're making a misspelled docker images command, hence the error message.

LosD commented 5 years ago

Ah. Just read the instructions. Though you can deduce it from the path, it actually fails to get the version.

Not that it matters, though, DataCleaner itself seems to be starting properly.

isethy commented 5 years ago

username$ docker images
kaspersor/datacleaner latest 03d51b437517 21 months ago 258MB

This is what I see when I check the image. Does this look good?

LosD commented 5 years ago

I think so, but the most important test is if it can actually run a job when given a configuration file and a job (as well as the data specified in the conf file, of course)

isethy commented 5 years ago

Thanks @LosD. I have created a DataCleaner job and I am ready to run it now. This is the command that I need to modify and run:

docker run --rm -v ~/.datacleaner/5.2.2:/dc_data kaspersor/datacleaner -conf /dc_data/conf.xml -job /dc_data/jobs/myjob.analysis.xml

I do not know what the equivalent of -v ~/.datacleaner/5.2.2:/dc_data is on my system. How do I modify this command to run my .xml file?

Additional Info:

username$ pwd
/Users/username/Downloads/DataCleaner
username$ ls -ltr
total 224
-rw-r--r--@   1 username COMNP\Domain Users   278 Sep 14 2016 NOTICE.txt
-rw-r--r--@   1 username COMNP\Domain Users  7802 Sep 14 2016 COPYING.txt
-rwxr-xr-x@   1 username COMNP\Domain Users  1313 Sep 19 2016 datacleaner.sh
-rw-r--r--@   1 username COMNP\Domain Users 95390 Jan 25 2017 DataCleaner.jar
drwxr-xr-x@ 256 username COMNP\Domain Users  8192 Jan 25 2017 lib
drwxr-xr-x@   3 username COMNP\Domain Users    96 Jan 25 2017 DataCleaner.app
Hostname:DataCleaner username$

LosD commented 5 years ago

The -v argument mounts a volume inside the docker container.

So what @kaspersorensen did there was mount the folder on his system (~/.datacleaner/5.2.2) to a folder inside the container (/dc_data).

So if instead your job and conf files are in e.g. /my/input/data, the command would be:

docker run --rm -v /my/input/data:/dc_data kaspersor/datacleaner -conf /dc_data/conf.xml -job /dc_data/jobs/myjob.analysis.xml
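Since the container only sees what is inside the mounted folder, it can help to confirm on the host that the files the command refers to are really there before running it. A minimal sketch, assuming a POSIX shell; the function name check_dc_inputs is made up for this illustration and is not part of DataCleaner or Docker:

```shell
# Hypothetical helper: verify that each relative path exists as a regular
# file under the host folder that will be mounted with -v.
check_dc_inputs() {
  dir=${1%/}            # host folder, trailing slash removed
  shift
  missing=0
  for rel in "$@"; do
    if [ ! -f "$dir/$rel" ]; then
      echo "missing: $dir/$rel" >&2
      missing=1
    fi
  done
  return "$missing"     # 0 if everything was found, 1 otherwise
}
```

With the folder from the example above, the call would be check_dc_inputs /my/input/data conf.xml jobs/myjob.analysis.xml, and any path it reports as missing would also be missing under /dc_data inside the container.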

isethy commented 5 years ago

OK, I understand this part. So basically I have to substitute my folder info, something like the below:

docker run --rm -v /Users/username/.datacleaner/5.1.5/:/dc_data kaspersor/datacleaner -conf /Users/username/.datacleaner/5.1.5/conf.xml -job /Users/username/.datacleaner/5.1.5/Test_datacleaner/table_select_values.analysis.xml

One question: I do not see a container for DataCleaner yet. Where am I getting that dc_data from? Sorry, I am new to this, so I have many questions.

username$ docker ps -a
CONTAINER ID  IMAGE                        COMMAND                 CREATED       STATUS                         PORTS                                           NAMES
5041870d2d75  oracle/database:12.2.0.1-ee  "/bin/sh -c 'exec $O…"  2 hours ago   Up About an hour (healthy)     0.0.0.0:1521->1521/tcp, 0.0.0.0:5500->5500/tcp  oracle1
5c11c92a16d6  mysql:5.6                    "docker-entrypoint.s…"  47 hours ago  Created                                                                        local_mysql
username$

LosD commented 5 years ago

The container isn't there because it is removed as soon as the command finishes (--rm does that). Otherwise, every run of the command would leave a new stopped container behind.

"dc_data" is just an arbitrary path where your DataCleaner folder is mounted inside the temporary container. You could call it "my_amazing_datacleaner_folder" if you wanted. However, you need to use that folder for the whole command, and every file path must be given relative to it, so if we're using "dc_data", the correct command would be:

docker run --rm -v /Users/isethy/.datacleaner/5.1.5/:/dc_data kaspersor/datacleaner -conf /dc_data/conf.xml -job /dc_data/Test_datacleaner/table_select_values.analysis.xml

LosD commented 5 years ago

(The reason for the whole folder juggling is that commands run inside a Docker container cannot access the host file system directly, so it's not possible to just load the files without mounting the folder inside the container and accessing them from there. It's a bit of an annoyance, but that's part of the Docker deal, I guess. It does make Docker quite a bit more secure: a malicious image can't just go crazy on your host computer.)
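The host-to-container path mapping described above can be sketched as a small shell helper. This is only an illustration of how a single -v host:container mount translates paths; the function name to_container_path is made up for this example and is not a Docker or DataCleaner command:

```shell
# Hypothetical helper: given the two halves of "-v host_root:container_root"
# and a path on the host, print the path the container would see.
to_container_path() {
  host_root=${1%/}         # strip a trailing slash, if any
  container_root=${2%/}
  host_path=$3
  case "$host_path" in
    "$host_root"/*)
      rel=${host_path#"$host_root"/}     # part of the path below the mount
      printf '%s\n' "$container_root/$rel"
      ;;
    "$host_root")
      printf '%s\n' "$container_root"
      ;;
    *)
      echo "not under mount: $host_path" >&2
      return 1
      ;;
  esac
}

# Prints /dc_data/conf.xml
to_container_path /Users/username/.datacleaner/5.1.5/ /dc_data \
  /Users/username/.datacleaner/5.1.5/conf.xml
```

Any host path that the helper rejects as "not under mount" is a file the container cannot see at all, which is why -conf and -job have to point below /dc_data rather than at /Users/... paths.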

isethy commented 5 years ago

Wow, that's a great explanation. Thanks @LosD for explaining it so nicely!

LosD commented 5 years ago

You're very welcome! :)

Did you get it to work?

isethy commented 5 years ago

@LosD: I will have to download DataCleaner again, as I goofed up; I had a couple of DataCleaner versions installed. I will run the command and let you know.

isethy commented 5 years ago

@LosD @kaspersorensen

I am running into an issue while executing the DataCleaner job through Docker. I was just curious: is the Oracle JDBC driver included in your Docker image?

$ docker run --rm -v ~/.datacleaner/5.2.2:/dc_data kaspersor/datacleaner -conf /dc_data/conf.xml -list DATASTORES -job /dc_data/jobs/Test_datacleaner/table_select_itm_alias_src_val.analysis.xml -ds oracle_meds_dev_sb
INFO 21:26:30 DataCleanerHome - Initializing DATACLEANER_HOME
INFO 21:26:30 DataCleanerHome - Running in standard mode.
Failed to load DataCleaner version from manifest: null
Failed to load DataCleaner version from manifest: null
INFO 21:26:30 DataCleanerHome - Attempting to build DATACLEANER_HOME in user.home: /root/.datacleaner/5.2.2 -> file:///root/.datacleaner/5.2.2
INFO 21:26:30 DataCleanerHome - Folder file:///root/.datacleaner/5.2.2 does not exist. Trying to create it.
INFO 21:26:30 DataCleanerHome - Folder file:///root/.datacleaner/5.2.2 created successfully. Attempting to build DATACLEANER_HOME here.
INFO 21:26:30 DataCleanerHomeUpgrader - Did not find a suitable upgrade candidate
Using default log configuration: jar:file:/opt/DataCleaner/DataCleaner.jar!/org/datacleaner/log4j-default.xml
ERROR 21:26:32 CliRunner - Exception thrown in java.lang.IllegalStateException: Could not initialize JDBC driver
Error: java.lang.IllegalStateException: Could not initialize JDBC driver
    at org.datacleaner.connection.JdbcDatastore.initializeDriver(JdbcDatastore.java:258)
    at org.datacleaner.connection.JdbcDatastore.createDataSource(JdbcDatastore.java:197)
    at org.datacleaner.connection.JdbcDatastore.createDatastoreConnection(JdbcDatastore.java:267)
    at org.datacleaner.connection.UsageAwareDatastore.getDatastoreConnection(UsageAwareDatastore.java:114)
    at org.datacleaner.connection.UsageAwareDatastore.openConnection(UsageAwareDatastore.java:125)
    at org.datacleaner.connection.JdbcDatastore.openConnection(JdbcDatastore.java:127)
    at org.datacleaner.connection.JdbcDatastore.openConnection(JdbcDatastore.java:50)
    at org.datacleaner.configuration.SourceColumnMapping.autoMap(SourceColumnMapping.java:76)
    at org.datacleaner.job.JaxbJobReader.create(JaxbJobReader.java:368)
    at org.datacleaner.cli.CliRunner.runJob(CliRunner.java:362)
    at org.datacleaner.cli.CliRunner.run(CliRunner.java:180)
    at org.datacleaner.bootstrap.Bootstrap.runCli(Bootstrap.java:276)
    at org.datacleaner.bootstrap.Bootstrap.runInternal(Bootstrap.java:194)
    at org.datacleaner.bootstrap.Bootstrap.run(Bootstrap.java:102)
    at org.datacleaner.Main.main(Main.java:165)
    at org.datacleaner.Main.main(Main.java:150)
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at org.datacleaner.connection.JdbcDatastore.initializeDriver(JdbcDatastore.java:256)
    ... 15 more
$

LosD commented 5 years ago

I think that is the message you'll get when DataCleaner cannot find the JDBC driver for the database. Could I get you to post your .conf file (make sure to remove anything sensitive like passwords first. If I remember correctly, they're encrypted, but only well enough to avoid quick glances, not enough for anyone determined to crack them)?

@kaspersorensen Do we have any way to load other JDBC drivers in the Docker version, or will a custom Dockerfile be needed? Looking at the standard Dockerfile I think that the answer is "custom Dockerfile", but maybe there's something I haven't considered.
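The ClassNotFoundException in the trace above names the class Java could not find (oracle.jdbc.OracleDriver). One way to check whether a given driver JAR actually contains that class is to look for the matching entry in the JAR's file listing. A minimal sketch, assuming a POSIX shell and that the Info-ZIP unzip tool is installed; the function names are made up for this illustration:

```shell
# Hypothetical helper: convert a fully qualified Java class name into the
# entry path that class has inside a JAR file.
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Hypothetical helper: succeed if the JAR ($1) lists the class ($2).
# Relies on `unzip -l`, which prints the archive's file listing.
jar_has_class() {
  unzip -l "$1" | grep -qF "$(class_to_entry "$2")"
}

# Prints oracle/jdbc/OracleDriver.class
class_to_entry oracle.jdbc.OracleDriver
```

For example, jar_has_class /path/to/ojdbc.jar oracle.jdbc.OracleDriver && echo found would confirm a candidate JAR before it is added to the classpath (the JAR path here is a placeholder).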

kaspersorensen commented 5 years ago

Hmm interesting issue. I don't think the current version has a way to add custom drivers. But in #1823 the discussion contains a fix for adding a "local lib folder" which could solve this issue too.

Unrelated remark: The exception message thrown here could be improved by adding the driver class that it's trying to initialize. I'll post a small fix (PR) for that.

isethy commented 5 years ago

Please find the content of conf.xml

cat conf.xml
<?xml version="1.0" encoding="UTF-8" standalone="no"?>

DataCleaner configuration Configures DataCleaner's initial environment. This includes example datastores and example reference data. 4.0 DataCleaner.org jdbc:hsqldb:res:orderdb;readonly=true org.hsqldb.jdbcDriver SA username password token datastores/customers.csv UTF8 " , jdbc:mysql://localhost:33063/XXXXXXX?defaultFetchSize=-2147483648&largeRowSizeThreshold=1024&zeroDateTimeBehavior=convertToNull com.mysql.jdbc.Driver XXXXXXX XXXXXXXXX true jdbc:mysql://localhost:33069/?defaultFetchSize=-2147483648&largeRowSizeThreshold=1024 com.mysql.jdbc.Driver XXXXX XXXXXXX true jdbc:oracle:thin:@localhost:62985/XXXXXX oracle.jdbc.OracleDriver XXXXXX XXXXXXX true datastores/job_title_synonyms.txt UTF8 false DataCloud XXXXXXX XXXXXXXXX

isethy commented 5 years ago

@LosD ^^^^

LosD commented 5 years ago

~Please note that this does not currently work, see end of post~

Edit: It works! It was just me being a fool with path separators. I've updated the code, and bumped version to 5.7.0 (see discussion about 5.2.2 and old configurations later)

Something seems to have gone wrong with the posting of your config, but anyway, it looks like you need Oracle and MySQL drivers. Neither of those is distributed with DataCleaner, so we will need two things: a custom Dockerfile, and a way to get the drivers into the Docker container.

If we start with the last part, that is pretty simple: We'll just do the same trick that has already been done for the config file and job file: Copy the JDBC drivers to a folder (we'll call it "jdbc_drivers" and put it in the home folder in this example), then map the folder as a volume. Let's call the volume "/dc_custom_libs":

docker run --rm -v ~/jdbc_drivers:/dc_custom_libs -v ~/.datacleaner/5.2.2:/dc_data kaspersor/datacleaner -conf /dc_data/conf.xml -job /dc_data/jobs/Test_datacleaner/table_select_itm_alias_src_val.analysis.xml

... but of course, that won't help if DC doesn't know how to load from there. Let's try to fix that.

Here is a custom Dockerfile, based on the standard one, and the ideas from the #1823 discussion referenced above:

FROM openjdk:8-jdk-alpine

RUN apk add --no-cache curl unzip

ENV DATACLEANER_VERSION 5.7.0

RUN mkdir -p /dc_custom_libs && \
  mkdir -p /opt && \
  curl -L https://github.com/datacleaner/DataCleaner/releases/download/DataCleaner-$DATACLEANER_VERSION/DataCleaner-$DATACLEANER_VERSION.zip > /opt/datacleaner.zip && \
  unzip /opt/datacleaner.zip -d /opt && \
  rm -f /opt/datacleaner.zip

WORKDIR /opt/DataCleaner

ENTRYPOINT ["java","-cp","/opt/DataCleaner/DataCleaner.jar:/opt/DataCleaner/lib/*:/dc_custom_libs/*","org.datacleaner.Main"]

Now, we'll need to build that. Run this from the folder containing the Dockerfile (the trailing dot is the build context):

docker build -t datacleaner:custom .

Then run the job using the new tag (notice that I'm using "datacleaner:custom" instead of kaspersor/datacleaner, since we're now running a local tag):

docker run --rm -v ~/jdbc_drivers:/dc_custom_libs -v ~/.datacleaner/5.2.2:/dc_data datacleaner:custom -conf /dc_data/conf.xml -job /dc_data/jobs/Test_datacleaner/table_select_itm_alias_src_val.analysis.xml

~This is here where I'd love to say that it worked perfectly. Unfortunately, it doesn't, and I'm not sure why. I get this error message: Error: Could not find or load main class org.datacleaner.Main, which is odd, since I'm explicitly including all jar files (main JAR as well as auxiliary JARs in lib), but I must admit it's been a while since I used Java, so maybe it's me that has forgotten how classpath loading is supposed to work. Any inputs @kaspersorensen?~

isethy commented 5 years ago

Thank you @LosD @kaspersorensen. I was able to make it run by overriding the entrypoint at runtime, adding all the classpath locations (from my system and from the Docker image) as below, and then calling the main class.

docker run --rm --entrypoint java \
  -v /Users/username:/Users/username kaspersor/datacleaner \
  -cp "/Users/username/oracle/instantclient_12_1/ojdbc7-12.1.0.2.jar:/opt/DataCleaner/DataCleaner.jar:/opt/DataCleaner/modules/:/opt/DataCleaner/lib/" org.datacleaner.Main \
  -conf /Users/username/.datacleaner/5.1.5/conf.xml.docker -job \
  /Users/username/.datacleaner/5.1.5/jobs/Test_datacleaner/table_select_itm_alias_src_val.analysis.xml

isethy commented 5 years ago

@LosD, I need help removing the error messages from the log. I believe they appear because the Docker image has some bad login credentials.

ERROR 15:19:31 RemoteDescriptorProviderImpl - Cannot get list of remote components on https://services.datacleaner.org
org.datacleaner.restclient.RESTClientException: Bad credentials (error code 401)

LosD commented 5 years ago

First, that's great! ... And your example showed exactly what the issue was with mine: It's because I'm an idiot. 😄

I used the Windows path separator, ;, but it's a Linux Docker image, so I should of course have used the Unix path separator :. 🤦‍♂
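On the path-separator point, a small helper that always joins classpath entries with the Unix separator ":" avoids this kind of slip when the command is assembled by hand. A minimal sketch, assuming a POSIX shell; join_classpath is a made-up name for this illustration:

```shell
# Hypothetical helper: join all arguments into a Java classpath string
# using ':' (the separator on Linux and macOS; Windows uses ';').
join_classpath() {
  classpath=
  for entry in "$@"; do
    # Add a ':' before every entry except the first.
    classpath=${classpath:+$classpath:}$entry
  done
  printf '%s\n' "$classpath"
}

# The wildcard entries are quoted so the shell passes them through literally;
# Java itself expands "dir/*" to the JAR files in that directory.
join_classpath /opt/DataCleaner/DataCleaner.jar '/opt/DataCleaner/lib/*' '/dc_custom_libs/*'
```

The call at the end prints the same classpath string that the custom Dockerfile's ENTRYPOINT passes to java -cp.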

The bad credentials error is coming from the DataCloud part of your configuration. That service was shuttered at the end of 2018 and wasn't removed from DataCleaner OSS until 5.3, so the 5.2.2 version is still trying to contact the DataCloud service. Just remove that section from your conf.xml, and I believe the error should stop.

isethy commented 5 years ago

Thank you @LosD

LosD commented 5 years ago

You're welcome 😄 Is everything working as it should now? Then I guess we can close this.

@kaspersorensen do you want me to make a PR to your datacleaner-docker repo with the version bump and the /dc_custom_libs change?

kaspersorensen commented 5 years ago

Do I want a free cake? Sure ;-)