datacleaner / DataCleaner

The premier open source Data Quality solution
GNU Lesser General Public License v3.0

XML as source is missing fields when building a new job #1937

Open ruben-jardim opened 2 years ago

ruben-jardim commented 2 years ago

Dear DataCleaner community,

I wanted to test DataCleaner because I would like to propose it for an internal project in our team. However, when I loaded a big XML file of "Users" from SAP SuccessFactors, I noticed that some fields were missing, such as "onBoardingID", "custom04", "custom06", "custom07" and so on:

image

I tried it both on my Mac and on Windows, same issue.

Any idea why? Or how to work around it/solve it?

kaspersorensen commented 2 years ago

Hi @ruben-jardim. The tool can either auto-detect table/column structures or you can define them yourself. There's a bit of documentation about it here: https://datacleaner.github.io/docs/5.7.0/html/ch10s02.html#configuration_file_datastore_xml I'm guessing that you're using auto-detection and it isn't creating the tables you expected, so the fields end up spread across more tables than you anticipated.
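To check which fields the file really contains, and which element paths an explicit table definition would need to reference, the XML can be inspected outside DataCleaner. Below is a minimal Python sketch, not part of DataCleaner or MetaModel. The `element_paths` helper and the sample document structure (`Users`/`User`) are hypothetical; only the field names come from this thread.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(source):
    """Count every distinct element path in an XML document.

    Streams the input with iterparse so that large exports are not
    loaded into memory as one tree.
    """
    counts = Counter()
    stack = []  # tags of the currently open elements, outermost first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            counts["/" + "/".join(stack)] += 1
        else:
            stack.pop()
            elem.clear()  # drop text and children we no longer need
    return counts


# Hypothetical sample; a real export will have a different structure.
sample = io.BytesIO(
    b"<Users>"
    b"<User><userId>1</userId><custom04>x</custom04></User>"
    b"<User><userId>2</userId><onBoardingID>9</onBoardingID></User>"
    b"</Users>"
)
paths = element_paths(sample)
for path, count in sorted(paths.items()):
    print(count, path)
```

Paths that occur less often than the row element (here `/Users/User`) are fields that are absent from some records, which may be relevant to how auto-detection arranges them. If the file uses XML namespaces, ElementTree reports tags in `{namespace}name` form.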

ruben-jardim commented 2 years ago

Thanks for your reply. Indeed, I noticed that part of the documentation, but it wasn't clear to me where or how (in DataCleaner) I should write the XPath expressions to define these structures. Also, the link to MetaModel was broken, so I had to google for the most up-to-date version, and that didn't clarify how to do this inside DataCleaner (is there a folder where I should create a config file? Or is there a feature that lets me do this?).

Could you please help me understand the steps to achieve this? Any examples with screenshots would be greatly appreciated, if possible.

kaspersorensen commented 2 years ago

Hi @ruben-jardim. It was a while back, so I couldn't write this from memory, hence the delay in replying ;-) The datastore XML needs to be written manually into the conf.xml file, which you can find in $DATACLEANER_HOME. The location of the $DATACLEANER_HOME folder varies a bit depending on your operating system, but it is usually a folder called .datacleaner inside your user's home folder.
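A quick way to find out which conf.xml applies on a given machine is sketched below in Python. It assumes that `DATACLEANER_HOME` may be set as an environment variable and otherwise falls back to the `.datacleaner` folder in the user's home directory, as described above. The `find_conf_xml` helper is hypothetical and not part of DataCleaner.

```python
import os
from pathlib import Path


def find_conf_xml(env=None, home=None):
    """Return the expected location of DataCleaner's conf.xml.

    Assumption: DATACLEANER_HOME, if set, is an environment variable
    pointing at the folder that holds conf.xml. Otherwise the usual
    default, a ".datacleaner" folder in the user's home, is used.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    configured = env.get("DATACLEANER_HOME")
    base = Path(configured) if configured else home / ".datacleaner"
    return base / "conf.xml"


conf = find_conf_xml()
print(conf, "(exists)" if conf.is_file() else "(not found)")
```

If the file is reported as not found, DataCleaner may be using a different home folder on that system, so it is worth checking the installation before creating a new conf.xml by hand.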