Norconex / jef

Job Execution Framework.
Apache License 2.0

Jef job properties file is not a valid properties file #9

Closed danizen closed 6 years ago

danizen commented 6 years ago

My decision to build a workflow engine around Norconex collectors using Spotify luigi, which is Python, has the consequence that I need to read and potentially write Java property files, such as the variables file for a crawl or even the jef job status file.

Looking in norconex.jef4.status.FileJobStatusStore, I see that the status file is basically written as a properties file with locking, using `RandomAccessFile`. The locking is fine, and I've adapted to it.

But the use of writeUTF, which writes a short, and then the properties string as a modified UTF string means that this file does not comply at all with the conventions of property files. Since property files are ISO-8859-1, or latin-1, the data may be first encoded as latin-1, and then written as modified UTF.

My intuition is that use of writeUTF and readUTF is convenient, and that the use of `RandomAccessFile` is primarily for the locking capability. If I am right, then I am asking for a compatibility fix so that the file complies with either Wikipedia's take on Java properties or another clear standard.

For now, I have updated my code to read a short (2 bytes), then read that many bytes, and then pass the result as a string buffer to my logic that reads property files. I assume that it is unlikely that anything above 0x7F will appear there anyway.
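The workaround described above could look roughly like the following Python sketch. It is not the code actually used in this issue: the function names are hypothetical, the property keys in the sample are made up for illustration, the parser covers only plain `key=value` lines (no escapes or line continuations), and file locking is left out. It also relies on the same assumption stated above, that only ASCII characters appear in the payload.

```python
import io
import struct


def read_java_utf(stream):
    """Read one string framed the way Java's DataOutput.writeUTF frames it:
    a 2-byte big-endian unsigned length, then that many payload bytes."""
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError("missing 2-byte length prefix")
    (length,) = struct.unpack(">H", header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("payload shorter than declared length")
    # Assumption: ASCII only. Modified UTF-8 differs from standard UTF-8
    # for NUL and supplementary characters, so anything else is rejected.
    return payload.decode("ascii")


def parse_properties(text):
    """Parse a simplified subset of the Java properties syntax."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#!":
            continue  # skip blank lines and comments
        for i, ch in enumerate(line):
            if ch in "=:":  # first separator splits key from value
                props[line[:i].strip()] = line[i + 1:].strip()
                break
    return props


# Hypothetical sample mimicking the framing of a status file.
sample_text = "#comment\nprogress=0.5\nstatus=RUNNING\n"
raw = sample_text.encode("ascii")
sample = struct.pack(">H", len(raw)) + raw

props = parse_properties(read_java_utf(io.BytesIO(sample)))
```

With a real status file, `io.BytesIO(sample)` would be replaced by a file opened in binary mode. Rejecting non-ASCII bytes outright, rather than guessing at an encoding, keeps the failure visible if the assumption about 0x7F ever stops holding.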

danizen commented 6 years ago

I've solved this the wrong way: in my Python code, which is quite subtle and hard to maintain. A better way is to write an implementation of IJobStatusStore, and then cleanly have my collector life cycle listener call getJobSuite and so on until it can call setJobStatusStore. There is no need to parse the file to update progress information; I can store it in MongoDB directly.

I'll leave this open because it would be better if the FileJobStatusStore was a bit more compatible and less subtle, but maybe that would be done by implementing a PropertiesFileJobStatusStore or something like that.

essiembre commented 6 years ago

Because progress storing is done through an interface (IJobStatusStore), the implementation can be whatever you want or need. The default one is suited for JEF. It does not have interoperability with other languages as a goal and does not aim to be fully compatible with the Java Properties format (so there is no plan to change it).

Since this is a duplicate of #8, is working as designed, is not a defect, and you ask no questions, I am closing again. :-) Unless I missed something? We can re-open if so.

In your case, since you are using Python, I suggest you create a more portable implementation such as JSON or XML.