Closed danizen closed 6 years ago
I've solved this the wrong way: in my Python code, which is quite subtle and hard to maintain. A better way is to write an implementation of `IJobStatusStore`, and then cleanly have my collector life cycle listener call `getJobSuite` and so on until it can call `setJobStatusStore`. There is no need to parse the file to update progress information; I can store it in MongoDB directly.
I'll leave this open because it would be better if the `FileJobStatusStore` were a bit more compatible and less subtle, but maybe that would be done by implementing a `PropertiesFileJobStatusStore` or something like that.
Because progress storing is done through an interface (IJobStatusStore), the implementation can be whatever you want or need. The default one is suited for JEF; it does not have interoperability with other languages as a goal and does not aim to be fully compatible with the Java Properties format (so there is no plan to change it).
Since this is a duplicate of #8, is working as designed, is not a defect, and you ask no questions, I am closing again. :-) Unless I missed something? We can re-open if so.
In your case, since you are using Python, I suggest you create a more portable implementation such as JSON or XML.
My decision to build a workflow engine around Norconex collectors using Spotify luigi, which is Python, has the consequence that I need to read and potentially write Java property files, such as the variables file for a crawl or even the JEF job status file.
Looking in `norconex.jef4.status.FileJobStatusStore`, I see that the status file is basically written as a properties file with locking, using `java.io.RandomAccessFile`. The locking is fine, and I've adapted to it. But the use of `writeUTF`, which writes a short and then the properties string as a modified UTF string, means that this file does not comply at all with the conventions of property files. Since property files are ISO-8859-1, or latin-1, the data may be first encoded as latin-1, and then written as modified UTF.
My intuition is that the use of `writeUTF` and `readUTF` is convenient, and that `RandomAccessFile` is used primarily for its locking capability. If I am right, then I am asking, as a compatibility fix, that the file comply with either Wikipedia's take on Java properties or another clear standard.
For now, I've updated my code to read a short (2 bytes), then read that many bytes, and then pass the result as a string buffer to my logic that reads property files, assuming it is unlikely that anything above 0x7F will appear there anyway.
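For anyone else doing the same from Python, here is a minimal sketch of that workaround. It assumes only what is described above: a 2-byte big-endian length prefix followed by that many bytes of modified UTF-8, as documented for Java's `writeUTF`. The function names, the simplified `key=value` parser (no escape or line-continuation handling), and the sample keys are hypothetical, not taken from JEF.

```python
import io
import struct


def read_java_utf(stream):
    """Read one string written by Java's DataOutput.writeUTF."""
    header = stream.read(2)
    if len(header) != 2:
        raise EOFError("missing 2-byte length prefix")
    (length,) = struct.unpack(">H", header)  # unsigned big-endian short
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("record is shorter than its length prefix")
    # Modified UTF-8 stores U+0000 as C0 80 and supplementary characters
    # as two 3-byte surrogates; undo both to get an ordinary str.
    text = payload.replace(b"\xc0\x80", b"\x00").decode("utf-8", "surrogatepass")
    return text.encode("utf-16", "surrogatepass").decode("utf-16")


def parse_simple_properties(text):
    """Parse plain key=value lines; a simplification of the real format."""
    props = {}
    for line in text.split("\n"):
        line = line.strip()
        if not line or line[0] in "#!":  # blank line or comment
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    # Hypothetical record, built by hand to mimic the writeUTF layout.
    body = "progress=0.5\nnote=caf\u00e9\n".encode("utf-8")
    record = struct.pack(">H", len(body)) + body
    print(parse_simple_properties(read_java_utf(io.BytesIO(record))))
```

Decoding the payload properly, rather than assuming everything is below 0x7F, costs two lines and avoids a silent failure if a non-ASCII note ever ends up in the status file. File locking is left out of this sketch.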