NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

A bundle that previously loaded throws an error on reload attempt #141

Closed scholes-ds closed 10 months ago

scholes-ds commented 10 months ago

Checked for duplicates

Yes - I've already checked

🐛 Describe the bug

When I tried to re-harvest the following bundle as part of the M2020 release, it threw the following error. The error occurred with and without the -O flag on the command line. The bundle file has not changed.

(bundle)
https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-hausrath_m2020_pixl_naltsos/

(error)
[INFO] Processing bundle directory \internalPath\data\m2020\urn-nasa-pds-hausrath_m2020_pixl_naltsos
Error on line 1 column 1
SXXP0003 Error reported by XML parser: Content is not allowed in prolog.: Content is not allowed in prolog.
[WARN] No bundles found in \internalPath\data\m2020\urn-nasa-pds-hausrath_m2020_pixl_naltsos

🕵️ Expected behavior

I expected [...]

📜 To Reproduce

This was using harvest version: 3.9.0-snapshot, build time: 2023-10-31T16:02:29Z

  1. Download the bundle and attempt a local harvest.
  2. ...

🖥 Environment Info

📚 Version of Software Used

harvest version: 3.9.0-snapshot, build time: 2023-10-31T16:02:29Z

🩺 Test Data / Additional context

https://pds-geosciences.wustl.edu/m2020/urn-nasa-pds-hausrath_m2020_pixl_naltsos/

🦄 Related requirements

🦄 #xyz

⚙️ Engineering Details

No response

al-niessner commented 10 months ago

@scholes-ds

Would you please attach your harvest configuration file (remove any secrets) and the harvest command line you are using as well? I want to reproduce as nearly as I can what you are doing - obviously I have to use a different DB but the rest should be identical.

tloubrieu-jpl commented 10 months ago

@scholes-ds was able to run this version of harvest on other bundles. It only fails on this one, the second time.

al-niessner commented 10 months ago

@jordanpadams @scholes-ds @tloubrieu-jpl

I was able to duplicate the problem with the latest harvest. There has been a fix to harvest that changes how it determines if a file is a bundle or not. The older method required the word "bundle" to be in the name. The new method looks in the XML file for the product class to see if it is a bundle -- some users do not want "bundle" to be in their names. It turns out that your bundle file starts with hex: bbef 3cbf 783f. A quick look at an ASCII table shows 3c to be < and 3f to be ?. It would seem the file is not UTF-8 but UTF-16 or something else. This is causing the XML check to fail because it thinks bbef is content before the prolog of <?xml...>.

Now, that makes it sound like there is a problem with the XML. There is not, because validate processes it just fine. I wanted to make sure the XML given was valid before moving forward, and it was, which is how I know validate is happy enough reading it.
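As a rough illustration of the content-based check described above (classifying a label by its product class rather than by its file name), here is a minimal Python sketch. It is not harvest's code: the helper names `product_class` and `is_bundle` are hypothetical, and it assumes a PDS4-style label in which a `product_class` element (or, failing that, the root element name) carries the value `Product_Bundle`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix ElementTree puts in front of tag names."""
    return tag.rpartition("}")[2]


def product_class(xml_bytes):
    """Return the label's product class (hypothetical helper).

    Assumption: the class is stored in a <product_class> element; if none
    is present, fall back to the name of the root element.
    """
    root = ET.fromstring(xml_bytes)
    for element in root.iter():
        if local_name(element.tag) == "product_class":
            return (element.text or "").strip()
    return local_name(root.tag)


def is_bundle(xml_bytes):
    """True when the label describes a bundle, whatever the file is called."""
    return product_class(xml_bytes) == "Product_Bundle"


# Synthetic label for illustration only; not the file from this issue.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<Product_Bundle xmlns="http://pds.nasa.gov/pds4/pds/v1">'
    b"<Identification_Area>"
    b"<product_class>Product_Bundle</product_class>"
    b"</Identification_Area>"
    b"</Product_Bundle>"
)

print(is_bundle(SAMPLE))
```

Because a check like this has to parse the file, anything the parser rejects in the first bytes stops the file from being recognized at all, which matches the "No bundles found" warning in the report.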

al-niessner commented 10 months ago

@jordanpadams @tloubrieu-jpl

Okay, now I am a bit unhappy. I was playing with harvest trying to get it working and it started to. Thought I had found a fix, but all the other XML files also fail with the prolog problem. Looked back at the bundle, and it has been rewritten as UTF-8. I have no idea where in harvest this happened. New item to chase that is going to delay this some. Does validate do the same thing?

al-niessner commented 10 months ago

@jordanpadams @tloubrieu-jpl

Oh thank goodness. It was my emacs that changed them to UTF-8. I was playing around trying to make the error go away and fixed it accidentally.

al-niessner commented 10 months ago

@jordanpadams @scholes-ds @tloubrieu-jpl

Fixed all XML files with emacs to really be UTF-8 and all works well. To change them, changed the UTF-8 to UTF-16 in the XML declaration. Saved. Changed UTF-16 back to UTF-8, and emacs senses that and writes the file as UTF-8. Interestingly, it saved the UTF-16 version in its original form.

Just for sanity, collected the data again and checked the md5:

$ md5sum bundle_hausrath_m2020_pixl_naltsos.xml
1aac93a1c69ad40392a067080de29ae5  bundle_hausrath_m2020_pixl_naltsos.xml
$ cat urn-nasa-pds-hausrath_m2020_pixl_naltsos.md5
1AAC93A1C69AD40392A067080DE29AE5  \bundle_hausrath_m2020_pixl_naltsos.xml

Despite some screaming (the uppercase hex in the .md5 file), they are the same file. Then looked at the bundle with hexdump again:

$ hexdump bundle_hausrath_m2020_pixl_naltsos.xml | head
0000000 bbef 3cbf 783f 6c6d 7620 7265 6973 6e6f

It is a 16-bit encoding. Technically, it is wrong because the XML declaration in the file says it is UTF-8. Still need to figure out why/how validate is indifferent to the encoding mismatch.

Ah, forgot to mention that the file size is 3 bytes smaller when corrected by emacs. So it is not really changing UTF-16 to UTF-8 but ditching the efbbbf (other XML files have a different prefix, and hexdump byte-swaps each pair). The prefix just makes the file look 16-bit encoded when it is not really. Not sure what these magic bytes are doing there, but they are erroneous and validate seems to ignore them.
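To make the byte swap concrete: by default, hexdump prints 16-bit words, so on a little-endian machine each pair of bytes appears reversed. The sketch below undoes that for the row shown earlier in this comment. The function name is made up for illustration, and little-endian word order is an assumption about the machine that produced the dump.

```python
def hexdump_words_to_bytes(row):
    """Convert space-separated 16-bit hexdump words back to file byte order.

    Assumption: the dump came from a little-endian host, so the word
    'bbef' stands for the two file bytes ef bb.
    """
    return b"".join(int(word, 16).to_bytes(2, "little") for word in row.split())


# First row of the hexdump output quoted in this thread (offset column dropped).
first_row = "bbef 3cbf 783f 6c6d 7620 7265 6973 6e6f"
data = hexdump_words_to_bytes(first_row)

print(data[:3].hex())            # efbbbf
print(data[3:].decode("ascii"))  # <?xml version
```

Read in file order, the label is three extra bytes followed by ordinary single-byte `<?xml version`, which is consistent with the file shrinking by exactly 3 bytes once emacs rewrote it.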

al-niessner commented 10 months ago

@jordanpadams @scholes-ds @tloubrieu-jpl

Fixed. They are the same if you do not change them... They are the BOM. Thought they looked familiar. Updated the reading of the file to handle the BOM. Works now. Adding some tests, then will add the fix shortly.
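For readers following along, here is a small Python sketch of what tolerating a UTF-8 BOM can look like when reading a label. It only illustrates the idea and is not the change made in harvest (which is not shown in this thread); `strip_utf8_bom` is a hypothetical helper, while `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data):
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Synthetic bytes for illustration; not the actual bundle label.
raw = codecs.BOM_UTF8 + b'<?xml version="1.0" encoding="UTF-8"?><Product_Bundle/>'
clean = strip_utf8_bom(raw)

print(len(raw) - len(clean))         # 3
print(clean[:5])                     # b'<?xml'

# Decoding with plain "utf-8" keeps the BOM as U+FEFF in front of the prolog;
# "utf-8-sig" drops it if present and is harmless when it is absent.
print(repr(raw.decode("utf-8")[0]))  # '\ufeff'
print(raw.decode("utf-8-sig")[:5])   # <?xml
```

Either approach leaves the `<?xml` prolog as the first thing a parser sees; which layer should do the stripping (the reader or the parser configuration) is a design choice this sketch does not settle.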

jordanpadams commented 10 months ago

@al-niessner thanks for tracking this down! very interesting how emacs handles this. ~also, apologies for my thickness, but what does BOM stand for?~

nevermind. I found it. BOM - ByteOrderMark

al-niessner commented 10 months ago

@jordanpadams

There is an endian war in progress on whether or not UTF-8 should even have a BOM. Now we know which side emacs is on.