problem with encoded additional HTML characters

zmiimz commented 7 years ago

I am experimenting with an XML string whose Name attribute contains an HTML character reference:

<Student Name="&#x00C9;mily" />

FoX's extractDataAttribute() returns the string "&mily" rather than the original one. I would be happier with the string "&#x00C9;mily", so that I at least have a chance to translate it manually to "Émily" ...

andreww commented 7 years ago

Could you post the Fortran code you are using along with the (X)HTML document (here or elsewhere)?

zmiimz commented 7 years ago

program    xml_mini
   use FoX_dom
   use FoX_sax
   implicit none
   integer :: i
   type(Node), pointer :: doc => null()
   type(Node), pointer :: p1 => null()
   type(Node), pointer :: p2 => null()
   type(NodeList), pointer :: pointList => null()
   character(len=100) :: name

   doc => parseFile("file.xml")
   if(.not. associated(doc)) stop "error doc"

   p1 => item(getElementsByTagName(doc, "Students"), 0)
   if(.not. associated(p1)) stop "error p1"
   write(*,*) getNodeName(p1)

   pointList => getElementsByTagname(p1, "Student")
   write(*,*) getLength(pointList), "Student elements"

   do i = 0, getLength(pointList) - 1
      p2 => item(pointList, i)
      call extractDataAttribute(p2, "Name", name)
      write(*,*) "number ", i," name = ", name
   enddo

   call destroy(doc)

end program xml_mini

file.xml

<Students>
  <Student Name="April" Gender="F" DateOfBirth="1989-01-02" />
  <Student Name="Bob" Gender="M"  DateOfBirth="1990-03-04" />
  <Student Name="Chad" Gender="M"  DateOfBirth="1991-05-06" />
  <Student Name="Dave" Gender="M"  DateOfBirth="1992-07-08">
    <Pet Type="dog" Name="Rover" />
  </Student>
  <Student DateOfBirth="1993-09-10" Gender="F" Name="&#x00C9;mily" />
</Students>

output

./xml_mini.x
Students
5 Student elements
number 0 name = April
number 1 name = Bob
number 2 name = Chad
number 3 name = Dave
number 4 name = &mily

andreww commented 7 years ago

I've now had a chance to take a proper look at this. I'm afraid the way FoX is set up (and, in particular, the way the SAX parser works) makes it impossible to 'smuggle' a non-ASCII character in and out of the DOM as a character reference. The main problem is that tokenisation of the document involves converting character references into their ASCII representation and putting the result into an array of Fortran characters.

If &#x00C9; is included in text (between element tags), the SAX parser gives an error apologising that it "cannot digest" the character reference. This is the intended behaviour. When using the DOM you just end up with a "parsing failed" error, but this is ultimately the same error. I think it's a bug that you don't see this error when the character reference is part of an attribute value. This should probably be fixed...

To fix this properly would involve finally making the upgrade to allow FoX to handle Unicode. Those arrays of Fortran characters would need replacing with integer arrays of Unicode code points, and the reading and writing would need sorting out (Toby White once figured this bit out; it is possible in modern Fortran).
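
To illustrate the code-point idea, here is a minimal standalone sketch (not FoX code) that turns a numeric character reference into an integer code point and writes it back out as UTF-8 bytes. The names charref_to_codepoint and codepoint_to_utf8 are hypothetical, and the sketch assumes a one-byte default character kind for which char() accepts values 0 to 255; that holds for common compilers but is processor-dependent.

```fortran
program charref_sketch
   implicit none
   integer :: cp

   ! Hypothetical helpers; none of this is part of the FoX API.
   cp = charref_to_codepoint("&#x00C9;")
   write(*,'(a,z4.4)') "code point: U+", cp
   ! On a UTF-8 terminal this should display the accented name.
   write(*,'(a)') codepoint_to_utf8(cp) // "mily"

contains

   ! Parse "&#xHHHH;" (hex) or "&#DDDD;" (decimal) into a code point.
   ! Returns -1 if the string is not a well-formed numeric reference.
   function charref_to_codepoint(ref) result(cp)
      character(len=*), intent(in) :: ref
      integer :: cp
      integer :: n, ios
      character(len=16) :: fmt

      cp = -1
      n = len_trim(ref)
      if (n < 4) return
      if (ref(1:2) /= "&#" .or. ref(n:n) /= ";") return

      if (ref(3:3) == "x") then
         if (n < 5) return
         if (verify(ref(4:n-1), "0123456789abcdefABCDEF") /= 0) return
         write(fmt, '(a,i0,a)') "(z", n - 4, ")"
         read(ref(4:n-1), fmt, iostat=ios) cp
      else
         if (verify(ref(3:n-1), "0123456789") /= 0) return
         write(fmt, '(a,i0,a)') "(i", n - 3, ")"
         read(ref(3:n-1), fmt, iostat=ios) cp
      end if
      if (ios /= 0) cp = -1
   end function charref_to_codepoint

   ! Encode a code point as a UTF-8 byte sequence held in default characters.
   ! Assumption: char() maps 0..255 to single bytes (processor-dependent).
   ! Returns an empty string for values outside 0..U+10FFFF.
   function codepoint_to_utf8(cp) result(s)
      integer, intent(in) :: cp
      character(len=:), allocatable :: s

      if (cp < 0 .or. cp > 1114111) then
         s = ""
      else if (cp < 128) then
         s = char(cp)
      else if (cp < 2048) then
         s = char(192 + ishft(cp, -6)) // char(128 + iand(cp, 63))
      else if (cp < 65536) then
         s = char(224 + ishft(cp, -12)) // &
             char(128 + iand(ishft(cp, -6), 63)) // &
             char(128 + iand(cp, 63))
      else
         s = char(240 + ishft(cp, -18)) // &
             char(128 + iand(ishft(cp, -12), 63)) // &
             char(128 + iand(ishft(cp, -6), 63)) // &
             char(128 + iand(cp, 63))
      end if
   end function codepoint_to_utf8

end program charref_sketch
```

This only shows that the conversion itself is straightforward; the hard part described above is carrying code points through the tokeniser and the DOM instead of default-kind character arrays.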

I think any quick fix to try to avoid the problem by storing the character reference is going to be very messy and involve surgery to the SAX parser and, I think, modifications to the DOM code. I really wouldn't want to go down that road.

zmiimz commented 7 years ago

Dear Andrew, thank you for the answer. I am aware of the problems with Unicode characters in Fortran, but the ability to handle (or ignore) extended special XHTML characters is a rather basic and expected feature of any modern XML parser (this example comes from http://rosettacode.org/wiki/XML/Input#C, and most of the parsers used there support this kind of character transformation). So, without changing the input file mentioned above, is the only option for Fortran now to write an interface and use LIBXML2?

andreww commented 7 years ago

Yes, I think so - I certainly don't know of an XML parser written in Fortran that supports character references to Unicode characters.

andreww / fox

problem with encoded additional HTML characters #44