andreww / fox

A Fortran XML library
https://andreww.github.io/fox/

Parsing large files #24

Closed ramos closed 11 years ago

ramos commented 11 years ago

I am trying to use the FoX library to parse large xml files with data (~2MB). The methods (sax, dom) take forever to parse some files that MATLAB/C/Python/... can parse in a couple of seconds.

Is this a known issue with the library or am I doing something wrong?

andreww commented 11 years ago

Are you using a released version or the tip of the master branch from this repository? There are a series of changes in the repository that speed up sax (and thus the dom) which have not yet made it to a formal release. If the version in the repository is too slow could you share some more information about the file - does it have long strings of text between tags or many short elements in a complex structure? Have you managed to do any profiling?

ramos commented 11 years ago

1) I am using the tip of the master branch (commit a8174c338fefb0d417289ffdc307a27acc0cedc3).

2) The file is not very complex. Some simple metadata (name, date), and a large array of double precision real numbers (980x128). I can share the file with you:

https://ajaxplorer.nspoc.com//data/public/5224475096e85e0c2da0f8f179d29e49.php?lang=en

3) The code that I am trying is very simple: it just parses the file. No profiling yet... but for small files everything works like a dream.

bla bla

Write(*,*)'Starting to parse: ', asctime(gettime())
doc => parseFile(Trim(fn))
Write(*,*)'Finish to parse: ', asctime(gettime())

call destroy(doc)

4) Compiler/OS information

ifort (IFORT) 11.0 20081105
Copyright (C) 1985-2008 Intel Corporation. All rights reserved.

Linux colossus 2.6.39.4 #2 SMP Tue Jan 24 13:31:59 CET 2012 x86_64 GNU/Linux

andreww commented 11 years ago

Parsing that file is quite quick here: 5.305 s using sax with all the callbacks (using the code from sax_example.f90 with validation turned off and the file renamed staffNS.xml) and 5.005 s using the dom (dom_example_2 with the file renamed h2o.xml). This is slow relative to xmllint (a front end to libxml2), which gets through the whole lot in 0.131 s, but I get the impression you are getting much lower performance.

These timings are on a Mac with Gfortran:

gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/usr/local/libexec/gcc/x86_64-apple-darwin10.7.0/4.6.0/lto-wrapper
Target: x86_64-apple-darwin10.7.0
Configured with: ../gcc-4.6.0/configure --enable-languages=fortran,c++
Thread model: posix
gcc version 4.6.0 (GCC)

Maybe the first thing to do is see if swapping the compiler helps?

andreww commented 11 years ago

Added a makefile target (in example) to test the performance of most of the code (0a56452f9d8f4). This could be used in modified form to get additional info. Closing this issue (open a new one if needed).
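For anyone wanting to reproduce the DOM timing discussed in this thread without the example programs, something like the following minimal sketch may help. It is not the makefile target mentioned above. It assumes FoX is already built and that the program is compiled and linked against it; the file name data.xml and the program name are placeholders, and the standard system_clock intrinsic stands in for the asctime/gettime helpers from the original snippet.

```fortran
! Minimal sketch: wall-clock timing of a FoX DOM parse.
! Assumptions: FoX is built and linked in; 'data.xml' is a placeholder name.
program time_dom_parse
  use FoX_dom
  implicit none

  type(Node), pointer :: doc
  integer :: t_start, t_end, rate
  character(len=*), parameter :: fn = 'data.xml'

  ! system_clock returns a tick count and the number of ticks per second
  call system_clock(t_start, rate)
  doc => parseFile(fn)
  call system_clock(t_end)

  write(*,'(a,f10.3,a)') 'DOM parse took ', &
       real(t_end - t_start) / real(rate), ' s'

  ! Release the memory held by the document tree
  call destroy(doc)
end program time_dom_parse
```

Running the same source under two compilers (for example ifort and gfortran) would give a direct comparison of the kind suggested above.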