ITxPT / DATA4PTTools

Shared space for the development of the DATA4PT Greenlight NeTEx validation tool(s)
MIT License

xsd validation is too slow #9

Closed (skinkie closed this issue 1 year ago)

ollelar commented 2 years ago

Thank you, @skinkie. Can you be more specific? File size, time to complete test, what environment you're on etcetera.

skinkie commented 2 years ago

> Thank you, @skinkie. Can you be more specific? File size, time to complete test, what environment you're on etcetera.

I have checked files from several agencies; the most recent batch came from Denmark.

thbar commented 2 years ago

I can give a specific example with a data file. I tried to validate a largish file (~222 MB unzipped), which can be found at https://transport.data.gouv.fr/datasets/horaires-des-lignes-ter-sncf/?locale=en, section "NeTEx resources". Heads-up: the file is titled "Export au format CSV", but this is a data import bug that we must fix. Also, at the time of writing the file contains a single encoding error (ISO-8859-1 instead of UTF-8, see https://github.com/etalab/transport-qualite-des-donnees/issues/4), which you will have to fix manually for now.
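For the manual encoding fix mentioned above, one possible approach is sketched below in Python. It assumes the file is valid UTF-8 apart from a few stray ISO-8859-1 bytes; that is an assumption, since the thread does not say where the error sits or how it was actually fixed. The names `latin1_fallback`, `repair_utf8` and `repair_file` are hypothetical and not part of any validator.

```python
import codecs


def _latin1_fallback(err):
    """Error handler: decode only the offending bytes as ISO-8859-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("iso-8859-1"), err.end


# Register the handler under a hypothetical name so bytes.decode can use it.
codecs.register_error("latin1_fallback", _latin1_fallback)


def repair_utf8(data: bytes) -> str:
    """Decode as UTF-8, treating invalid byte sequences as ISO-8859-1."""
    return data.decode("utf-8", errors="latin1_fallback")


def repair_file(src: str, dst: str) -> None:
    """Read src, repair stray ISO-8859-1 bytes, write clean UTF-8 to dst."""
    # Reads the whole file into memory, which is assumed acceptable here.
    with open(src, "rb") as f:
        text = repair_utf8(f.read())
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(text)
```

Calling `repair_file("input.xml", "fixed.xml")` with your own paths would write a cleaned copy next to the original. Text that is already valid UTF-8 passes through unchanged, so the sketch only touches the broken bytes.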

I started running the validator outside of Docker, directly on a recent Mac M1; the process has been running for 40 minutes and still hasn't finished.

Happy to provide more input if needed!

thbar commented 2 years ago

(Final stat on the case I mentioned: the run took 47 minutes, and that was on a fairly beefy Mac M1; on our production setup, it would likely be much slower.)

pkvarnfors commented 2 years ago

We are aware of the performance issues, but so far we have prioritized getting the tool and web interface into a working state, including some extra validation scripts and better documentation. We will look further into how to improve performance in coming releases.

pkvarnfors commented 2 years ago

Examples of working and non-working or slow files are very useful; thank you for that.

pkvarnfors commented 1 year ago

Fixed in version 0.5.5