eXist-db / exist

eXist Native XML Database and Application Platform
https://exist-db.org
GNU Lesser General Public License v2.1
421 stars 179 forks source link

REST PUT of docs with entity resolution fails differently if uploaded in different order #1757

Open marc-karnovgroup opened 6 years ago

marc-karnovgroup commented 6 years ago

What is the problem

Uploading of many documents with DTD entity resolution enabled fails on different documents dependent on order of docs uploaded

Se attached exist_jetty_2018_02_28.request.log

Examples of failing/working PUT uploads (with same running server instance, but different upload order) are

 grep oaaLBKG20061001_H.7000370945.xml exist_jetty_2018_02_28.request.log
 grep oaaSVEJL201632-1.1_R.7000754745.xml exist_jetty_2018_02_28.request.log
 grep oaaBKG2010284_R.7000431545.xml exist_jetty_2018_02_28.request.log
 grep oaaMI_SU_2012_490_R.7000619809.xml exist_jetty_2018_02_28.request.log

What did you expect

Out of 4637 XML docs 62 do not validate against the DTD's, so I would expect 4637 - 62 = 4575 documents uploaded without problems.

I expect the same document to either fail or succeed, independently of order of post requests on same document set.

I expect the number of uploaded documents to be equal for multiple uploads of the same documents.

Describe how to reproduce or add a test

Those 4 uploads of the same files have been done in different order using curl PUT:

for f in `ls /home/marc/karnov/data/exp/` ; do
    echo "$f"
    url="http://localhost:8080/exist/rest/db/karnov/dk/101/$f"
    echo -n "PUT $url " >> exist_put.log
    curl -u admin:nimda -w "%{http_code}" -X PUT -T "/home/marc/karnov/data/exp/$f" "$url" >> exist_put.log
    echo "" >> exist_put.log
done

replace the ls in the first line with ls -r , ls -S , and ls -rS for each upload, thus enforcing different order. More docs are added to the same DB (no document remove between runs):

ls: 4474 documents
ls  and ls -r: 4563 documents
ls  and ls -r and ls -S : 4574 documents
ls  and ls -r and ls -S and ls -rS: 4577 documents

I can provide the used XML Catalogs, DTD's and XML data files upon request per e-mail, but can not upload them in a public visible bug report.

Context information

Please always add the following information

exist_jetty_2018_02_28.request.log

Contact info e-mail: marc-dot-cromme-at-karnovgroup-dot-com

dizzzz commented 6 years ago

Note that there is grammar caching done on the server; You might want to check with `validation:show-grammar-cache() what is happening.

See http://exist-db.org/exist/apps/fundocs/view.html?uri=http://exist-db.org/xquery/validation&location=java:org.exist.xquery.functions.validation.ValidationModule

marc-karnovgroup commented 6 years ago

Well, the grammar cache looks OK http://localhost:8080/exist/rest/db/karnov/101?_query=validation:show-grammar-cache() shows expected DTD's <LiteralSystemId>alllov.dtd</LiteralSystemId> and <LiteralSystemId>allftbook.dtd</LiteralSystemId>

But I don't see why a cached grammar should result in different upload behaviour dependent on upload order ??? The DTD's should be hit and cached on firs encounter, no matter what document upload order is given ??

dizzzz commented 6 years ago

it was just an idea; I gave up on DTDs years ago....

adamretter commented 6 years ago

@marc-karnovgroup is there any extraneous error information present in exist.log?

Also are you able to reduce this to smaller test with just 2 or 3 files for validation?

marc-karnovgroup commented 6 years ago

I'll attach the exist.log here. The Catalog seems loaded fine (and otherwise I would have much more failures in XML load).

2018-02-28 11:59:37,817 [main] INFO  (Configuration.java [configureValidation]:1493) - Add catalog uri file:///home/marc/karnov/git/xml-schemas/catalog-ilse-variant-utf-8.xml 
2018-02-28 11:59:37,818 [main] INFO  (GrammarPool.java [<init>]:54) - Initializing GrammarPool. 
2018-02-28 11:59:37,820 [main] INFO  (JettyStart.java [run]:164) - Configuring eXist from /home/marc/local/exist_400/conf.xml

There is one error under startup, which seem unrelated to the issue

2018-02-28 11:59:38,974 [main] INFO  (MimeTable.java [load]:244) - Loading mime table from file /home/marc/local/exist_400/mime-types.xml 
2018-02-28 11:59:40,331 [main] ERROR (DocumentTriggers.java [afterCreateDocument]:238) - Invalid URI: xmldb URI scheme does not start with xmldb:: http://exist-db.org/xquery/console 
java.lang.IllegalArgumentException: Invalid URI: xmldb URI scheme does not start with xmldb:: http://exist-db.org/xquery/console
        at org.exist.xmldb.XmldbURI.create(XmldbURI.java:219) ~[exist.jar:4.0.0]
        at org.exist.xmldb.XmldbURI.append(XmldbURI.java:595) ~[exist.jar:4.0.0]

I have many of transactions not committed, though, which may explain why documents are not uploaded ??

2018-02-28 12:00:26,307 [qtp645323781-25] WARN  (TransactionManager.java [close]:186) - Transaction was not committed or aborted, auto aborting! 

exist.log

I can probably reduce the test set somehow, but the issue seems to be connected to many following REST PUT uploads and their order. The first file in the first upload ordered by ls failing dependent on order is oaaBKG2002928_R.7000255906.xml, which is the 556th file loaded.

If I upload some of the failing files one by one manually, they do succeed (except for the 62 ones which are not DTD valid), but the the upload order is also different then.

I tries also to run the same upload order with a one second sleep in between every upload, which results in the same amount of documents failing, so I do not believe it is a race condition.

I have tried this with both 3.6.1 and 4.0.0 and have repeated the experiment three times with the 4.0.0 version, all numbers of accepted documents according to upload order are repeatable, so it seems to be a deterministic issue.

I am happy to provide my XML files and Calatog setup, DTD's and the like by personal communication, including a step-by-step description on how to reproduce the issue. Please notice that this info is regarded confidential by my organization, and should only be used for bug verification.

marc-karnovgroup commented 6 years ago

I did test this some more, and come to the conclusion that it is not the DTD validation causing this erratic behaviour, but the entity resolution. All test documents are in latin-1 and have plenty of entities defined, which are resolved during the exist-db insert.

I did try to convert all input docments to uft8 with all entities resolved, using

xmlllint --dtdattr --noent --nonet --encode UTF-8 in_latin1.xml > out_utf8.xml

Upload of all utf8 documents succeeds now, regardless of order (except for those violating the DTD, or those having unresolved entities).

My working hypothesis is now that it is the mass of entity resolutions which does trigger the issue.

dizzzz commented 6 years ago

thnx for the input. Interesting observation....