webdav file size: cannot GIT an exist webdav folder

Torcsi commented 6 years ago

What is the problem

File size improperly reported on webdav As a consequence, files cannot be stored into git.

What did you expect

Exact filesize; the webdav is not configurable, file size should be computable.

Describe how to reproduce or add a test

correction copy c:\temp\small.xml p:\apps\test dir p:\apps\test

... 4,096 small.xml

Context information

Please always add the following information

eXist-db version + Git Revision hash : seems irrelevant
Java version (e.g. Java8u121) : seems irrelevant
Operating system (Windows 7, Linux, MacOs) : windows
32 or 64 bit : seems irrelevant
Any custom changes in e.g. conf.xml : seems irrelevant
also reported to GIT https://github.com/git-for-windows/git/issues/1452 *

adamretter commented 6 years ago

@Torcsi At download time, eXist only knows the exact size of binary files. The size of XML files is estimated, due to serialization options etc.

However, there are two ways to deliver files over HTTP:

1) Using a Content-Length header, where you have to know the file size up front and write it into the HTTP header, before sending the content of the file itself.

2) Using a Transfer-Encoding: chunked header, which allows a streaming like approach. Where you send chunks of data, each chunk is preceded by its size. In this manner the server can buffer a small chunk, calculate the size of the buffer, send the size and then the data of the chunk, it then repeats the process until every chunk of the file is sent.

The implementation of WebDAV was done by @dizzzz. Perhaps he can comment on the approach taken in the WebDAV eXist extension? If we are using a Content-Length header, I think we should be able to easily change to Transfer-Encoding: chunked, assuming that there is nothing in the WebDAV RFC that forbids that (again @dizzzz is probably best placed to answer).

Torcsi commented 6 years ago

@adamretter thanks, developing specific HTML download is not an option, GIT cannot manage http:// ... as source directly (even if proper URL could be passed to an XQL that could do all you said, Content-Length way); performing folder traversing would mean redeveloping of half WebDAV @dizzzz I am sure nearly impossible to compute (efficiently) the file length for browsing a folder in WebDAV (for all files within a folder, it would require traversing a huge xml "file" just for getting the exact file size...: performance drop),

Have you any proposals/experience not necessarily with git for version control of eXist content? WebDAV is still a preference, since it can be integrated into e.g. intellij etc,. (Eclipse egit cannot be used anyway, due to Milton WebDAV 1.0 is insufficient, tested) Unfortunately for xml files as well. (Others we could anyway GIT right from the fs folder (somewhat dirty))

We have tested download to staging folder, git from there, upload from staging folder ==> also fails because the "creation" and "modified" dates are not maintained properly during upload, maybe @dizzzz could have a hint about. Want me to make another issue?

dizzzz commented 6 years ago

Firstly, please could you elaborate a bit more on your setup? I see no hooks for me to understand what you are doing, how to reproduce etc. One essential part is for example, which webdav client are you using? are you familiar with the limitations of this client [each client has quite a list of incompatible assumptions on how webDAV should work]

Secondly; this subject has been discussed quite recently, I already explained it all; @adamretter explained it correct for the difference between xml and non-xml documents.

Actually for binary content the size should be 100% accurate; in eXist for non-XML documents we count the actual number of bytes as inserted into the database, and this is 1:1 returned to a webdav client.

WIth this, it looks more like an issue in the Windows representation, unless the .txt document is actually an XML document. But I have doubts here.

dizzzz commented 6 years ago

I checked, for non-xml documents always the exact size is used:

https://github.com/eXist-db/exist/blob/develop/extensions/webdav/src/org/exist/webdav/MiltonDocument.java#L345

Torcsi commented 6 years ago

@dizzzz Sorry, my bad, changed the example to XML.

The complete story, environment, reproduction etc. is described for GIT here: https://github.com/git-for-windows/git/issues/1452 The index.html file there is obviously an XML file.

We want to use version control on an exist repository. XML files and "binary" files.

Client: Windows 10 built-in webdav client ==> should not be a problem, which, I am nearly sure. Confirmed: webdav size for binary (text) files is accurate., problem is with XML files "only".

Reproduce: copy any small file to a webdav folder or create an XML file from exide check the size in the webdav folder.

I understand more that it is nearly an impossible mission from WebDAV. Windows representation should not be a problem; if a file size is not reported accurately (which, I believe, is the case under unixish OSes as well) the struct stat cannot be properly read since WebDAV + exist computed the size to their best; the only way to open, seek to the end etc - which is not the way Windows/any other/ Explorer, a dir/ls command would do.

dizzzz commented 6 years ago

It is possible to set some environment variables to make the XML size estimation more accurate; on the cost of speed -which can be very very significant-

https://github.com/eXist-db/exist/blob/develop/extensions/webdav/src/org/exist/webdav/MiltonDocument.java#L58

public static final String PROPFIND_METHOD_XML_SIZE = "org.exist.webdav.PROPFIND_METHOD_XML_SIZE";
public static final String GET_METHOD_XML_SIZE = "org.exist.webdav.GET_METHOD_XML_SIZE";

accepted values: NULL, EXACT, APPROXIMATE (literal text, case insensitive)

Torcsi commented 6 years ago

@dizzzz The complete story, environment, reproduction etc. is described for GIT here: https://github.com/git-for-windows/git/issues/1452 The index.html file there is obviously an XML file.

dizzzz commented 6 years ago

But in general..... i don't think your git approach will work. Please check eXide for its file-sync functionality. Also the eXist-db Atomic editor plugin works very well.

dizzzz commented 6 years ago

index.html file there is obviously an XML file

only if the content-type has been set to an XML one

adamretter commented 6 years ago

@Torcsi I was not suggesting anything to do with HTML. I was suggesting that the WebDAV code in eXist could be improved to use chunked Transfer Encoding, which would ensure 100% accurate file sizes for both XML and Binary files.

@dizzzz Is there anything in WebDAV that prevents us from using chunked transfer encoding?

dizzzz commented 6 years ago

@adamretter I don't see the the connection with chunked-encoding with the original question yet. Additionally.... as chucked is part of the HTTP 1.1 spec it should be supported, but.....

I have seen many issues with several WebDAV client implementations in the past, so I 'trust' on some incompatibility issues when introducing the chunked encoding. :-(

Torcsi commented 6 years ago

thanks @dizzzz ad flag in Milton: WOW, but clear performance drop and may not worth

ad atomic: got the message, filesync helps in gitting we were thinking in IntelliJ (eclipse egit does not work (personally, I do not quite like atom, flies every 2 minutes for javascript because of tern, and git support is also quite simplistic) we use oxygen and considering IntelliJ We cannot use eXist 3.0 yet due to performance issues (indexing does not work as we expect).

@adamretter thanks, eXide synchronization: we do not use applications, frontend is clear JS thick client. Will try to transform current code to an application and see the support. We spend now hours with synchronization so I am thinking in any solutions including custom downloaders.

Torcsi commented 6 years ago

Info from GIT: @dscho

That is really unfortunate. Git relies on accurate file size information, it cannot function without it. I guess there are only two ways out of this:

get eXist to fix this, or

stop using Git on drives mapped into eXist's database.

adamretter commented 6 years ago

@dizzzz Okay so, the problem with delivering XML files in eXist-db over HTTP, is that we do not know the exact size of the XML file before we serialize it, so we cannot set an accurate Content-Length header.

However, we don't need to know the size of the entire XML file before serializing if we use chunked transfer encoding. Basically we tell Jetty we want to use chunked transfer encoding and then we write the data to the http output stream in the standard way. Jetty takes care of transmitting the file in smaller chunks along with the chunk size. Jetty does this by buffering each chunk, it then sends the size and the chunk. This is standard HTTP 1.1 stuff and supported by almost all clients.

Having a quick look at the Milton source code, it seems that it might support this by returning null for GettableResource#getContentLength. When null, whilst Milton doesn't explicitly set chunked transfer encoding, it does explicitly omit the Content-Length header, so it would really depend on how Jetty then processes this.

I see this can be enabled in eXist-db by setting the system property:

-Dorg.exist.webdav.GET_METHOD_XML_SIZE=true

so that might well be worth a try...

Cheers Adam.

Torcsi commented 6 years ago

@adamretter most probably the problem is not during file transfer, but during propfind get contentlength which must not try to render the data in XML at all.

dizzzz commented 6 years ago

@adamretter the transfer-encoding does not help for the 'Propfind" scenario where a 'directory listing' is created with the filesize onboard.

@Torcsi this is the scenario where in the 'slow' case each document is serialized to measure the size based on the serialization parameters. For a collection with 100 documents, each of the 100 documents will be serialized :-(

@adamretter some clients actually use the size in Propfind in stead of the size reported during "Get" of a document

dizzzz commented 6 years ago

@Torcsi it is possible to play a bit with both parameters; maybe the propfind= NULL option works for you. maybe not. I do not know the behaviour of the windows10 client.

Torcsi commented 6 years ago

C stat, fstat tested and just funny what it does.

first call: stat 0 fstat 120 second call: stat 120 fstat 120

after modifying the file first call 0 115 second call 115 115

also asked GIT people what do they use.

include "stdafx.h"

include

include

include

include <sys/stat.h>

int main() { const char * sFile= "p:\apps\test\test.xml"; struct stat fileStat; if (stat(sFile, &fileStat) < 0) return 1;

printf("Information for %s\n", sFile); printf("---------------------------\n"); printf("File Size: \t\t%d bytes\n", fileStat.st_size);

int file = 0; if ((file = open(sFile, O_RDONLY)) < -1) return 1;

if (fstat(file, &fileStat) < 0) return 1;

printf("Information for %s\n", sFile); printf("---------------------------\n"); printf("File Size: \t\t%d bytes\n", fileStat.st_size); close(file); char c = fgetc(stdin); }

Torcsi commented 6 years ago

@dizzzz Even smallest performance drop is excluded. We fight for every 10th of seconds. (1790 files in my smallest folder. Last year we have published and should store over 10.000 files.) (We cannot use exist 3.0 due to a 2x-6x performance drop... .)

I was thinking about running two servers (not at the same time) of the same DB one for GITting and one for running the processing software. But that sounds sick.

More to add: when copying an XML back to eXist the last modified file date becomes the current time. All but touch safe. ant sync also cannot work this way, continuously loosing track what is newer, what is older. Then add that git push/fetch-merge also fuzzy.

@adamretter application synchronize does not do "backward" synchronization (OOLLDD exide, maybe), i.e. if sthg changed in the folder, getting back to eXist does not seem changed.

@dizzzz atom eXist package: install already fails, retry, gee, you had same problem as well...

We are stuck with development and also safe deployment.

You did your best for sure and thanks for help... now It seems we must develop a content-sensitive sync application ourselves.

Should I leave this issue open?

joewiz commented 6 years ago

@Torcsi You wrote:

We cannot use eXist 3.0 yet due to performance issues (indexing does not work as we expect).

Could you please elaborate?

Torcsi commented 6 years ago

Thanks, @joewiz, my colleague will make a summary.

Torcsi commented 6 years ago

@joewiz study done, downloadable from here

unfortunately we cannot make synthetic tests, we are under high pressure (at this moment since a year :( )

Torcsi commented 6 years ago

@dizzzz information from GIT developer dscho

does git use lseek END, fstat or stat to determine length?

lstat(). Which in Git for Windows is either performed via GetFileAttributesW() (without the FSCache feature) or via FindFirstFileW()/FindNextFileW().

if anyway content is to be entirely read, isn't this a small file optimization to get the length?

But it is not! The entire idea of Git's index is to avoid reading the files' contents just to know whether they are up-to-date, and instead use the mtime and file size (and a couple more indicators for change).

It is a core requirement of Git to know accurate file sizes. It cannot work without.

joewiz commented 6 years ago

@Torcsi Thanks for posting. I have had to make some adjustments to my 2.2-era code when moving to 3.x to account for the identical problem. The fix is actually what you've effectively done on line 115: do not refer to the collection indirectly via a variable (e.g., $body_sections), but instead refer to collections directly (e.g., xmldb:xcollection(concat('/db/apps/sls/data/' ,$lang))/sls:data/sls:sections/sls[@uuid=$inner_uuid]/sls:body). In other words, change lines 79 and 105 to be more like line 115. Perhaps @wolfgangmm can comment on what has actually changed underneath the hood, but it probably has something to do with the query rewriting facility; it seems the query rewriter isn't as good at identifying node sets selected via variable that could benefit from indexes. I would like to be able to use variables to refer to collections, and in some cases I think I still do - there must be a pattern. But much of my code now uses collection() functions rather than variables that point to these nodes. I hope this helps you make the adjustments necessary to move to 3.x; its many other improvements are well worth the cost of making this change.

Torcsi commented 6 years ago

@joewiz thanks. Gosh, some lines of code to revise... We have systematically used the variable to avoid repetitive calculations of the same! especially if the $var (collection) is computed in a complex way, this should be optimized automagically. Surely others meet the same problem...

Anyway we will see with red if a query is not optimized.

Maybe this would worth an independent issue for keeping track of potential changes / similarities, but U would leave this out from this chain (off-topic from the original WebDAV issue above)?

Torcsi commented 6 years ago

@joewiz I have added a new issue here

Torcsi commented 6 years ago

I close this original issue related to WebDAV. It seems impossible that WebDAV could report a proper size. It can be achieved only by complete backup of a folder and working on the backup.

We must solve synchronization by other means (such as folder synchronization tools and custom backup for independent XML nodes with timestamps etc.).

eXist-db / exist