Exiv2 / exiv2

Image metadata library and tools
http://www.exiv2.org/

excessively often calls stat() #515

Closed jmberg closed 5 years ago

jmberg commented 5 years ago

I've found that when reading PNG files, (lib)exiv2 calls stat() excessively often:

$ strace -e stat exiv2 /path/to/image.png 2>&1|grep stat|wc -l
7265

Yes, the file in question is 114MB, but there's no good reason to call stat() on it over and over since it's unlikely to change in the time it takes to read :-) (also, given the various relatively recent security mitigations, doing syscalls has gotten slower)

FWIW, this was a file without any exif data, perhaps if a chunk early in the file had any you could abort?

jmberg commented 5 years ago

moved from https://bugs.kde.org/show_bug.cgi?id=400604 (digikam) moved from http://dev.exiv2.org/issues/1374

clanmills commented 5 years ago

Groan. I've no idea why this is true. I'm putting in 12 hour days at the moment to get to v0.27 for Christmas.

Here's what I see on my favourite test file (http://clanmills.com/Stonehenge.jpg) which I've converted to Stonehenge.png

rmills@rmillmbp-ubuntu:~/gnu/github/exiv2/exiv2/build$ strace -e stat exiv2 ~/Stonehenge.jpg  2>&1 | wc
     20     175    1767
rmills@rmillmbp-ubuntu:~/gnu/github/exiv2/exiv2/build$ strace -e stat exiv2 /Home/Stonehenge.png  2>&1 | wc
     18     167    1636
rmills@rmillmbp-ubuntu:~/gnu/github/exiv2/exiv2/build$ 

I'm surprised that basicio.cpp calls stat() more than once. However, 20/18 system calls doesn't seem excessive. When I filter with | grep fstat, 17/16 of those calls are fstat().

clanmills commented 5 years ago

I've looked at this. BasicIo::eof() calls ::stat() to get the file size and compares it to the current file position. It could alternatively call ::feof(). However, the semantics of BasicIo::eof() are "have you read all the data?", and that's slightly different from ::feof(), which does not return true until you try to read past the end-of-file.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, const char* argv[])
{
    if ( argc != 2 ) {
        const char* prog = argv[0];
        printf("usage: %s path\n", prog);
    } else {
        const char* path = argv[1];
        struct stat st;
        if ( ::stat(path, &st) == 0 ) {
            printf("size = %lld\n", (long long) st.st_size);
            char* buffer = (char*) malloc(st.st_size);
            if ( buffer ) {
                FILE* f = fopen(path, "r");
                if ( f ) {
                    size_t n = fread(buffer, 1, st.st_size, f);
                    printf("read %zu bytes\n", n);
                    printf("feof(f) = %d ftell(f) = %ld\n", feof(f), ftell(f));
                    fread(buffer, 1, 1, f); // read past the end to set the EOF flag
                    printf("feof(f) = %d ftell(f) = %ld\n", feof(f), ftell(f));
                    fclose(f);
                } else {
                    printf("%s did not open\n", path);
                }
                free(buffer);
            }
        } else {
            printf("%s does not exist\n", path);
        }
    }
    return 0;
}
553 rmills@rmillsmbp:~/temp $ ls -l eof.cpp ; make -B eof 2>/dev/null ; eof eof.cpp
-rw-r--r--@ 1 rmills  staff  1027  3 Nov 10:52 eof.cpp
c++     eof.cpp   -o eof
size = 1027
read 1027 bytes
feof(f) = 0 ftell(f) = 1027
feof(f) = 1 ftell(f) = 1027
554 rmills@rmillsmbp:~/temp $

You haven't provided evidence that this is causing a problem, so I'm very reluctant to change the code in this area. You're welcome to persuade me that this deserves attention.

cgilles commented 5 years ago

Robin,

All optimizations in this area are very important for the future. In digiKam, when you manage 100,000 images through an SQLite database, or 200,000 and more with a MySQL/MariaDB database, it's clear: there is a bottleneck in Exiv2.

This is even more problematic when image collections are on a shared device, accessed over the local network.

In digiKam, the process to register all image information in the database is simple:

1/ We scan the collections with Exiv2 in separate threads (read metadata).
2/ We make blobs to commit the data to the database. We don't register all metadata, only the most important fields used to perform complex searches.
3/ When users make changes in the database, a synchronization is performed with the image metadata. This is also done in separate threads (write metadata).

Steps 1/ and 2/ are problematic due to the large number of system calls made by Exiv2.

So profiling this part of Exiv2 is the way to go for the case of large collections of files.

Best Gilles Caulier

clanmills commented 5 years ago

Thank you for the background, Gilles. However that doesn't change my question "what's the problem?".

As you know, Gilles, the three active members of Team Exiv2 (myself, Luis and Dan) respond quickly and effectively to your requests (and to all our users). However, we have to prioritise between issues that must be fixed (such as security) and those that are nice to have. Do you consider this something that must be fixed?

I don't agree that Exiv2 is slow. We read hundreds of images per second. Is it possible that digiKam is reading metadata excessively? Is it possible to store the file timestamp (or md5sum) in your database and only read the image when it has been modified?

And, as you know, I find interactions with you to be stressful as everything is "you must fix this now". The Team are all volunteers. I know that you are also a volunteer with digiKam so you're familiar with the pressures of this hobby. Perhaps you could make our interactions less stressful.

jmberg commented 5 years ago

Umm, I feel I've gotten caught between fronts here :-)

TBH, I'm not really sure it's that much of a problem. I was trying to answer the question "Why does digikam take hours to rescan my database?", and the excessive stat() calls seemed like a potential issue.

In truth, the more likely "reason" is that the whole PNG file has to be read in the first place.

So in all likelihood I was blaming the wrong thing (a stat() for every read()) when the real problem was reading the whole PNG files to start with. Mind you, I probably have hundreds of similar files, so I'd have shipped dozens if not hundreds of GBs of data across my network doing this ...

For now, I'd definitely classify this as "would be nice to fix", at least so that it doesn't get blamed for performance issues erroneously. However, it's also true that there's a cost to syscalls, and that cost has recently been going up with security mitigations rather than down as in the past. So I'd even argue that for performance it would be good to read (PNG) files in larger than 8k chunks. I'd guess you could only see this difference on fast local drives though; my overall performance was definitely network-bound.

clanmills commented 5 years ago

What a nice and thoughtful message. Thanks.

Perhaps digiKam could create sidecar files with all of the metadata. The sidecar could contain the date stamps (or md5sums) of the file and would avoid compulsively re-reading every image. The sidecars could be on local storage. Calculating md5sum presumably requires reading the whole file, so perhaps that's not helpful for your situation.

I haven't thought about the format of PNG files for some time. As you've said, it is a linked list of chunks. There's no index to enable the reader to quickly locate items of interest, or to know that there is no metadata.

Can you send me one of your files, please? You've motivated me to look further into this.

For sure, I need to instrument something to know if changes are improving performance. Here are things that I can try:

1) Modify BasicIo::eof() to use ::feof() and, if it's false, check tell() == size().
2) Modify the BasicIo object to store the size if we don't write to the object.
3) Look at the PNG chunk scanner to see if it stops after it has found the Exif chunk (and ICC and XMP and IPTC). If you have lots of files with NO metadata, this optimisation won't help: we have to read the whole file to realise there is NO metadata.
4) In investigating this, I'll have more and other ideas.

Team Exiv2 is busy working to achieve Exiv2 v0.27 by the end of 2018. I'm happy to have a little break today from cross-platform build/test/buildserver work to look at your issue.

clanmills commented 5 years ago

I've had another three ideas!

1) Use a .exv sidecar

If digiKam stored a metadata .exv file beside the image, that would probably fix it. So if you have a file DSC_1234.png, the file DSC_1234.exv can provide the metadata. The code to create the sidecar already exists.

577 rmills@rmillsmbp:~ $ exiv2 -ea --verbose --force Stonehenge.png 
File 1/1: Stonehenge.png
Writing XMP data from Stonehenge.png to ./Stonehenge.exv
578 rmills@rmillsmbp:~ $ ls -l Stonehenge*
-rw-r--r--  1 rmills  staff     45814  3 Nov 12:48 Stonehenge.exv
-rw-r--r--@ 1 rmills  staff   6798775  4 Aug 22:34 Stonehenge.jpg
-rw-r--r--@ 1 rmills  staff  31735186  3 Nov 09:12 Stonehenge.png
579 rmills@rmillsmbp:~ $ 

2) Remote Io

Exiv2 supports a RemoteIo class. The derived class SshIo uses ssh. That would move the I/O onto the ssh server and thereby avoid excessive network activity.

I don't believe anybody uses SshIo because I've never had a single bug report or question about it! RemoteIo is an abstract class and its principal derived class is HttpIo. I believe HttpIo is used a lot and the other classes FtpIo, SftpIo and others were implemented in the web-ready project.

Of course, digiKam would require GUI changes to enable you to configure appropriately. We'd need to know the mapping between NFS <-> ssh or NFS <-> http and store public keys and so on.

These classes are efficient. They use random I/O to seek and read. In the case of SshIo, it uses the command dd on the server to read byte-ranges. However, for your scenario with metadata-free PNGs, we're probably still in trouble as we have to read the whole file!

3) Use a web service

The specification for web-ready included a Metadata Web Service. We didn't implement this as we didn't have enough time. The idea is to get the metadata via http(s) and all the I/O is performed on the server. This doesn't exist. web-ready was done in 2013. I hoped to continue the project in 2014, however I had other matters in my life that demanded my time.

http://clanmills.com/Homepages/2013/GSoCProposals.shtml

cgilles commented 5 years ago

Robin,

"Perhaps digiKam could create sidecar files with all of the metadata. The sidecar could contain the date stamps (or md5sums) of the file and would avoid compulsively re-reading every image. The sidecars could be on local storage. Calculating md5sum presumably requires reading the whole file, so perhaps that's not helpful for your situation."

==> Definitely no. That is the goal of the database. You cannot perform searches over a huge collection with sidecars without waiting a few days. The database is fast and can be centralized to host metadata for an agency: one server and many clients connected over the local network.

About sidecars: of course digiKam is able to use these files, but it's optional. Some users want to leave the original images untouched, some others want to host all changes in the images, some others both. So the use cases are complex and inherited from long digiKam experience (the project was started in 2001). Just take a look at the digiKam metadata settings panel to understand what I mean:

https://www.flickr.com/photos/digikam/31824318968 https://www.flickr.com/photos/digikam/44782449315 https://www.flickr.com/photos/digikam/31824318208 https://www.flickr.com/photos/digikam/44782448685 https://www.flickr.com/photos/digikam/31824316898

I recommend trying the application. We have standalone bundles for you; nothing is installed on your computer:

https://files.kde.org/digikam/

Best

Gilles Caulier

cgilles commented 5 years ago

About remote I/O and web services to share metadata: all of this is already performed by the database. MariaDB or MySQL can already be accessed through the network.

In digiKam we share more than image metadata.

See the digiKam metadata and collection setup panels below for details:

https://www.flickr.com/photos/digikam/30756263757 https://www.flickr.com/photos/digikam/44972055714 https://www.flickr.com/photos/digikam/31824521618 https://www.flickr.com/photos/digikam/44972055014 https://www.flickr.com/photos/digikam/31824520608

So, we need a real database. All serious photo-management applications, such as Apple's and Adobe's, use a database.

Gilles Caulier

clanmills commented 5 years ago

I agree with your design decision to use a database.

@jmberg is asking the question: "Why does digikam take hours to rescan my database?" I'm suggesting that keeping a sidecar beside the image would speed up the rescan process.

cgilles commented 5 years ago

Robin: The scan process is done when importing images from the camera. There is no XMP sidecar available at that point. Sure, we can generate the file at the import stage, but this will again increase the scan processing time.

As I already said, sidecar support is optional, through the metadata setup panel. So in the case of users who store metadata only in the images themselves, the problem is still here, and this use case needs to be considered in future profiling improvements.

Gilles Caulier

cgilles commented 5 years ago

Robin,

I am currently rewriting a lot of the test tools present in digiKam core as real Qt unit tests, where all performance and profiling information will be available de facto. The Qt QTest framework is really great and impressive:

http://doc.qt.io/qt-5/qtest-overview.html

One of my unit tests is a multi-threaded metadata reader/writer that runs over a huge collection of images. The tool has several goals:

1/ Check the stability of Exiv2 under heavy processing.
2/ Check the stability of Exiv2 with multi-threading and multi-core support.
3/ Detect the bottlenecks while reading or writing image metadata.

https://cgit.kde.org/digikam.git/tree/core/tests/metadataengine/metareaderthread.cpp

In this test tool, all database stages are removed. It uses the digiKam core multi-threaded API based on Qt (multi-platform). As a side story, I also use this API at my office to process real-time / fast-acquisition measurements in a process-control system for a nuclear reactor. The code rocks and is very stable over time (in fact, it never stops).

I will port this code as a unit test and I will be able to give you real processing-time values from Exiv2 in multiple use cases.

Best

Gilles Caulier

clanmills commented 5 years ago

@cgilles Thanks very much. I'm sure you have some good ideas and technology here. Team Exiv2 will have a meeting in December after we release v0.27 to discuss projects for 2019. You are welcome to join us, or, to make a project request. I know you are interested in stronger multi-core, multi-thread testing for Exiv2.

There are several major subjects already in focus:

1) Modernising the code and dropping support for C++98.
2) BigTiff and CR3 support.
3) Completing the transition of our test suite from bash to python.
4) Enhanced lens recognition.

Will the team agree to your request? I don't know. I personally will not contribute code to Exiv2 in future. In 2019 and 2020, I'll do user support, release engineering, buildserver, documentation and web-site. We intend to provide regular "dot" releases to v0.27 for the next couple of years. I'll be 70 on January 18, 2021 and will finally retire!

@jmberg I implemented caching the file size to cut calls to stat() to one per file (when reading). I couldn't measure the performance difference on our archive of 7000 test images stored on our buildserver (14GB). Exiv2 reads all the metadata in 3.5 seconds (2000 images/second). When I do this from my laptop (with the buildserver mounted via SMB), it takes 183 seconds: 60x slower. In JPEG files, the metadata is usually stored in the first 100k, and because Exiv2 uses random I/O, the network I/O is much less than 14GB (using nettop, I believe it's 120MB).

I doubt if there's a "quick code fix" to deal with your question "Why does digikam take hours to rescan my database?". I feel this has to be addressed in the product architecture, and I have proposed possible solutions.

@jmberg May I ask you to close this topic as further action doesn't seem possible.

cgilles commented 5 years ago

"May I ask you to close this topic as further action doesn't seem possible."

Robin: please keep this issue open until I give you the profiling measurements from my multi-threaded unit test. Depending on the results, this will guide you to the right way to improve some Exiv2 code later...

Gilles


clanmills commented 5 years ago

@cgilles Please open a new issue "Multi-threaded testing and profiling" and mark it as [Enhancement Request]. We'll keep that open and discuss it in December. I'm hoping to close all open issues on GitHub and Redmine for v0.27 GM. Redmine will be relocated to exiv2.dyndns.org read-only.

While I agree that your thoughts about testing and profiling are related to this issue, you are proposing a much larger initiative. The current subject has been investigated and discussed and therefore this matter is closed.

jmberg commented 5 years ago

Sorry, I wasn't around, saw the mails flying by but was out playing with the kids - now they're all asleep ... I definitely have more time to go until retirement ;-)

FWIW, I fully agree with closing this, I'd support closing it even after investigation without any code changes. My only goal was for you to be aware of this behavior in case you'd have wanted to fix it. As it turns out you have, but as I tried to say before, I'd be (have been) equally supportive of closing this as "it doesn't matter".

@clanmills, I'm not sure I see your change anywhere? Just curious about it, and perhaps also curious about looking at further PNG improvements. I don't have that much experience with C++, but I feel I'd be able to deal with it.

FWIW, regarding the original question of why digikam is so slow to rescan... I feel that I should elaborate a little, perhaps for the sake of posterity as well.

First of all, I think the whole thing probably happened to start with because I changed my username when changing computers, and thus my files appeared all moved around. I'm just glad that digikam actually seems to have preserved my database pretty faithfully across this.

Secondly, I think the reason for it being so slow is what I said above: basically the need to ship a humongous amount of data across the network, given my particular setup. Yes, the paths still changed, because I use a symlink in my home to access the mounted share.

Finally, the need to read a humongous amount of data (and in my case thus ship it across the network) is evidently inherent in the PNG spec. Support for the eXIf chunk (http://ftp-osl.osuosl.org/pub/libpng/documents/pngext-1.5.0.html#C.eXIf) isn't in exiv2 (master branch) yet as far as I can tell, and supporting just that clearly wouldn't be sufficient either since old files still need to be supported. Therefore, reading the whole file seems to unfortunately be necessary for any PNG file. It's just bad for me that I happen to have ~1500 PNG files with ~166GB in my library, out of ~50k files with a total of 375GB... As you said: "We have to read the whole file to realise there is NO metadata."

FWIW, also note that creating a sidecar file with an md5sum is pointless, to actually calculate the md5sum you have to read the whole file. The better fix would've been to not have to re-read the whole library, but like I described above that was likely a one-time event caused by the username change.

Since you asked - I've uploaded a sample file (was far too big for github) to https://johannes.sipsolutions.net/files/sample-big-ruegen.png (~117MB). There's nothing really special about it though, it's just the raw data off my film scanner. Feel free to use it for any purpose you see fit, including publishing (e.g. in a set of test images).

jmberg commented 5 years ago

I've also played with adding a cache to FileIo::size() myself, and that does seem to improve the speed by a factor of >2x (~200ms -> ~60ms) for me on an NFS network filesystem (with the data in local cache), perhaps for stat() NFS has to go talk to the server? I very much doubt this would really have helped my original case though, since of course there I had to copy all the data and that was the real issue.

Another thing I just tried was making the IO unbuffered (std::setvbuf(fp_, NULL, _IONBF, 0); and std::setvbuf(p_->fp_, NULL, _IONBF, 0);) in order to avoid having glibc read 4k/8k data for each chunk, but that also doesn't really seem to have made any difference. I thought that might help since the IDAT chunks are 32K and thus we only need to read 8 bytes to skip them rather than reading 4k/8k, but I guess the NFS wanted to pull down a full page or so anyway.

jmberg commented 5 years ago

Anyway, as far as I'm concerned there's no real bug here, it's just inherently slow.

clanmills commented 5 years ago

@jmberg Thank you for all the feedback here. I'm glad that you're happy to close this.

To make progress requires a change in the architecture of the application.

1) Drastically reduce the network I/O by reading the sidecar.

I don't see any reason why digiKam can't have an option to keep a sidecar parallel to the image. If it's younger than the image, read the metadata from the sidecar. Exiv2 can extract a .exv file. I'm not sure that Gilles has understood my suggestion.

2) Use a web-service to fetch the metadata from the image. The reading of the image would be performed on the server, not in digiKam! Perhaps one day somebody will volunteer for this project.

Anyway. Thanks very much for discussing this. It's all very interesting. It all takes time!

clanmills commented 5 years ago

I didn't submit the code with size caching. I played with it locally on my machine and then removed it.

clanmills commented 5 years ago

@jmberg Thanks for your test file.

I've played with that and the size/cache patch. I agree, this does some good.

However, I don't want to modify this code. We're at Exiv2 v0.27 RC1 and we're focused on "must fix" and I feel this is "nice to have". And while I've quickly created a patch, our test suite is failing all over the place. This is a "proof of concept" and will need more work for which I don't have time.

The results are interesting:

rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace exiv2 -pa --grep datetimeoriginal/i ~/sample-big-ruegen.png 2>&1 | grep ^stat | grep sample | wc
   7524   45144  714780

real    0m1.593s
user    0m0.397s
sys     0m1.498s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace bin/exiv2 -pa --grep datetimeoriginal/i ~/sample-big-ruegen.png 2>&1 | grep ^stat | grep sample | wc
      2      12     190

real    0m0.901s
user    0m0.140s
sys     0m0.942s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ 

Cutting the stat() calls from 7000+ to 2+ improves performance by 50% for your pathological file.

However on a JPEG, it cuts stat() from 7 to 2 and doesn't buy much:

rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace bin/exiv2 -pa --grep datetimeoriginal/i ~/Stonehenge.jpg 2>&1 | grep ^stat | grep Ston | wc
      2      12     172

real    0m0.036s
user    0m0.013s
sys     0m0.025s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace exiv2 -pa --grep datetimeoriginal/i ~/Stonehenge.jpg 2>&1 | grep ^stat | grep Ston | wc
      7      42     602

real    0m0.033s
user    0m0.013s
sys     0m0.022s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ 
diff --git a/include/exiv2/basicio.hpp b/include/exiv2/basicio.hpp
index df3707af..7095d7de 100644
--- a/include/exiv2/basicio.hpp
+++ b/include/exiv2/basicio.hpp
@@ -517,6 +517,8 @@ namespace Exiv2 {
         //@}

     private:
+        bool   _bSizeKnown;
+        size_t _size    ;
         // NOT IMPLEMENTED
         //! Copy constructor
         FileIo(FileIo& rhs);
diff --git a/src/basicio.cpp b/src/basicio.cpp
index 97137b94..6cefc334 100644
--- a/src/basicio.cpp
+++ b/src/basicio.cpp
@@ -357,12 +357,14 @@ namespace Exiv2 {

     FileIo::FileIo(const std::string& path)
         : p_(new Impl(path))
+        , _bSizeKnown(false)
     {
     }

 #ifdef EXV_UNICODE_PATH
     FileIo::FileIo(const std::wstring& wpath)
         : p_(new Impl(wpath))
+        , _bSizeKnown(false)
     {
     }

@@ -575,6 +577,7 @@ namespace Exiv2 {

     long FileIo::write(const byte* data, long wcount)
     {
+        _bSizeKnown = false;
         assert(p_->fp_ != 0);
         if (p_->switchMode(Impl::opWrite) != 0) return 0;
         return (long)std::fwrite(data, 1, wcount, p_->fp_);
@@ -582,6 +585,7 @@ namespace Exiv2 {

     long FileIo::write(BasicIo& src)
     {
+        _bSizeKnown = false;
         assert(p_->fp_ != 0);
         if (static_cast<BasicIo*>(this) == &src) return 0;
         if (!src.isopen()) return 0;
@@ -605,6 +609,7 @@ namespace Exiv2 {

     void FileIo::transfer(BasicIo& src)
     {
+        _bSizeKnown = false;
         const bool wasOpen = (p_->fp_ != 0);
         const std::string lastMode(p_->openMode_);

@@ -928,6 +933,8 @@ namespace Exiv2 {

     size_t FileIo::size() const
     {
+        if ( _bSizeKnown ) return _size;
+
         // Flush and commit only if the file is open for writing
         if (p_->fp_ != 0 && (p_->openMode_[0] != 'r' || p_->openMode_[1] == '+')) {
             std::fflush(p_->fp_);
@@ -965,6 +972,8 @@ namespace Exiv2 {
             p_->fp_ = ::fopen(path().c_str(), mode.c_str());
         }
         if (!p_->fp_) return 1;
+        _size = size();
+        _bSizeKnown = true;
         return 0;
     }
jmberg commented 5 years ago

However, I don't want to modify this code. We're at Exiv2 v0.27 RC1 and we're focused on "must fix" and I feel this is "nice to have". And while I've quickly created a patch, our test suite is failing all over the place. This is a "proof of concept" and will need more work for which I don't have time.

No disagreement with this, obviously. I'm surprised it fails the test suite, does it modify files in place?

Cutting the stat() calls from 7000+ to 2+ improves performance by 50% for your pathological file.

That roughly matches my observation.

However on a JPEG, it cuts stat() from 7 to 2 and doesn't buy much:

As you would expect :-)

[snip]
>      long FileIo::write(const byte* data, long wcount)
>      {
> +        _bSizeKnown = false;

Good one, I missed this entirely in my patch, but only tested read, not write.

>      size_t FileIo::size() const
>      {
> +        if ( _bSizeKnown ) return _size;
> +
>          // Flush and commit only if the file is open for writing
>          if (p_->fp_ != 0 && (p_->openMode_[0] != 'r' || p_->openMode_[1] == '+')) {
>              std::fflush(p_->fp_);

I actually just put the code here, basically caching the result after the first size() call. Not sure if it makes a difference, but it felt easier and cleaner to me (and avoids the stat call in case you never need it, but I guess that won't matter). Something like

if (!_bSizeKnown) {
  ... existing code except set _size instead of returning it
}
return _size;
piponazo commented 5 years ago

I'm hoping to close all open issues on GitHub and Redmine for v0.27 GM. Redmine will be relocated to exiv2.dyndns.org read-only.

@clanmills this is simply impossible. There are many open issues that are not targeted for v0.27 and we should not close them. For example, I could not follow the conversation going on here because it was closed and I did not see it (Normally I just check the open issues ...).

The conversation going on is interesting, and I'm sure we would like to improve the situation for the future Exiv2 versions. I propose to reopen the issue so it does not get forgotten in the closed issues.

clanmills commented 5 years ago

Ah yes. However we're going to defer issues.

With Redmine, the UI allows you to set a target version for every issue. So my usual procedure is to deal with everything that I can and defer stuff that cannot be fixed for the release.

I believe we're quite close to v0.27. Most of the open issues are [Feature Request]s and other matters that cannot be closed for v0.27. I agree that there is something quite interesting being discussed here, but several matters are clear:

1) This is a "nice to have, one day" issue.
2) We're not going to delay v0.27 to deal with this.
3) The code in question has been in service for years and shouldn't be modified so close to release.

I'd be happy to re-open this issue if it can be labelled "[Future Enhancement]" or something like that.

piponazo commented 5 years ago

Sure, we just should not assign it to the project 0.27.

clanmills commented 5 years ago

I haven't studied GitHub enough to know how it organises projects and versions. Can you assign this (and some other open issues) to the 0.28 project, please?

cgilles commented 5 years ago

I'm back here with first results.

I completed the first stage of my unit-test code to run Exiv2 through the digiKam API.

On my computer:

Note that Exiv2 0.26 is compiled with the XMP (internal), PNG, LENSDATA, NLS, UCS2, and SHARED cmake options; all others are disabled.

The photo collection is located in /mnt/data/photos:

mount => /dev/sde1 on /mnt/data type ext4 (rw,relatime,data=ordered)
du    => 288 GB

Now the unit test. The code is here:

https://cgit.kde.org/digikam.git/tree/core/tests/metadataengine/metareaderthreadtest.cpp

Now, the collection contents are:

Images: JP2: 30, JPG: 14945, KRA: 2, PGF: 23, PNG: 701, PPM: 10, PSD: 16, RAW-ARW: 2364, RAW-CR2: 166, RAW-CRW: 15, RAW-DCR: 3, RAW-DNG: 131, RAW-MRW: 60, RAW-NEF: 54, RAW-ORF: 8, RAW-PEF: 17, RAW-RAF: 517, RAW-RAW: 7, RAW-RW2: 15, RAW-RWL: 2, RAW-X3F: 5, TIFF: 163, XCF: 5; total: 19259
Videos: 3GP: 2, AVI: 662, MKV: 1754, MOV: 19, MP4: 59, MPEG: 5, MTS: 1, WMF: 1; total: 2503
Audio: MP3: 1; total: 1
Total Items: 21763, Albums: 1358, Path: /mnt/data/photos/

After 5 minutes of processing (60% done), it crashes on a PNG image:

```
Thread 12 (Thread 0x7f156affd700 (LWP 29075)):

#0  0x00007f159ee2b89d in read () at /usr/lib64/qt5/../libc.so.6
#1  0x00007f159edb938b in __GI__IO_file_seekoff () at /usr/lib64/qt5/../libc.so.6
#2  0x00007f159edb6c69 in fseek () at /usr/lib64/qt5/../libc.so.6
#3  0x00007f15a73d7af3 in Exiv2::PngImage::readMetadata() () at /usr/lib64/qt5/../libexiv2.so.26
#4  0x00007f15adc26de3 in Digikam::MetaEngine::load(QString const&) (this=0x7f156affcb40, filePath=...) at /mnt/devel/GIT/6.x/core/libs/metadataengine/engine/metaengine_fileio.cpp:103
        image = {_M_ptr = 0x7f15580214e0}
        hasLoaded = false
        __PRETTY_FUNCTION__ = "virtual bool Digikam::MetaEngine::load(const QString&)"
#5  0x00007f15adc6415d in Digikam::DMetadata::load(QString const&) (this=0x7f156affcb40, filePath=...) at /mnt/devel/GIT/6.x/core/libs/metadataengine/dmetadata/dmetadata_fileio.cpp:56
        lock = {d = 0x7f15581397f0}
#6  0x000000000040f598 in Mytask::run() (this=0x188b600) at /mnt/devel/GIT/6.x/core/tests/metadataengine/metareaderthreadtest.cpp:62
        __PRETTY_FUNCTION__ = "virtual void Mytask::run()"
        meta = {<Digikam::MetaEngine> = {_vptr.MetaEngine = 0x7f15aeb5bef0 <vtable for Digikam::DMetadata+16>, d = 0x7f1558021480}, <No data fields>}
#7  0x00007f159fa37447 in QThreadPoolThread::run() () at /usr/lib64/qt5/../libQt5Core.so.5
#8  0x00007f159fa3aeea in QThreadPrivate::start(void*) () at /usr/lib64/qt5/../libQt5Core.so.5
#9  0x00007f15ac17f66d in start_thread () at /usr/lib64/qt5/../libpthread.so.0
#10 0x00007f159ee3be0d in clone () at /usr/lib64/qt5/../libc.so.6
```

So, I need to use Exiv2

If 60% is done in a few minutes, the performance is not too bad after all. The amount of memory allocated is stable at 3.2 GB (whole process).

https://www.flickr.com/photos/digikam/45768905291/in/dateposted/

So the next stage: use Exiv2 0.27 from git/master, in exactly the same context.

Gilles Caulier

clanmills commented 5 years ago

Gilles:

May I ask you to open an issue, and we'll discuss it after we've completed v0.27? Luis and I are putting in long days (12+ hours/day) to get to v0.27.

We're talking about several different matters: 1) Johannes asked "why so many calls to stat()?". I've answered, and he's satisfied that we can close this issue.

2) I agreed with Luis to reopen this because the subject of caching FileIo::size() is worth considering in the future. This is really another matter; I'm going to open a new issue about that and close this one.

3) I think you're talking about something concerning your 288 GB of test data. Can you discuss this in a different thread, please?

Robin

cgilles commented 5 years ago

No. It looks like there is also a problem with PNG files with 0.27. This bug entry is about PNG, so my comments are valid: with 0.27 I can also get this trace:

```
(gdb) Thread 13 (Thread 0x7fd3e6ffd700 (LWP 17081)):

#0  0x00007fd41b88689d in read () at /usr/lib64/qt5/../libc.so.6
#1  0x00007fd41b81438b in __GI__IO_file_seekoff () at /usr/lib64/qt5/../libc.so.6
#2  0x00007fd41b811c69 in fseek () at /usr/lib64/qt5/../libc.so.6
#3  0x00007fd423d2a526 in Exiv2::FileIo::seek(long, Exiv2::BasicIo::Position) () at /usr/lib64/qt5/../libexiv2.so.27
#4  0x00007fd423dce798 in Exiv2::PngImage::readMetadata() () at /usr/lib64/qt5/../libexiv2.so.27
#5  0x00007fd42a6cfe9d in Digikam::MetaEngine::load(QString const&) (this=0x7fd3e6ffcb40, filePath=...) at /mnt/devel/GIT/6.x/core/libs/metadataengine/engine/metaengine_fileio.cpp:103
        image = {_M_ptr = 0x7fd3e00a8890}
        hasLoaded = false
        __PRETTY_FUNCTION__ = "virtual bool Digikam::MetaEngine::load(const QString&)"
#6  0x00007fd42a70d1bb in Digikam::DMetadata::load(QString const&) (this=0x7fd3e6ffcb40, filePath=...) at /mnt/devel/GIT/6.x/core/libs/metadataengine/dmetadata/dmetadata_fileio.cpp:56
        lock = {d = 0x7fd3e00bb170}
#7  0x000000000040f598 in Mytask::run() (this=0x18d2760) at /mnt/devel/GIT/6.x/core/tests/metadataengine/metareaderthreadtest.cpp:62
        __PRETTY_FUNCTION__ = "virtual void Mytask::run()"
        meta = {<Digikam::MetaEngine> = {_vptr.MetaEngine = 0x7fd42b604ef0 <vtable for Digikam::DMetadata+16>, d = 0x7fd3e0092dc0}, <No data fields>}
#8  0x00007fd41c492447 in QThreadPoolThread::run() () at /usr/lib64/qt5/../libQt5Core.so.5
#9  0x00007fd41c495eea in QThreadPrivate::start(void*) () at /usr/lib64/qt5/../libQt5Core.so.5
#10 0x00007fd428c2866d in start_thread () at /usr/lib64/qt5/../libpthread.so.0
#11 0x00007fd41b896e0d in clone () at /usr/lib64/qt5/../libc.so.6
```

My stress test clearly shows a fragility in the PNG parser. I need to find which file is responsible in my huge collection.

My test also checks the performance of the Exiv2 implementation under stress, as in digiKam with multi-core support.

So let me identify the PNG file first before concluding...

Gilles Caulier

jmberg commented 5 years ago

> No. look like with 0.27, there is also a problem with PNG file. This bug entry is about PNG

Well, but this bug isn't about all possible problems with PNG :-)

Really, you should open a new issue, because adding everything to the same issue leads only to total confusion for everyone involved.

cgilles commented 5 years ago

OK, here is my result: in fact, there is no corrupted image.

I reduced the set of files to process to PNG only, and I isolated a path where a lot of PNG files are stored; the crash is not tied to a specific file.

In fact, the stress test indicates a degradation somewhere in the code. Too many files processed at the same time destabilise the PNG parser.

If I process JPG + PNG, it crashes more quickly. If I process JPG + PNG + TIFF, it crashes even more quickly. If I remove the PNG files from the run, it does not crash!

To conclude: the PNG parser is the problem. So the original topic of this report is valid...

Gilles Caulier


jmberg commented 5 years ago

> So the original topic from this report is valid...

No ... the original topic was that we call stat() (too) much. This isn't related. Maybe there's a leak, maybe there's a bug, maybe there's something else - but it's certainly a different issue.

cgilles commented 5 years ago

Not sure; opening too many PNG files at the same time bloats the system. Look at the crash trace: it's inside glibc. So it's a low-level call, probably stat()...

Performance results without PNG:

```
MetaReader have been completed:
Processing duration: 124.997 seconds
Root path       : "/mnt/data/photos/"
Number of files : 33717
Direction       : "Read from files"
Type-mimes      : (".bmp", ".bw", ".cur", ".eps", ".epsf", ".epsi", ".exr", ".gif", ".icns", ".ico", ".kra", ".ora", ".pbm", ".pcx", ".pgm", ".pic", "", ".ppm", ".psd", ".ras", ".rgb", ".rgba", ".sgi", ".svg", ".svgz", ".tga", ".wbmp", ".webp", ".xbm", ".xcf", ".xpm", ".tiff", ".tif", ".jpg", ".jpeg", ".jpe", ".pgf", ".bay", ".bmq", ".cr2", ".crw", ".cs1", ".dc2", ".dcr", ".dng", ".erf", ".fff", ".hdr", ".k25", ".kdc", ".mdc", ".mos", ".mrw", ".nef", ".orf", ".pef", ".pxn", ".raf", ".raw", ".rdc", ".sr2", ".srf", ".x3f", ".arw", ".3fr", ".cine", ".ia", ".kc2", ".mef", ".nrw", ".qtk", ".rw2", ".sti", ".rwl", ".srw", "")
Engine settings : [MetaEngineSettingsContainer] exifRotate(true), exifSetOrientation(true), saveComments(false), saveDateTime(false), savePickLabel(false), saveColorLabel(false), saveRating(false), saveTemplate(false), saveTags(false), saveFaceTags(false), writeRawFiles(false), updateFileTimeStamp(true), rescanImageIfModified(false), clearMetadataIfRescan(false), useXMPSidecar4Reading(false), useLazySync(false), metadataWritingMode(1), rotationBehavior(QFlags(0x1|0x2|0x4)), sidecarExtensions(())

PASS   : MetaReaderThreadTest::testMetaReaderThread()
RESULT : MetaReaderThreadTest::testMetaReaderThread():
         124,956 msecs per iteration (total: 124,956, iterations: 1)
PASS   : MetaReaderThreadTest::cleanupTestCase()
Totals: 3 passed, 0 failed, 0 skipped, 0 blacklisted, 273242ms
Finished testing of MetaReaderThreadTest
```

124 seconds to parse 33717 files (roughly 270 files/second): it's not too bad...

Gilles

piponazo commented 5 years ago

Thanks, guys, for all the feedback. For me, these kinds of issues are really interesting to investigate. I might invest some time in this topic during the next days, in the context of the new issue created by @clanmills (#530).

clanmills commented 5 years ago

This matter is closed. I'm locking this conversation because it is not helping us to reach Exiv2 v0.27.

I've opened a new issue, #530, concerning caching the file size, and I would appreciate it being left until 2019 for attention. Likewise, we appear to have started a discussion about PNG files, which I am asking you to defer.