Closed jmberg closed 5 years ago
moved from https://bugs.kde.org/show_bug.cgi?id=400604 (digikam) moved from http://dev.exiv2.org/issues/1374
Groan. I've no idea why this is true. I'm putting in 12 hour days at the moment to get to v0.27 for Christmas.
Here's what I see on my favourite test file (http://clanmills.com/Stonehenge.jpg) which I've converted to Stonehenge.png
rmills@rmillmbp-ubuntu:~/gnu/github/exiv2/exiv2/build$ strace -e stat exiv2 ~/Stonehenge.jpg 2>&1 | wc
20 175 1767
rmills@rmillmbp-ubuntu:~/gnu/github/exiv2/exiv2/build$ strace -e stat exiv2 /Home/Stonehenge.png 2>&1 | wc
18 167 1636
rmills@rmillmbp-ubuntu:~/gnu/github/exiv2/exiv2/build$
I'm surprised that basicio.cpp calls stat() more than once. However, 20/18 system calls doesn't seem excessive. When I add the filter | grep fstat, I see 17/16 calls to stat().
I've looked at this. BasicIo::eof() calls ::stat() to get the file size and compares it to the current file position. It could alternatively call feof(). However, the semantics of BasicIo::eof() are "have you read all the data?", and that's slightly different from ::feof(), which does not return true until you try to read past the end-of-file.
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, const char* argv[])
{
    if (argc != 2) {
        const char* prog = argv[0];
        printf("usage: %s path\n", prog);
    } else {
        const char* path = argv[1];
        struct stat st;
        if (::stat(path, &st) == 0) {
            printf("size = %lld\n", (long long)st.st_size);
            char* buffer = (char*)malloc(st.st_size);
            if (buffer) {
                FILE* f = fopen(path, "r");
                if (f) {
                    size_t n = fread(buffer, 1, st.st_size, f);
                    printf("read %ld bytes\n", (long)n);
                    printf("feof(f) = %d ftell(f) = %ld\n", feof(f), ftell(f));
                    fread(buffer, 1, 1, f);
                    printf("feof(f) = %d ftell(f) = %ld\n", feof(f), ftell(f));
                } else {
                    printf("%s did not open\n", path);
                }
            }
        } else {
            printf("%s does not exist\n", path);
        }
    }
    return 0;
}
553 rmills@rmillsmbp:~/temp $ ls -l eof.cpp ; make -B eof 2>/dev/null ; eof eof.cpp
-rw-r--r--@ 1 rmills staff 1027 3 Nov 10:52 eof.cpp
c++ eof.cpp -o eof
size = 1027
read 1027 bytes
feof(f) = 0 ftell(f) = 1027
feof(f) = 1 ftell(f) = 1027
554 rmills@rmillsmbp:~/temp $
You haven't provided evidence that this is causing a problem, so I'm very reluctant to change the code in this area. You're welcome to persuade me that this deserves attention.
Robin,
All optimizations in this area are very important for the future. In digiKam, when you manage 100,000 images through an SQLite database, or 200,000 and more with a MySQL/MariaDB database, it's clear: there is a bottleneck in Exiv2.
This is more problematic when image collections are on a shared device over the local network.
In digiKam, the process to register all image information in the database is simple:
1/ We scan the collections with Exiv2 through separate threads (read metadata).
2/ We make blobs to commit the data to the database. We don't register all metadata, only the most important ones used to perform complex searches.
3/ When users make changes in the database, a synchronization is performed with the image metadata. This is also done through separate threads (write metadata).
1/ and 2/ are problematic due to the large number of system calls made by Exiv2.
So profiling this part of Exiv2 is the way to go in the case of a large collection of files to manage.
Best Gilles Caulier
Thank you for the background, Gilles. However that doesn't change my question "what's the problem?".
As you know, Gilles, the three active members of Team Exiv2 (myself, Luis and Dan) respond quickly and effectively to your requests (and to all our users). However we have to prioritise between issues that must be fixed (such as security) and nice-to-haves. Do you consider this something that must be fixed?
I don't agree that Exiv2 is slow. We read hundreds of images per second. Is it possible that digiKam is reading metadata excessively? Would it be possible to store the file stamp (or md5sum) in your database and only read the image when it has been modified?
And, as you know, I find interactions with you to be stressful as everything is "you must fix this now". The Team are all volunteers. I know that you are also a volunteer with digiKam so you're familiar with the pressures of this hobby. Perhaps you could make our interactions less stressful.
Umm, I feel I've gotten caught between fronts here :-)
TBH, I'm not really sure it's that much of a problem. I was trying to answer the question "Why does digikam take hours to rescan my database?", and the excessive stat() calls seemed like a potential issue.
In truth, the more likely reason is that the whole PNG file has to be read to find (or rule out) any metadata. So in all likelihood I was blaming the wrong thing (stat() for every read()) when the real problem was reading the whole PNG files to start with. Mind you, I probably have hundreds of similar files, so I'd have shipped dozens if not hundreds of GBs of data across my network doing this ...
For now, I'd definitely classify this as "would be nice to fix", at least so that it doesn't get blamed for performance issues erroneously. However, it's also true that there's a cost to syscalls, and contrary to history that cost has been going up recently with security mitigations, rather than down as in the past. So I'd even argue that for performance it would be good to read (PNG) files in larger than 8k chunks. I'd guess you could only see this difference on fast local drives though; my overall performance was definitely network-bound.
What a nice and thoughtful message. Thanks.
Perhaps digiKam could create sidecar files with all of the metadata. The sidecar could contain the date stamps (or md5sums) of the file and would avoid compulsively re-reading every image. The sidecars could be on local storage. Calculating md5sum presumably requires reading the whole file, so perhaps that's not helpful for your situation.
I haven't thought about the format of PNG files for some time. As you've said, it is a linked list of chunks. There's no index to enable the reader to quickly locate items of interest, or to know that there is no metadata.
Can you send me one of your files, please. You've motivated me to look further into this.
For sure, I need to instrument something to know if changes are improving performance. Here are things that I can try:
1) Modify BasicIo::eof() to use ::feof() and, if it's false, do tell()==size().
2) Modify the BasicIo object to store the size if we don't write to the object.
3) Look at the png chunk scanner to see if it dies after it has found the Exif chunk (and ICC and XMP and IPTC). If you have lots of files with NO metadata, optimisation won't help. We have to read the whole file to realise there is NO metadata.
4) In investigating this, I'll have more and other ideas.
Team Exiv2 is busy working to achieve Exiv2 v0.27 by the end of 2018. I'm happy to have a little break today from cross-platform build/test/buildserver work to look at your issue.
I've had another three ideas!
1) Use a .exv sidecar
If digiKam stored a metadata .exv file beside the image, that would probably fix it. So if you have a file DSC_1234.png, the file DSC_1234.exv can provide the metadata. The code to create the sidecar already exists.
577 rmills@rmillsmbp:~ $ exiv2 -ea --verbose --force Stonehenge.png
File 1/1: Stonehenge.png
Writing XMP data from Stonehenge.png to ./Stonehenge.exv
578 rmills@rmillsmbp:~ $ ls -l Stonehenge*
-rw-r--r-- 1 rmills staff 45814 3 Nov 12:48 Stonehenge.exv
-rw-r--r--@ 1 rmills staff 6798775 4 Aug 22:34 Stonehenge.jpg
-rw-r--r--@ 1 rmills staff 31735186 3 Nov 09:12 Stonehenge.png
579 rmills@rmillsmbp:~ $
2) Remote Io
Exiv2 supports a RemoteIo class. The derived class SshIo uses ssh. That would perform the I/O on the ssh server and thereby avoid excessive network activity.
I don't believe anybody uses SshIo because I've never had a single bug report or question about it! RemoteIo is an abstract class and its principal derived class is HttpIo. I believe HttpIo is used a lot; the other classes (FtpIo, SftpIo and others) were implemented in the web-ready project.
Of course, digiKam would require GUI changes to enable you to configure appropriately. We'd need to know the mapping between NFS <-> ssh or NFS <-> http and store public keys and so on.
These classes are efficient. They use random I/O to seek and read. In the case of SshIo, it uses the command dd on the server to read byte-ranges. However, for your scenario with metadata-free PNGs, we're probably still in trouble as we have to read the whole file!
3) Use a web service
The specification for web-ready included a Metadata Web Service. We didn't implement this as we didn't have enough time. The idea is to get the metadata via http(s) and all the I/O is performed on the server. This doesn't exist. web-ready was done in 2013. I hoped to continue the project in 2014, however I had other matters in my life that demanded my time.
Robin,
"Perhaps digiKam could create sidecar files with all of the metadata. The sidecar could contain the date stamps (or md5sums) of the file and would avoid compulsively re-reading every image. The sidecars could be on local storage. Calculating md5sum presumably requires reading the whole file, so perhaps that's not helpful for your situation."
==> Definitely no. That is the goal of the database. You cannot perform searches over a huge collection with sidecars without waiting a few days. The database is fast and can be centralized to host metadata for an agency: one server and many clients connected over the local network.
About sidecars: of course digiKam is able to use these files, but it's optional. Some users want to not touch the original images, some others want to host all changes in the images, some others both. So the use cases are complex and inherited from digiKam's long experience (the project was started in 2001). Just take a look at the digiKam metadata settings panel to understand what I mean:
https://www.flickr.com/photos/digikam/31824318968 https://www.flickr.com/photos/digikam/44782449315 https://www.flickr.com/photos/digikam/31824318208 https://www.flickr.com/photos/digikam/44782448685 https://www.flickr.com/photos/digikam/31824316898
I recommend trying the application. We have standalone bundles for you. Nothing is installed on your computer:
https://files.kde.org/digikam/
Best
Gilles Caulier
About remote IO and a web service to share metadata: all of this is already performed by the database. MariaDB or MySQL can already be accessed through the network.
In digiKam we share more than image metadata:
See below the digiKam metadata and collection setup panels for details:
https://www.flickr.com/photos/digikam/30756263757 https://www.flickr.com/photos/digikam/44972055714 https://www.flickr.com/photos/digikam/31824521618 https://www.flickr.com/photos/digikam/44972055014 https://www.flickr.com/photos/digikam/31824520608
So, we need a real database. All serious applications for photo management use a database, as Apple, Adobe, etc. do.
Gilles Caulier
I agree with your design decision to use a database.
@jmberg is asking the question: "Why does digikam take hours to rescan my database?" I'm suggesting that keeping a sidecar beside the image would speed up the rescan process.
Robin: the scan process is done when importing images from the camera. There is no XMP sidecar available. Sure, we can generate the file at the import stage, but this will again increase the scan processing time.
As I already said, sidecar support is optional through the metadata settings panel. So in the case of users who want to store more data in the image collection than the real items, the problem is still here, and this use case needs to be integrated into future profiling improvements.
Gilles Caulier
Robin,
I'm currently rewriting a lot of the test tools present in digiKam core as real Qt unit tests, where all performance and profiling information will be available de facto. The Qt::QTest framework is really great and impressive:
http://doc.qt.io/qt-5/qtest-overview.html
One of my unit tests is a multi-threaded metadata reader/writer that runs over a huge collection of images. The tool has many goals:
1/ Check the stability of Exiv2 under heavy processing.
2/ Check the stability of Exiv2 with multi-threading and multi-core support.
3/ Detect the bottlenecks while reading or writing image metadata.
https://cgit.kde.org/digikam.git/tree/core/tests/metadataengine/metareaderthread.cpp
In this test tool all database stages are removed. It uses the digiKam core multi-threaded API based on Qt (multi-platform stuff). As a small side story, I also use this API in my office to process real-time / fast-acquisition measurements with process control for a nuclear reactor. The code rocks and is very stable over time (in fact, it never stops).
I will port this code as a unit test and I will be able to give you real processing-time values for Exiv2 in multiple use cases.
Best
Gilles Caulier
@cgilles Thanks very much. I'm sure you have some good ideas and technology here. Team Exiv2 will have a meeting in December after we release v0.27 to discuss projects for 2019. You are welcome to join us, or to make a project request. I know you are interested in stronger multi-core, multi-thread testing for Exiv2.
There are several major subjects already in focus:
1) Modernising the code and dropping support for C++98.
2) BigTiff and CR3 Support.
3) Completing the transition of our test suite from bash to python.
4) Enhanced lens recognition.
Will the team agree to your request? I don't know. I personally will not contribute code to Exiv2 in future. In 2019 and 2020, I'll do user support, release engineering, buildserver, documentation and web-site. We intend to provide regular "dot" releases to v0.27 for the next couple of years. I'll be 70 on January 18, 2021 and will finally retire!
@jmberg I implemented caching the file size to cut calls to stat() to one per file (when reading). I couldn't measure the performance difference on our archive of 7000 test images stored on our buildserver (14GB). Exiv2 reads all the metadata in 3.5 seconds (2000 images/second). When I do this from my laptop (with the buildserver mounted via SMB), it takes 183 seconds: 60x slower. In JPEG files, the metadata is usually stored in the first 100k, and because Exiv2 uses random I/O, the network I/O is much less than 14GB (using nettop, I believe it's 120MB).
I doubt there's a "quick code fix" for your question "Why does digikam take hours to rescan my database?". I feel this has to be addressed in the product architecture, and I have proposed possible solutions.
@jmberg May I ask you to close this topic as further action doesn't seem possible.
"May I ask you to close this topic as further action doesn't seem possible."
Robin: please let's keep this open until I give you profiling measurements with my multi-threaded unit test. Depending on the results, this will guide you to the right way to improve some Exiv2 code later...
Gilles
@cgilles Please open a new issue "Multi-threaded testing and profiling" and mark it as [Enhancement Request]. We'll keep that open and discuss it in December. I'm hoping to close all open issues on GitHub and Redmine for v0.27 GM. Redmine will be relocated to exiv2.dyndns.org read-only.
While I agree that your thoughts about testing and profiling are related to this issue, you are proposing a much larger initiative. The current subject has been investigated and discussed and therefore this matter is closed.
Sorry, I wasn't around, saw the mails flying by but was out playing with the kids - now they're all asleep ... I definitely have more time to go until retirement ;-)
FWIW, I fully agree with closing this, I'd support closing it even after investigation without any code changes. My only goal was for you to be aware of this behavior in case you'd have wanted to fix it. As it turns out you have, but as I tried to say before, I'd be (have been) equally supportive of closing this as "it doesn't matter".
@clanmills, I'm not sure I see your change anywhere? Just curious about it, and perhaps also curious about looking at further PNG improvements. I don't have that much experience with C++, but I feel I'd be able to deal with it.
FWIW, regarding the original question of why digikam is so slow to rescan... I feel that I should elaborate a little, perhaps for the sake of posterity as well.
First of all, I think the whole thing probably happened to start with because I changed my username when changing computers, and thus my files appeared all moved around. I'm just glad that digikam actually seems to have preserved my database pretty faithfully across this.
Secondly, I think the reason for it being so slow is what I said above: basically the need to ship a humongous amount of data across the network, given my particular setup. Yes, the paths still changed, because I use a symlink in my home to access the mounted share.
Finally, the need to read a humongous amount of data (and in my case thus ship it across the network) is evidently inherent in the PNG spec. Support for the eXIf chunk (http://ftp-osl.osuosl.org/pub/libpng/documents/pngext-1.5.0.html#C.eXIf) isn't in exiv2 (master branch) yet as far as I can tell, and supporting just that clearly wouldn't be sufficient either since old files still need to be supported. Therefore, reading the whole file seems to unfortunately be necessary for any PNG file. It's just bad for me that I happen to have ~1500 PNG files with ~166GB in my library, out of ~50k files with a total of 375GB... As you said: "We have to read the whole file to realise there is NO metadata."
FWIW, also note that creating a sidecar file with an md5sum is pointless, to actually calculate the md5sum you have to read the whole file. The better fix would've been to not have to re-read the whole library, but like I described above that was likely a one-time event caused by the username change.
Since you asked - I've uploaded a sample file (was far too big for github) to https://johannes.sipsolutions.net/files/sample-big-ruegen.png (~117MB). There's nothing really special about it though, it's just the raw data off my film scanner. Feel free to use it for any purpose you see fit, including publishing (e.g. in a set of test images).
I've also played with adding a cache to FileIo::size() myself, and that does seem to improve the speed by a factor of >2x (~200ms -> ~60ms) for me on an NFS network filesystem (with the data in local cache), perhaps for stat() NFS has to go talk to the server? I very much doubt this would really have helped my original case though, since of course there I had to copy all the data and that was the real issue.
Another thing I just tried was making the I/O unbuffered (std::setvbuf(fp_, NULL, _IONBF, 0); and std::setvbuf(p_->fp_, NULL, _IONBF, 0);) in order to avoid having glibc read 4k/8k of data for each chunk, but that also doesn't really seem to have made any difference. I thought that might help since the IDAT chunks are 32K and thus we only need to read 8 bytes to skip them rather than reading 4k/8k, but I guess NFS wanted to pull down a full page or so anyway.
Anyway, as far as I'm concerned there's no real bug here, it's just inherently slow.
@jmberg Thank you for all the feedback here. I'm glad that you're happy to close this.
To make progress requires a change in the architecture of the application.
1) Drastically reduce the network I/O by reading the sidecar.
I don't see any reason why digiKam can't have an option to keep a sidecar parallel to the image. If it's younger than the image, read the metadata from the sidecar. Exiv2 can extract a .exv file. I'm not sure that Gilles has understood my suggestion.
2) Use a web-service to fetch the metadata from the image. The reading of the image would be performed on the server, not in digiKam! Perhaps one day somebody will volunteer for this project.
Anyway. Thanks very much for discussing this. It's all very interesting. It all takes time!
I didn't submit the code with size caching. I played with it locally on my machine and then removed it.
@jmberg Thanks for your test file.
I've played with that and the size/cache patch. I agree, this does some good.
However, I don't want to modify this code. We're at Exiv2 v0.27 RC1 and we're focused on "must fix" and I feel this is "nice to have". And while I've quickly created a patch, our test suite is failing all over the place. This is a "proof of concept" and will need more work for which I don't have time.
The results are interesting:
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace exiv2 -pa --grep datetimeoriginal/i ~/sample-big-ruegen.png 2>&1 | grep ^stat | grep sample | wc
7524 45144 714780
real 0m1.593s
user 0m0.397s
sys 0m1.498s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace bin/exiv2 -pa --grep datetimeoriginal/i ~/sample-big-ruegen.png 2>&1 | grep ^stat | grep sample | wc
2 12 190
real 0m0.901s
user 0m0.140s
sys 0m0.942s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$
Cutting the stat() calls from 7000+ to 2+ improves performance by 50% for your pathological file.
However on a JPEG, it cuts stat() from 7 to 2 and doesn't buy much:
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace bin/exiv2 -pa --grep datetimeoriginal/i ~/Stonehenge.jpg 2>&1 | grep ^stat | grep Ston | wc
2 12 172
real 0m0.036s
user 0m0.013s
sys 0m0.025s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$ time strace exiv2 -pa --grep datetimeoriginal/i ~/Stonehenge.jpg 2>&1 | grep ^stat | grep Ston | wc
7 42 602
real 0m0.033s
user 0m0.013s
sys 0m0.022s
rmills@rmillsmm-ubuntu:~/gnu/github/exiv2/exiv2/build$
diff --git a/include/exiv2/basicio.hpp b/include/exiv2/basicio.hpp
index df3707af..7095d7de 100644
--- a/include/exiv2/basicio.hpp
+++ b/include/exiv2/basicio.hpp
@@ -517,6 +517,8 @@ namespace Exiv2 {
//@}
private:
+ bool _bSizeKnown;
+ size_t _size ;
// NOT IMPLEMENTED
//! Copy constructor
FileIo(FileIo& rhs);
diff --git a/src/basicio.cpp b/src/basicio.cpp
index 97137b94..6cefc334 100644
--- a/src/basicio.cpp
+++ b/src/basicio.cpp
@@ -357,12 +357,14 @@ namespace Exiv2 {
FileIo::FileIo(const std::string& path)
: p_(new Impl(path))
+ , _bSizeKnown(false)
{
}
#ifdef EXV_UNICODE_PATH
FileIo::FileIo(const std::wstring& wpath)
: p_(new Impl(wpath))
+ , _bSizeKnown(false)
{
}
@@ -575,6 +577,7 @@ namespace Exiv2 {
long FileIo::write(const byte* data, long wcount)
{
+ _bSizeKnown = false;
assert(p_->fp_ != 0);
if (p_->switchMode(Impl::opWrite) != 0) return 0;
return (long)std::fwrite(data, 1, wcount, p_->fp_);
@@ -582,6 +585,7 @@ namespace Exiv2 {
long FileIo::write(BasicIo& src)
{
+ _bSizeKnown = false;
assert(p_->fp_ != 0);
if (static_cast<BasicIo*>(this) == &src) return 0;
if (!src.isopen()) return 0;
@@ -605,6 +609,7 @@ namespace Exiv2 {
void FileIo::transfer(BasicIo& src)
{
+ _bSizeKnown = false;
const bool wasOpen = (p_->fp_ != 0);
const std::string lastMode(p_->openMode_);
@@ -928,6 +933,8 @@ namespace Exiv2 {
size_t FileIo::size() const
{
+ if ( _bSizeKnown ) return _size;
+
// Flush and commit only if the file is open for writing
if (p_->fp_ != 0 && (p_->openMode_[0] != 'r' || p_->openMode_[1] == '+')) {
std::fflush(p_->fp_);
@@ -965,6 +972,8 @@ namespace Exiv2 {
p_->fp_ = ::fopen(path().c_str(), mode.c_str());
}
if (!p_->fp_) return 1;
+ _size = size();
+ _bSizeKnown = true;
return 0;
}
> However, I don't want to modify this code. We're at Exiv2 v0.27 RC1 and we're focused on "must fix" and I feel this is "nice to have". And while I've quickly created a patch, our test suite is failing all over the place. This is a "proof of concept" and will need more work for which I don't have time.
No disagreement with this, obviously. I'm surprised it fails the test suite; does it modify files in place?
> Cutting the stat() calls from 7000+ to 2+ improves performance by 50% for your pathological file.
That roughly matches my observation.
> However on a JPEG, it cuts stat() from 7 to 2 and doesn't buy much:
As you would expect :-)
[snip]
> long FileIo::write(const byte* data, long wcount)
> {
> + _bSizeKnown = false;
Good one, I missed this entirely in my patch, but only tested read, not write.
> size_t FileIo::size() const
> {
> + if ( _bSizeKnown ) return _size;
> +
> // Flush and commit only if the file is open for writing
> if (p_->fp_ != 0 && (p_->openMode_[0] != 'r' || p_->openMode_[1] == '+')) {
> std::fflush(p_->fp_);
I actually just put the code here, basically caching the result after the first size() call. Not sure if it makes a difference, but it felt easier and cleaner to me (and avoids the stat call in case you never need it, but I guess that won't matter). Something like
if (!_bSizeKnown) {
    ... existing code, except set _size instead of returning it
}
return _size;
I'm hoping to close all open issues on GitHub and Redmine for v0.27 GM. Redmine will be relocated to exiv2.dyndns.org read-only.
@clanmills this is simply impossible. There are many open issues that are not targeted for v0.27 and we should not close them. For example, I could not follow the conversation going on here because it was closed and I did not see it (Normally I just check the open issues ...).
The conversation going on is interesting, and I'm sure we would like to improve the situation for the future Exiv2 versions. I propose to reopen the issue so it does not get forgotten in the closed issues.
Ah yes. However we're going to defer issues.
With Redmine, the UI allows you to set a target version for every issue. So my usual procedure is to deal with everything that I can and defer stuff that cannot be fixed for the release.
I believe we're quite close to v0.27. Most of the open issues are [Feature Request] and other matters that cannot be closed for v0.27. I agree that there is something quite interesting being discussed here, but several matters are clear:
1) This is a "nice to have, one day" issue.
2) We're not going to delay v0.27 to deal with this.
3) The code in question has been in service for years and shouldn't be modified so close to release.
I'd be happy to re-open this issue if it can be labelled "[Future Enhancement]" or something like that.
Sure, we just should not assign it to the project 0.27.
I haven't studied GitHub enough to know how it organises projects and versions. Can you assign this (and some other open issues) to the 0.28 project, please?
I'm back here with first results.
I completed the first stage of my unit test code to run Exiv2 through the digiKam API.
On my computer:
Note that Exiv2 0.26 is compiled with the XMP (internal), PNG, LENSDATA, NLS, UCS2, and SHARED cmake options; all others are disabled.
The photo collection is located in /mnt/data/photos
mount => /dev/sde1 on /mnt/data type ext4 (rw,relatime,data=ordered) du => 288 GB!
Now the unit test. The code is here:
https://cgit.kde.org/digikam.git/tree/core/tests/metadataengine/metareaderthreadtest.cpp
The collection contents are as follows:
Images: JP2: 30, JPG: 14945, KRA: 2, PGF: 23, PNG: 701, PPM: 10, PSD: 16, RAW-ARW: 2364, RAW-CR2: 166, RAW-CRW: 15, RAW-DCR: 3, RAW-DNG: 131, RAW-MRW: 60, RAW-NEF: 54, RAW-ORF: 8, RAW-PEF: 17, RAW-RAF: 517, RAW-RAW: 7, RAW-RW2: 15, RAW-RWL: 2, RAW-X3F: 5, TIFF: 163, XCF: 5, total: 19259
Videos: 3GP: 2, AVI: 662, MKV: 1754, MOV: 19, MP4: 59, MPEG: 5, MTS: 1, WMF: 1, total: 2503
Audio: MP3: 1, total: 1
Total Items: 21763, Albums: 1358, Path: /mnt/data/photos/
After 5 minutes of processing data (60% done), it crashes on a PNG image:
Thread 12 (Thread 0x7f156affd700 (LWP 29075)):
image = {_M_ptr = 0x7f15580214e0}
hasLoaded = false
__PRETTY_FUNCTION__ = "virtual bool Digikam::MetaEngine::load(const QString&)"
lock = {d = 0x7f15581397f0}
__PRETTY_FUNCTION__ = "virtual void Mytask::run()"
meta = {<Digikam::MetaEngine> = {_vptr.MetaEngine = 0x7f15aeb5bef0 <vtable for Digikam::DMetadata+16>, d = 0x7f1558021480}, <No data fields>}
So, I need to use Exiv2 0.27.
If 60% is done in a few minutes, the performance is not too bad after all. The amount of memory allocated is stable at 3.2 GB (all processes).
https://www.flickr.com/photos/digikam/45768905291/in/dateposted/
So, next stage: use Exiv2 0.27 from git/master, with exactly the same context.
Gilles Caulier
Gilles:
May I ask you to open an issue and we'll discuss it after we've completed v0.27. Luis and I are putting in long days (12+ hours/day) to get to v0.27.
We're talking about several different matters:
1) Johannes asked "why so many calls to stat()?". I've answered and he's satisfied that we close this issue.
2) I agreed with Luis to reopen this because the subject of caching FileIo::size() is worth considering in the future. This is really another matter. I'm going to open a new issue about that and close this issue.
3) I think you're talking about something concerning your 288GB of test data. Can you discuss this in a different thread?
Robin
No. It looks like with 0.27 there is also a problem with PNG files. This bug entry is about PNG, so my comments are valid; with 0.27 I can also get this trace: (gdb) Thread 13 (Thread 0x7fd3e6ffd700 (LWP 17081)):
image = {_M_ptr = 0x7fd3e00a8890}
hasLoaded = false
__PRETTY_FUNCTION__ = "virtual bool Digikam::MetaEngine::load(const QString&)"
lock = {d = 0x7fd3e00bb170}
__PRETTY_FUNCTION__ = "virtual void Mytask::run()"
meta = {<Digikam::MetaEngine> = {_vptr.MetaEngine = 0x7fd42b604ef0 <vtable for Digikam::DMetadata+16>, d = 0x7fd3e0092dc0}, <No data fields>}
My stress test shows a fragility in the PNG parser, that's clear. I need to find which file is responsible in my huge collection.
My test also checks the performance of the Exiv2 implementation when it's stressed the way it is under digiKam, with multi-core support.
So let me identify the PNG file first before concluding...
Gilles Caulier
No. It looks like with 0.27 there is also a problem with PNG files. This bug entry is about PNG
Well, but this bug isn't about all possible problems with PNG :-)
Really, you should open a new issue, because piling everything into the same issue leads only to total confusion for everyone involved.
OK, here is my result: there is in fact no corrupted image.
I reduced the set of files to process to PNG only, and I isolated the path where a lot of PNG files are stored; the crash is not located in a specific file.
In fact, the stress test indicates a degradation somewhere in the code. Too many files processed at the same time weakens the PNG parser.
If I process JPG + PNG it crashes more quickly. If I process JPG + PNG + TIFF it crashes more quickly again. If I remove the PNG files from the run, it does not crash!
To conclude: the PNG parser is the problem. So the original topic from this report is valid...
Gilles Caulier
So the original topic from this report is valid...
No ... the original topic was that we call stat() (too) much. This isn't related. Maybe there's a leak, maybe there's a bug, maybe there's something else - but it's certainly a different issue.
Not sure; opening too many PNG files at the same time bloats the system. Look at the crash trace: it's inside glibc. So it's a low-level call, probably stat()...
Performance results without PNG :
MetaReader have been completed:
Processing duration: 124.997 seconds
Root path: "/mnt/data/photos/"
Number of files: 33717
Direction: "Read from files"
Type-mimes: (".bmp", ".bw", ".cur", ".eps", ".epsf", ".epsi", ".exr", ".gif", ".icns", ".ico", ".kra", ".ora", ".pbm", ".pcx", ".pgm", ".pic", "", ".ppm", ".psd", ".ras", ".rgb", ".rgba", ".sgi", ".svg", ".svgz", ".tga", ".wbmp", ".webp", ".xbm", ".xcf", ".xpm", ".tiff", ".tif", ".jpg", ".jpeg", ".jpe", ".pgf", ".bay", ".bmq", ".cr2", ".crw", ".cs1", ".dc2", ".dcr", ".dng", ".erf", ".fff", ".hdr", ".k25", ".kdc", ".mdc", ".mos", ".mrw", ".nef", ".orf", ".pef", ".pxn", ".raf", ".raw", ".rdc", ".sr2", ".srf", ".x3f", ".arw", ".3fr", ".cine", ".ia", ".kc2", ".mef", ".nrw", ".qtk", ".rw2", ".sti", ".rwl", ".srw", "")
Engine settings: [MetaEngineSettingsContainer] exifRotate(true), exifSetOrientation(true), saveComments(false), saveDateTime(false), savePickLabel(false), saveColorLabel(false), saveRating(false), saveTemplate(false), saveTags(false), saveFaceTags(false), writeRawFiles(false), updateFileTimeStamp(true), rescanImageIfModified(false), clearMetadataIfRescan(false), useXMPSidecar4Reading(false), useLazySync(false), metadataWritingMode(1), rotationBehavior(QFlags(0x1|0x2|0x4)), sidecarExtensions(())
PASS : MetaReaderThreadTest::testMetaReaderThread()
RESULT : MetaReaderThreadTest::testMetaReaderThread(): 124,956 msecs per iteration (total: 124,956, iterations: 1)
PASS : MetaReaderThreadTest::cleanupTestCase()
Totals: 3 passed, 0 failed, 0 skipped, 0 blacklisted, 273242ms
Finished testing of MetaReaderThreadTest
124 seconds to parse 33717 files: it's not too bad...
Gilles
Thanks guys for all the feedback. For me, these kinds of issues are really interesting to investigate. I might invest some time on this topic during the next days in the context of the new issue created by @clanmills (#530 ).
This matter is closed. I'm locking this conversation because it is not helping us to reach Exiv2 v0.27.
I've opened a new issue, #530, concerning caching the file size, and I would appreciate this being left until 2019 for attention. Likewise, we appear to have started a discussion about PNG files, which I am asking you to defer.
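For reference, the idea behind caching the file size can be sketched like this. This is a hypothetical illustration, not the actual Exiv2 `FileIo`/`BasicIo` implementation: the class name `CachedFileIo` and its methods are invented for the example. The first call to `size()` does the `::stat()`; every later call (including `eof()` checks) returns the cached value instead of hitting the kernel again.

```cpp
#include <sys/stat.h>
#include <cstdio>

// Hypothetical sketch (not the Exiv2 API): stat() the file once,
// then answer size()/eof() from the cached result.
class CachedFileIo {
public:
    explicit CachedFileIo(const char* path)
        : f_(std::fopen(path, "rb")), size_(-1), path_(path) {}
    ~CachedFileIo() { if (f_) std::fclose(f_); }

    // First call performs ::stat(); later calls return the cached size.
    long size() {
        if (size_ < 0) {
            struct stat st;
            if (::stat(path_, &st) == 0) size_ = static_cast<long>(st.st_size);
        }
        return size_;
    }

    // "Have we read all the data?" -- compares the current position to
    // the cached size, matching the BasicIo::eof() semantics discussed
    // earlier (true once everything is read, unlike ::feof()).
    bool eof() { return f_ && std::ftell(f_) >= size(); }

    FILE* handle() { return f_; }

private:
    FILE* f_;
    long size_;
    const char* path_;
};
```

The trade-off #530 has to weigh is that a cached size goes stale if another process appends to the file mid-read, which is the "unlikely to change in the time it takes to read" assumption.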
I've found that when reading PNG files, (lib)exiv2 calls stat() excessively often:
Yes, the file in question is 114MB, but there's no good reason to call stat() on it over and over since it's unlikely to change in the time it takes to read :-) (also, given the various relatively recent security mitigations, doing syscalls has gotten slower)
FWIW, this was a file without any Exif data; perhaps if a chunk early in the file had any, you could abort the scan early?
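The early-abort idea can be sketched as a walk over the PNG chunk list that stops at the first IDAT: metadata chunks (eXIf, tEXt, iTXt, zTXt) are normally written before the image data, so a 114MB file never needs to be scanned to the end. This is an illustration of the suggestion, not Exiv2's actual PNG parser; the function name `scanPngMetadata` is invented for the example, and CRC validation is deliberately skipped.

```cpp
#include <cstdio>
#include <cstring>
#include <cstdint>
#include <string>
#include <vector>

// Return the types of metadata chunks found before the first IDAT chunk.
// PNG layout: 8-byte signature, then chunks of
// [4-byte big-endian length][4-byte type][data][4-byte CRC].
std::vector<std::string> scanPngMetadata(const char* path) {
    std::vector<std::string> found;
    FILE* f = std::fopen(path, "rb");
    if (!f) return found;

    unsigned char sig[8];
    static const unsigned char pngSig[8] = {0x89, 'P', 'N', 'G', '\r', '\n', 0x1a, '\n'};
    if (std::fread(sig, 1, 8, f) != 8 || std::memcmp(sig, pngSig, 8) != 0) {
        std::fclose(f);
        return found;  // not a PNG
    }

    unsigned char hdr[8];  // 4-byte length + 4-byte type
    while (std::fread(hdr, 1, 8, f) == 8) {
        uint32_t len = (uint32_t(hdr[0]) << 24) | (uint32_t(hdr[1]) << 16) |
                       (uint32_t(hdr[2]) << 8)  |  uint32_t(hdr[3]);
        std::string type(reinterpret_cast<char*>(hdr) + 4, 4);
        if (type == "IDAT" || type == "IEND")
            break;  // early abort: past the region where metadata lives
        if (type == "eXIf" || type == "tEXt" || type == "iTXt" || type == "zTXt")
            found.push_back(type);
        // Skip the chunk data and its 4-byte CRC without reading them.
        if (std::fseek(f, long(len) + 4, SEEK_CUR) != 0)
            break;
    }
    std::fclose(f);
    return found;
}
```

Because the loop only reads 8 bytes per chunk header and seeks past the payloads, the cost is proportional to the number of chunks before IDAT, not to the file size.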