Closed HMAZonderland closed 6 years ago
Hi Hugo, there will definitely be some performance improvements we can make. Would it be possible for you to send me a version of your file?
I'm afraid I cannot post it here publicly because it contains pricing details between a client and their supplier. If you can guarantee me the file won't be posted here or anywhere else on the web, I can send it to you directly. I guess it's too much work to obfuscate the file contents since it's this large.
Of course, please send to git@duncanc.co.uk
and I will ensure it goes no further
@HMAZonderland could you maybe anonymize this file and upload it? I am coding a Python version of Craig's PHP edifact library, and would like to help here a bit, or see if my version has the same problems. Alternatively, you could send a copy to github@nerdocs.at; I won't give it to anybody else either. Or I could try to anonymize it and, with your permission after you've checked it, publish the anonymized version for metroplex-systems/edifact and nerdocs/pydifact.
@duncan3dc I've sent you the files, did you receive them? @nerdoc are you having performance issues with the Python version as well?
Never checked it, as I have no file larger than 10kb to test...
Hi, thanks. Although I'm testing with another lib than @duncan3dc's (my Python one), I just did a profiling run with the first 100 lines of the file (thanks to @HMAZonderland), and it was interesting. Maybe you have similar results, as most of the code is a 1:1 PHP->Python transcode:
ncalls tottime percall cumtime percall filename:lineno(function)
890 4.647 0.005 4.647 0.005 tokenizer.py:85(get_next_char)
436 0.024 0.000 4.701 0.011 tokenizer.py:94(get_next_token)
1 0.022 0.022 0.038 0.038 {method 'read' of '_io.TextIOWrapper' objects}
1 0.017 0.017 4.746 4.746 parser.py:32(parse)
1 0.016 0.016 0.016 0.016 {built-in method _codecs.latin_1_decode}
1 0.015 0.015 0.015 0.015 {method 'lstrip' of 'str' objects}
890 0.014 0.000 4.661 0.005 tokenizer.py:67(read_next_char)
31 0.007 0.000 0.007 0.000 {built-in method marshal.loads}
791 0.007 0.000 0.007 0.000 tokenizer.py:134(is_control_character)
849 0.006 0.000 4.455 0.005 tokenizer.py:146(store_current_char_and_read_next)
1 0.004 0.004 4.713 4.713 tokenizer.py:44(get_tokens)
437 0.004 0.000 0.004 0.000 token.py:32(__init__)
1038 0.004 0.000 0.005 0.000 tokenizer.py:160(end_of_message)
110/107 0.003 0.000 0.005 0.000 {built-in method builtins.__build_class__}
36/1 0.003 0.000 4.815 4.815 {built-in method builtins.exec}
1492/1479 0.002 0.000 0.002 0.000 {built-in method builtins.len}
So you can see that get_next_char takes most of the time. That one is worth optimizing ;-)
I think PHP will be similar.
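For reference, a profile like the listing above can be produced with Python's built-in cProfile module. The parse() function here is just a stand-in for the real parser entry point, not pydifact's actual API:

```python
import cProfile
import io
import pstats

def parse(message):
    # Stand-in for the real parser: walk the message one character at a
    # time, the same hot loop the profile above points at.
    tokens = []
    for char in message:
        tokens.append(char)
    return tokens

message = "UNA:+.? 'UNB+UNOC:3+SENDER+RECEIVER'" * 1000

profiler = cProfile.Profile()
profiler.enable()
parse(message)
profiler.disable()

stream = io.StringIO()
# Sort by cumulative time, like the listing above, and show the top 5 rows
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Running this against the real tokenizer instead of the stand-in should reproduce a table like the one above.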
The first issue in PHP is the multibyte support, which is taking 50% of the time
OK. But have a look at your function:
private function getNextChar()
{
$char = mb_substr($this->message, 0, 1);
$this->message = mb_substr($this->message, 1);
return $char;
}
I'm not as familiar with PHP as I am with Python, but mb_substr
makes a copy of the string, right? So you copy the whole 10MB string into another one just to cut one character off the front. And that thousands of times.
I had a fix in 2 minutes - I'll post my commit in my library. I just added an index counter to iterate over the original message, so there is only one copy of the message in memory.
100 lines before: 5-6 sec, after: <10ms.
@HMAZonderland, your 2MB test file is parsed in 9 seconds now. That's not brilliant, but better than the >10 minutes before ;-)
https://github.com/nerdocs/pydifact/commit/5997a64cba73ad6784a13cb7767d727008aa48db
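The commit has the details; the idea boils down to something like this (class and method names here are illustrative, not the actual pydifact API):

```python
class Tokenizer:
    """Minimal sketch of the index-counter fix: instead of slicing the
    message on every read (which copies the whole remaining string each
    time), keep the message intact and advance a position counter."""

    def __init__(self, message: str):
        self.message = message
        self.position = 0

    def get_next_char(self):
        # O(1): indexing into a str never copies the rest of the message
        if self.position >= len(self.message):
            return None
        char = self.message[self.position]
        self.position += 1
        return char

tokenizer = Tokenizer("UNB+UNOC:3'")
chars = []
while (c := tokenizer.get_next_char()) is not None:
    chars.append(c)
print("".join(chars))  # → UNB+UNOC:3'
```

The same trick works in PHP: keep `$this->message` untouched and read single characters via an offset instead of reassigning the string with mb_substr on every call.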
@duncan3dc I'm sure you can use that too.
@nerdoc That will help a little, but not much. Most of the work is in handling multi-byte characters. Does your library support multi-byte character encodings?
Other users have the same issue. Looking at Stack Overflow I saw this:
Is it right that common EDI files use ISO 8859-1 as their default encoding?
@duncan3dc Hm. I coded in Python 3, so it uses UTF-8 by default for internal strings. But I can specify the encoding in the open() function for the file, and from what I've read, ISO 8859-1 is commonly used in EDI, at least in Europe. So I decided to use that as the default when opening files. That doesn't help much either, I'm afraid.
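In practice that just means decoding the whole file once up front instead of doing per-character multi-byte work later. A minimal sketch, assuming ISO 8859-1 as the fallback (the load_message helper is hypothetical, not a library function):

```python
from io import StringIO

def load_message(path_or_file, encoding="iso8859-1"):
    """Read a whole EDIFACT message into a str, decoding once up front.
    ISO 8859-1 as the default is an assumption based on common European
    EDI usage; the UNB segment of a given interchange is authoritative."""
    if isinstance(path_or_file, str):
        with open(path_or_file, encoding=encoding) as f:
            return f.read()
    return path_or_file.read()

# Usage with a file-like object standing in for a real file on disk:
message = load_message(StringIO("UNB+UNOC:3+SENDER+RECEIVER'"))
print(message)  # → UNB+UNOC:3+SENDER+RECEIVER'
```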
See here:
EDIFACT standards define a number of character sets, coded in the UNB segment as UNOA,
UNOB, UNOC, UNOD etc. EDItEUR has adopted UNOC as the standard set for book and serials
trading. This character set permits the representation of a full repertoire of special
characters, including accents, for most European languages which use the Latin alphabet.
It corresponds to the international standard character set ISO 8859.1.
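Based on the quote above, the UNB syntax identifier could be mapped to a Python codec along these lines. The UNOD–UNOF rows follow the EDIFACT syntax standard (ISO 9735); treat this table as an assumption to verify against the spec version you target:

```python
# Map UNB syntax identifiers to Python codec names (sketch, verify
# against the ISO 9735 version in use).
SYNTAX_CODECS = {
    "UNOA": "ascii",      # upper-case subset of ISO 646
    "UNOB": "ascii",      # full ISO 646
    "UNOC": "iso8859-1",  # Latin-1, the EDItEUR standard set
    "UNOD": "iso8859-2",
    "UNOE": "iso8859-5",
    "UNOF": "iso8859-7",
}

def codec_for(syntax_identifier: str) -> str:
    # Fall back to Latin-1 for unknown identifiers, per the discussion above
    return SYNTAX_CODECS.get(syntax_identifier, "iso8859-1")

print(codec_for("UNOC"))  # → iso8859-1
```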
and Microsoft's
@HMAZonderland I've just pushed an improvement to the multibyte handling. For a 300kb file the parse time has been reduced from 4 minutes to 4 seconds.
It still doesn't perform well enough on your files though, so I'll continue to investigate further performance improvements
@duncan3dc sounds like a good improvement so far!
Now (using @nerdoc's advice, thanks!) I've reduced a 2.7MB file from 5 minutes to 20 seconds.
This should make it practical to use on your large files now. I'll continue to see if further improvements can be made, and add some regression tests to ensure performance stays reasonable in future.
I'll leave this issue open until the work has been released, thanks for your help :+1:
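The regression tests mentioned above could take roughly this shape: parse a synthetically large message and assert a generous wall-clock ceiling, so only order-of-magnitude regressions (like the original quadratic slicing) fail the build. The parse() stub and the threshold are assumptions, not the library's actual test suite:

```python
import time

def parse(message):
    # Stand-in for the real parser entry point; assumption, not the
    # library's API.
    return list(message)

def test_parse_large_message_is_fast():
    # ~1.2MB of repeated PRICAT-style line segments
    message = "LIN+1++4000862141404:EN'" * 50_000
    start = time.perf_counter()
    parse(message)
    elapsed = time.perf_counter() - start
    # Generous ceiling so CI noise doesn't cause flaky failures, but a
    # return to quadratic behaviour (minutes per MB) would still fail.
    assert elapsed < 5.0, f"parsing took {elapsed:.1f}s, expected < 5s"

test_parse_large_message_is_fast()
print("ok")
```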
Good library you have there, but I'm running into performance issues. It takes the library a long time to process large (PRICAT) files. I have some EDI files that are over 1MB (even one of 10MB), roughly 4000 EAN numbers, or some 60,000 lines.
Is there anything I can tweak in the library to speed things up?