Closed HMAZonderland closed 6 years ago
Hi Hugo, there will definitely be some performance improvements we can make. Would it be possible for you to send me a version of your file?
I'm afraid I cannot post it here publicly because it contains pricing details between a client and their supplier. If you can guarantee me the file won't be posted here or anywhere else on the web, I can send it to you directly. I guess it's too much work to obfuscate the file contents since it's this large.
Of course, please send to git@duncanc.co.uk
and I will ensure it goes no further
@HMAZonderland could you maybe anonymize this file and upload it? I am coding a Python version of Craig's PHP edifact library, and would like to help here a bit, or see if my version has the same problems. Alternatively, you could send a copy to github@nerdocs.at; I won't give it to anybody else either. Or I could try to anonymize it and, with your permission after you've checked it, publish the anonymized version for metroplex-systems/edifact and nerdocs/pydifact.
@duncan3dc I've sent you the files, did you receive them? @nerdoc are you having performance issues with the Python version as well?
Never checked it, as I have no file larger than 10kb to test...
Hi, thanks. Although I'm testing with another lib than @duncan3dc's (my Python one), I just did a profiling run with the first 100 lines of the file (thanks to @HMAZonderland), and it was interesting. Maybe you have similar results, as most of the code is a 1:1 PHP->Python transcode:
ncalls tottime percall cumtime percall filename:lineno(function)
890 4.647 0.005 4.647 0.005 tokenizer.py:85(get_next_char)
436 0.024 0.000 4.701 0.011 tokenizer.py:94(get_next_token)
1 0.022 0.022 0.038 0.038 {method 'read' of '_io.TextIOWrapper' objects}
1 0.017 0.017 4.746 4.746 parser.py:32(parse)
1 0.016 0.016 0.016 0.016 {built-in method _codecs.latin_1_decode}
1 0.015 0.015 0.015 0.015 {method 'lstrip' of 'str' objects}
890 0.014 0.000 4.661 0.005 tokenizer.py:67(read_next_char)
31 0.007 0.000 0.007 0.000 {built-in method marshal.loads}
791 0.007 0.000 0.007 0.000 tokenizer.py:134(is_control_character)
849 0.006 0.000 4.455 0.005 tokenizer.py:146(store_current_char_and_read_next)
1 0.004 0.004 4.713 4.713 tokenizer.py:44(get_tokens)
437 0.004 0.000 0.004 0.000 token.py:32(__init__)
1038 0.004 0.000 0.005 0.000 tokenizer.py:160(end_of_message)
110/107 0.003 0.000 0.005 0.000 {built-in method builtins.__build_class__}
36/1 0.003 0.000 4.815 4.815 {built-in method builtins.exec}
1492/1479 0.002 0.000 0.002 0.000 {built-in method builtins.len}
So you can see that get_next_char takes most of the time. That one is worth optimizing ;-)
I think PHP will be similar.
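For reference, a profile like the listing above can be produced with Python's built-in cProfile module. The parse() function here is just a stand-in for the real parser entry point, not pydifact's actual API:

```python
import cProfile
import io
import pstats

def parse(message):
    # Stand-in for the real parser: walk the message one character at a
    # time, the same hot loop the profile above points at.
    tokens = []
    for char in message:
        tokens.append(char)
    return tokens

message = "UNA:+.? 'UNB+UNOC:3+SENDER+RECEIVER'" * 1000

profiler = cProfile.Profile()
profiler.enable()
parse(message)
profiler.disable()

stream = io.StringIO()
# Sort by cumulative time, like the listing above, and show the top 5 rows
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

Running this against the real tokenizer instead of the stand-in should reproduce a table like the one above.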
The first issue in PHP is the multibyte support, which is taking 50% of the time
OK. But have a look at your function:
private function getNextChar()
{
$char = mb_substr($this->message, 0, 1);
$this->message = mb_substr($this->message, 1);
return $char;
}
I'm not as familiar with PHP as I am with Python, but mb_substr
makes a copy of the string, right? So you copy the whole 10MB string into another one just to cut one character off the front. And that thousands of times.
I had a fix in 2 minutes - I'll post my commit in my library. I just added an index counter to iterate over the original message, so there is only one copy of the message in memory.
100 lines before: 5-6 sec, after: <10ms.
@HMAZonderland, your 2MB test file is parsed in 9 seconds now. That's not brilliant, but better than the >10 minutes before ;-)
https://github.com/nerdocs/pydifact/commit/5997a64cba73ad6784a13cb7767d727008aa48db
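The commit has the details; the idea boils down to something like this (class and method names here are illustrative, not the actual pydifact API):

```python
class Tokenizer:
    """Minimal sketch of the index-counter fix: instead of slicing the
    message on every read (which copies the whole remaining string each
    time), keep the message intact and advance a position counter."""

    def __init__(self, message: str):
        self.message = message
        self.position = 0

    def get_next_char(self):
        # O(1): indexing into a str never copies the rest of the message
        if self.position >= len(self.message):
            return None
        char = self.message[self.position]
        self.position += 1
        return char

tokenizer = Tokenizer("UNB+UNOC:3'")
chars = []
while (c := tokenizer.get_next_char()) is not None:
    chars.append(c)
print("".join(chars))  # → UNB+UNOC:3'
```

The same trick works in PHP: keep `$this->message` untouched and read single characters via an offset instead of reassigning the string with mb_substr on every call.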
@duncan3dc I'm sure you can use that too.
@nerdoc That will help a little, but not much. Most of the work is in handling multi-byte characters. Does your library support multi-byte character encodings?
Other users have the same issue. Looking at Stack Overflow I saw this:
Is it right that common EDI files use ISO 8859-1 as their default encoding?
@duncan3dc Hm. I coded in Python 3, so it uses UTF-8 by default for internal strings. But I can specify the encoding in the open() function for the file, and from what I've read, ISO 8859-1 is commonly used in EDI, at least in Europe. So I decided to use that as the default when opening files. That doesn't help much either, I'm afraid.
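In practice that just means decoding the whole file once up front instead of doing per-character multi-byte work later. A minimal sketch, assuming ISO 8859-1 as the fallback (the load_message helper is hypothetical, not a library function):

```python
from io import StringIO

def load_message(path_or_file, encoding="iso8859-1"):
    """Read a whole EDIFACT message into a str, decoding once up front.
    ISO 8859-1 as the default is an assumption based on common European
    EDI usage; the UNB segment of a given interchange is authoritative."""
    if isinstance(path_or_file, str):
        with open(path_or_file, encoding=encoding) as f:
            return f.read()
    return path_or_file.read()

# Usage with a file-like object standing in for a real file on disk:
message = load_message(StringIO("UNB+UNOC:3+SENDER+RECEIVER'"))
print(message)  # → UNB+UNOC:3+SENDER+RECEIVER'
```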
See here:
EDIFACT standards define a number of character sets, coded in the UNB segment as UNOA,
UNOB, UNOC, UNOD etc. EDItEUR has adopted UNOC as the standard set for book and serials
trading. This character set permits the representation of a full repertoire of special
characters, including accents, for most European languages which use the Latin alphabet.
It corresponds to the international standard character set ISO 8859.1.
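Based on the quote above, the UNB syntax identifier could be mapped to a Python codec along these lines. The UNOD–UNOF rows follow the EDIFACT syntax standard (ISO 9735); treat this table as an assumption to verify against the spec version you target:

```python
# Map UNB syntax identifiers to Python codec names (sketch, verify
# against the ISO 9735 version in use).
SYNTAX_CODECS = {
    "UNOA": "ascii",      # upper-case subset of ISO 646
    "UNOB": "ascii",      # full ISO 646
    "UNOC": "iso8859-1",  # Latin-1, the EDItEUR standard set
    "UNOD": "iso8859-2",
    "UNOE": "iso8859-5",
    "UNOF": "iso8859-7",
}

def codec_for(syntax_identifier: str) -> str:
    # Fall back to Latin-1 for unknown identifiers, per the discussion above
    return SYNTAX_CODECS.get(syntax_identifier, "iso8859-1")

print(codec_for("UNOC"))  # → iso8859-1
```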
and Microsoft's
@HMAZonderland I've just pushed an improvement to the multibyte handling. For a 300kb file the parse time has been reduced from 4 minutes to 4 seconds.
It still doesn't perform well enough on your files though, so I'll continue to investigate further performance improvements
@duncan3dc sounds like a good improvement so far!
Now (using @nerdoc's advice, thanks!) I've reduced a 2.7MB file from 5 minutes to 20 seconds.
This should make it practical to use on your large files now. I'll continue to see if further improvements can be made, and add some regression tests to ensure performance stays reasonable in future.
I'll leave this issue open until the work has been released, thanks for your help :+1:
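The regression tests mentioned above could take roughly this shape: parse a synthetically large message and assert a generous wall-clock ceiling, so only order-of-magnitude regressions (like the original quadratic slicing) fail the build. The parse() stub and the threshold are assumptions, not the library's actual test suite:

```python
import time

def parse(message):
    # Stand-in for the real parser entry point; assumption, not the
    # library's API.
    return list(message)

def test_parse_large_message_is_fast():
    # ~1.2MB of repeated PRICAT-style line segments
    message = "LIN+1++4000862141404:EN'" * 50_000
    start = time.perf_counter()
    parse(message)
    elapsed = time.perf_counter() - start
    # Generous ceiling so CI noise doesn't cause flaky failures, but a
    # return to quadratic behaviour (minutes per MB) would still fail.
    assert elapsed < 5.0, f"parsing took {elapsed:.1f}s, expected < 5s"

test_parse_large_message_is_fast()
print("ok")
```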
Good library you have there, but I'm running into performance issues. It takes the library a long time to process large (PRICAT) files. I have some EDI files that are over 1MB (even one of 10MB), roughly 4000 EAN numbers, or some 60,000 lines.
Is there anything I can tweak in the library to speed things up?