bebo-dot-dev / m3u-epg-editor

a python m3u / epg optimizer

Processing times #70

Closed onestix closed 2 years ago

onestix commented 2 years ago

Hi @bebo-dot-dev, thank you for your amazing work on this project so far.

I have been looking at ways to slim down an EPG file (425 MB) that I currently have access to. It includes EPG data for approximately 6000 channels. I use m3u4u to manage and clean my m3u file, and I am looking for an efficient way to trim the data in my EPG file (m3u4u does not do that). My new playlist has approximately 2500 channels, and it takes a little over an hour to produce the corresponding EPG file using this script. I am using the following config:

{
    "m3uurl": "file:///channels.m3u",
    "epgurl": "file:///epg.xml",
    "groups": [
        ""
    ],
    "no_tvg_id": false,
    "no_epg": false,
    "force_epg": false,
    "no_sort": true,
    "outdirectory": "/output",
    "outfilename": "new",
    "log_enabled": true,
    "preserve_case": true
}

With that said, is the approximately 1 hr 10 min processing time expected here? I am running this on a decent NAS (Synology DS918+, with RAM upgraded to 32 GB). My CPU only peaks at 25% when the script is running. Note also that this is being executed in a Docker container (Ubuntu image with Python 3.10.4). The 'programme element' creation is currently taking approximately 2 seconds per channel. The initial channel element creation, on the other hand, is very quick and completes in a few seconds.

I have tried narrowing the --range down to 12 hours, without any luck; the list took the same time to process.
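For context, the log further down reports the range as a symmetric window of +/- N hours around "now". A minimal sketch of such a window check follows, assuming XMLTV-style `start` attributes of the form `YYYYMMDDhhmmss +ZZZZ`. The function name `in_range` and the constant `XMLTV_FORMAT` are hypothetical and not taken from the script.

```python
from datetime import datetime, timedelta, timezone

# Assumed XMLTV timestamp layout, e.g. "20220829172803 +0000".
XMLTV_FORMAT = "%Y%m%d%H%M%S %z"


def in_range(start_attr: str, now: datetime, range_hours: int) -> bool:
    """Return True if a programme start lies within +/- range_hours of now."""
    start = datetime.strptime(start_attr, XMLTV_FORMAT)
    window = timedelta(hours=range_hours)
    return now - window <= start <= now + window


# "now" must be timezone-aware so it can be compared with the parsed start.
now = datetime(2022, 8, 29, 18, 37, tzinfo=timezone.utc)
inside_week = in_range("20220905120000 +0000", now, 168)
outside_week = in_range("20220908120000 +0000", now, 168)
inside_half_day = in_range("20220829200000 +0000", now, 12)
```

A check like this is cheap per programme, so if most of the time is spent locating each channel's programmes rather than filtering them, a narrower window would not be expected to shorten the run much.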

2022-08-29T17:28:03.143055 run.py process started with Python v3.10.4 (main, Jun 29 2022, 12:14:53) [GCC 11.2.0]
2022-08-29T17:28:03.143644 input script arguments: Namespace(json_cfg='config.json', m3uurl=None, epgurl=None, request_headers=[], groups=None, groupmode='keep', discard_channels=None, include_channels=None, id_transforms=[], group_transforms=[], channel_transforms=[], range=168, sortchannels=None, xml_sort_type='none', tvh_start=None, tvh_offset=None, no_tvg_id=False, no_epg=False, force_epg=False, no_sort=False, http_for_images=False, preserve_case=False, outdirectory=None, outfilename=None, log_enabled=False)
2022-08-29T17:28:03.143889 json configuration: {"m3uurl": "file:///channels.m3u","epgurl": "file:///epg.xml","groups": [""],"no_tvg_id": false,"no_epg": false,"force_epg": false,"no_sort": true,"outdirectory": "/output","outfilename": "new","log_enabled": true,"preserve_case": true}
2022-08-29T17:28:03.144175 determined runtime script arguments: Namespace(json_cfg='config.json', m3uurl='file:///channels.m3u', epgurl='file:///epg.xml', request_headers={}, groups={''}, groupmode='keep', discard_channels=[], include_channels=[], id_transforms=[], group_transforms=[], channel_transforms=[], range=168, sortchannels=[], xml_sort_type='none', tvh_start=0, tvh_offset=0, no_tvg_id=False, no_epg=False, force_epg=False, no_sort=True, http_for_images=False, preserve_case=True, outdirectory='/output', outfilename='new', log_enabled=False, group_idx=[''])
2022-08-29T17:28:03.144235 performing HTTP GET request to file:///channels.m3u
2022-08-29T17:28:03.153432 saving retrieved m3u file: /output/original.m3u8
2022-08-29T17:28:03.155263 parsing m3u into a list of objects
2022-08-29T17:28:03.765736 m3u contains 2593 items
2022-08-29T17:28:03.766157 keeping channel groups in this list ['']
2022-08-29T17:28:03.878236 filtered m3u contains 7612 items
2022-08-29T17:28:03.878436 saving new m3u file: /output/new.m3u8
2022-08-29T17:28:03.927600 performing HTTP GET request to file:///epg.xml
2022-08-29T17:28:04.829700 saving retrieved epg file: /output/original.xml
2022-08-29T17:28:05.185184 creating new xml epg for 2593 m3u items
2022-08-29T17:28:14.796286 creating channel element for xyz
2022-08-29T17:28:14.797072 creating channel element for xyz
2022-08-29T17:28:14.797889 creating channel element for xyz
2022-08-29T17:28:14.798611 creating channel element for xyz
2022-08-29T17:28:14.799089 creating channel element for xyz
2022-08-29T17:28:14.801186 creating channel element for xyz
2022-08-29T17:28:14.802039 creating channel element for xyz
2022-08-29T17:28:14.802765 creating channel element for xyz
2022-08-29T17:28:14.803495 creating channel element for xyz
2022-08-29T17:28:14.804257 creating channel element for xyz
...
2022-08-29T18:37:01.446605 creating programme elements for xyz
2022-08-29T18:37:03.028149 creating programme elements for xyz
2022-08-29T18:37:04.658316 creating programme elements for xyz
2022-08-29T18:37:06.264646 creating programme elements for xyz
2022-08-29T18:37:07.852607 creating programme elements for xyz
2022-08-29T18:37:09.495411 creating programme elements for xyz
2022-08-29T18:37:11.116687 creating programme elements for xyz
2022-08-29T18:37:12.850419 creating programme elements for xyz
2022-08-29T18:37:14.588234 creating programme elements for xyz
2022-08-29T18:37:16.343684 creating programme elements for xyz
2022-08-29T18:37:18.149224 creating programme elements for xyz
2022-08-29T18:37:19.968244 creating programme elements for xyz
2022-08-29T18:37:21.701999 creating programme elements for xyz
2022-08-29T18:37:23.379243 creating programme elements for xyz
2022-08-29T18:37:25.130633 creating programme elements for xyz
2022-08-29T18:37:26.711339 creating programme elements for xyz
2022-08-29T18:37:28.315415 creating programme elements for xyz
2022-08-29T18:37:29.889484 creating programme elements for xyz
2022-08-29T18:37:31.469465 creating programme elements for xyz
2022-08-29T18:37:32.997105 configured epg programme start/stop range is +/-168hrs from now (22 Aug 2022 20:37 <-> 05 Sep 2022 20:37)
2022-08-29T18:37:32.997192 latest programme start timestamp found was: 08 Sep 2022 12:00
2022-08-29T18:37:32.997218 407884 programmes were added to the epg
2022-08-29T18:37:43.761088 saving new epg xml file: /output/new.xml
2022-08-29T18:37:45.704292 saving to log: /output/process.log
2022-08-29T18:37:45.709479 script runtime: 9 minutes 42 seconds
2022-08-29T18:37:45.709550 process completed

Interestingly, the log reports a script runtime of 9 minutes 42 seconds, which is not correct; see the actual log timestamps.

Would you have any thoughts on how I can speed this up?

Update: I just ran this directly on the NAS host and I am seeing very similar runtimes, approximately 2 seconds per channel when creating programme elements.

bebo-dot-dev commented 2 years ago

Hi there, thanks for your issue report.

The first thing to say is that you've uncovered an edge case bug in the script runtime calculation. In the above example the script runtime was 1 hr 9 minutes 42 seconds but the script doesn't currently include the hour part because I never envisaged script runtime ever being this huge :)
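As an illustration of the kind of calculation involved, here is a minimal sketch that formats an elapsed time with an hour part when there is one. It is not the script's actual code: the function name `format_runtime` is hypothetical, and the two timestamps are the first and last entries from the log above.

```python
from datetime import datetime, timedelta


def format_runtime(elapsed: timedelta) -> str:
    """Format elapsed time, including the hour part only when it is non-zero."""
    total_seconds = int(elapsed.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    text = f"{minutes} minutes {seconds} seconds"
    if hours:
        text = f"{hours} hr {text}"
    return text


# Start and end timestamps copied from the log in this issue.
started = datetime.fromisoformat("2022-08-29T17:28:03.143055")
finished = datetime.fromisoformat("2022-08-29T18:37:45.709479")
runtime_text = format_runtime(finished - started)
```

Working from the total number of seconds and splitting with `divmod` keeps the hours from being silently dropped once a run exceeds 60 minutes.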

The second thing to say is that there is something very wrong with script performance at the moment with the volume of data that you're processing through. There's a possibility that I'll be able to tweak code to achieve improved performance but to do that I would need test data which is representative of what you're using to enable me to debug / test.

If you're willing to strip out any/all sensitive values from your m3u and epg files (passwords, hostnames etc) and attach the files here, I'll take a look asap.

onestix commented 2 years ago

Hi @bebo-dot-dev, thank you so much for your prompt response. I have sent you an email with both the m3u and epg files. Let me know if you have received this.

Cheers, onestix

bebo-dot-dev commented 2 years ago

Hi again, email and data received thanks.

I'll test asap and update back here when I have something concrete for you :)

bebo-dot-dev commented 2 years ago

Following initial testing, early feedback is that the two-second delay per channel is caused by the time it takes to find all programme elements in the EPG xml data that match a given channel. This happens on the following line, which is called iteratively for every wanted channel in the M3U:

https://github.com/bebo-dot-dev/m3u-epg-editor/blob/master/m3u-epg-editor-py3.py#L934

To put this in perspective, the supplied EPG I'm testing with is 465MB and contains 1,232,817 programme elements. Although the entire xml tree is in memory, .findall() is not efficient enough to work reasonably well with this volume of data.

I'm going to play with a few different strategies to see if I can improve performance in this area but I imagine that it will be some time before I have a fix (..if I manage to find a fix).
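One strategy of this kind, sketched here as an assumption and not as the change that was eventually made, is to walk the programme elements once and group them by their `channel` attribute, so that each wanted channel becomes a dictionary lookup instead of a scan of the whole tree. The sample XML, the `index_programmes` helper, and the channel ids are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Tiny made-up XMLTV sample standing in for the real EPG.
SAMPLE_EPG = """<tv>
  <channel id="a.uk"><display-name>A</display-name></channel>
  <channel id="b.uk"><display-name>B</display-name></channel>
  <programme channel="a.uk" start="20220829170000 +0000"><title>News</title></programme>
  <programme channel="b.uk" start="20220829170000 +0000"><title>Film</title></programme>
  <programme channel="a.uk" start="20220829180000 +0000"><title>Sport</title></programme>
</tv>"""


def index_programmes(root):
    """Group programme elements by channel id in a single pass over the tree."""
    by_channel = defaultdict(list)
    for programme in root.iter("programme"):
        by_channel[programme.get("channel")].append(programme)
    return by_channel


root = ET.fromstring(SAMPLE_EPG)
index = index_programmes(root)

# A per-channel scan such as root.findall("programme[@channel='a.uk']")
# visits every programme element for every wanted channel; the index
# replaces that with one pass plus one dictionary lookup per channel.
wanted = ["a.uk", "c.uk"]
found = {channel_id: index.get(channel_id, []) for channel_id in wanted}
titles_for_a = [p.findtext("title") for p in found["a.uk"]]
```

With roughly 2,500 wanted channels and 1.2 million programme elements, the per-channel scan does work proportional to the product of the two numbers, while the indexed version does work proportional to their sum.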

Just wanted to let you know what I've found so far, I'll stick with it.

bebo-dot-dev commented 2 years ago

Hi again, there's a candidate fix for this issue in the https://github.com/bebo-dot-dev/m3u-epg-editor/tree/epg-performance branch now.

EPG channel lookup performance is improved. I retested with the 465MB sample EPG file and it's now being processed in about 1 minute 11 seconds here, which is a slight improvement over the previous 1hr+ :)

Please give it a try and let me know how you go thanks.

onestix commented 2 years ago

Hi @bebo-dot-dev, you are amazing! Just ran the script and it completed for me in 2 minutes and 31 seconds with my actual dataset. Thank you so much for your help on this!

bebo-dot-dev commented 2 years ago

Hi again, no problem, sounds like a winner :)

Merging the change now.

bebo-dot-dev commented 2 years ago

All merged, feel free to use https://github.com/bebo-dot-dev/m3u-epg-editor/tree/master going forward again thanks