alvra / django-spotnet

A Django app to manage and download posts from Spotnet.
GNU General Public License v3.0

Use XOver instead of HEAD/Article #9

Closed Xaroth closed 3 years ago

Xaroth commented 11 years ago

Hi,

First of all, your code was something of an inspiration when I designed my own Django Spotnet app; since then, however, I have made some significant speed improvements.

1) I dropped the header/article checks. The main reason is that they were slow, and posts from 2011 onwards (and a lot of earlier ones) have all the important information set in the headers you receive when doing an XOVER.

2) Your method of determining the last post date by doing a STAT on the last msg id fails; I would personally stay away from it. Make a model for the group you are trying to update and store it in there (or, as I do, store the post id and group in the spots themselves, so you can always query it).

3) Use the RSA signing method Spotweb uses (it took me a while to figure out what was actually happening, but it works great) to verify the validity of a post. The same goes for dispose messages: only validated dispose messages should be accepted, and they can then be used to directly remove the post in question (and create a marker on the targeted msgid so you won't load it back in later).

4) You must be thinking: but what about the data that isn't in the XOVER data? Simple: load it on demand. The main Spotnet application does so too. It XOVERs to find the new posts, parses them, stores them, and when you double-click on one, it does a HEAD/ARTICLE call to get all the other info (NZB segments, image segments, even comments are done this way). That means you only store info in your database that you actually use.

5) I've personally found that using regexes works a lot faster than manually parsing data character by character. For example, your category parser can be replaced with this regex:

    re.compile('([abcdz])([0-9]{1,2})')

and the invalid character remover:

    re.compile(r'[^\x20-\x7F]+')
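As a rough illustration of how the two patterns above might be wrapped for use, here is a minimal sketch. The function names and the sample category string are made up for the example and are not taken from either code base; the second pattern uses `+` instead of `*`, which gives the same result without matching empty strings.

```python
import re

# Patterns from the comment above, written as raw strings.
CATEGORY_RE = re.compile(r"([abcdz])([0-9]{1,2})")
INVALID_CHARS_RE = re.compile(r"[^\x20-\x7F]+")


def parse_categories(raw):
    """Return (letter, number) pairs found in a category string.

    Hypothetical helper: the real category format may carry more
    meaning than this sketch extracts.
    """
    return [(letter, int(number)) for letter, number in CATEGORY_RE.findall(raw)]


def strip_invalid(text):
    """Drop every character outside the 0x20-0x7F range."""
    return INVALID_CHARS_RE.sub("", text)


if __name__ == "__main__":
    # Sample inputs are invented for illustration only.
    print(parse_categories("a9b4c11d2z3"))
    print(strip_invalid("Caf\xe9 \x00test"))
```

Compiling the patterns once at module level avoids recompiling them for every post, which matters when parsing hundreds of thousands of items.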

My current test code manages to parse around 250 items every 3 seconds, or about 250k posts per hour. There's still a lot of room for improvement, so I wouldn't be surprised if I manage to hit double that in the not-too-distant future. On the same machine, your code managed around 35k items per hour.
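To make points 1 and 2 a bit more concrete, here is a minimal sketch of parsing XOVER response lines and tracking the highest article number seen for a group. It assumes the standard overview field order (article number, subject, from, date, message-id, references, bytes, lines); all names are hypothetical, it is not the code from either project, and how Spotnet packs spot details into these fields is not modelled here.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Overview(NamedTuple):
    number: int       # article number within the group
    subject: str
    sender: str
    date: str
    message_id: str
    references: str
    size: int         # byte count reported by the server
    lines: int        # line count reported by the server


def parse_overview_line(line: str) -> Overview:
    """Split one tab-separated overview line into its standard fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 overview fields, got %d" % len(fields))
    number, subject, sender, date, message_id, references, size, lines = fields[:8]
    # Servers may leave the byte/line counts empty, so fall back to 0.
    return Overview(int(number), subject, sender, date, message_id,
                    references, int(size or 0), int(lines or 0))


def new_entries(raw_lines: Iterable[str], last_seen: int) -> Tuple[List[Overview], int]:
    """Return entries newer than last_seen plus the updated high-water mark.

    The returned mark is what would be stored per group (for example on a
    model) instead of being re-derived with a STAT call.
    """
    entries = [parse_overview_line(l) for l in raw_lines if l.strip()]
    fresh = [e for e in entries if e.number > last_seen]
    high = max([last_seen] + [e.number for e in fresh])
    return fresh, high
```

Anything not present in the overview fields would then be fetched on demand with a HEAD/ARTICLE call, as described in point 4.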

I might seem a complete arse for not sharing my code, but it's a private project so I'm not at liberty to discuss the actual internals. Still, I couldn't resist helping a fellow Django coder out.