NNTmux / newznab-tmux

Laravel based usenet indexer
GNU General Public License v3.0

Backfill gets stuck #1166

Closed egyptianbman closed 3 years ago

egyptianbman commented 3 years ago

Describe the bug

I'm working to backfill various groups and am finding that backfill gets stuck prematurely. I'm not sure what's causing it to get stuck early and would be happy to dig deeper into it, but wanted to see if you have any insight before I start.

One example is alt.binaries.multimedia. I enabled backfill some time ago and there were around 10B parts remaining when it started. Now, it's down to 5,232,777,007. Unfortunately, the backfilling doesn't seem to be progressing past that as can be seen in the following screenshot:

image

DrakeJones commented 3 years ago

An interesting dilemma, @egyptianbman. Oddly enough I'm backfilling that very same group now, but only trying to go back 12 years and am nowhere near that yet, though I've backfilled a number of groups that far already. My provider info (eweka) shows about 10.6B parts total in a.b.multimedia have been indexed at this time, but I suspect the number of parts actually retained doesn't go that far back.

How many years' worth of collections did you get from the 5B parts you did get through? That might be the actual end of your provider's retention rainbow. How far back are you actually trying to go in years? Which provider are you using? As an experiment, if you have a clone of your NNTmux lying around, you could switch providers and see what a.b.multimedia does.

You're stuck with 15K days to go? That's 41 years. Usenet was only born in 1980, 42 years ago, and it was text only at that time. yEnc was introduced in 2001 (I just learned), so that's probably the ultimate retention limit for "practical" purposes. If you want a repository with "ancient" historical data, read this: https://www.vice.com/en/article/pky7km/usenet-archive-utzoo-online

(Maybe not the appropriate forum for this, but please indulge me).

DariusIII commented 3 years ago

I agree with what @DrakeJones said. Are you sure your provider goes back that far?

ghost commented 3 years ago

Probably not, seeing as most don't even get over 4000 days.

egyptianbman commented 3 years ago

The usenet server I use is https://usenetserver.com/ which claims to have 4646 days retention. I initially set the date to 1980 just to ensure I go back as far as possible but after reading the responses, I updated it to 2008-8-15 since that's how far back usenetserver.com seems to go. I re-enabled alt.binaries.multimedia and now see this: image

So while the number of days more accurately reflects the retention my connection supposedly has access to, the number of articles has not changed. Unfortunately, I only have the one connection so I can't test others.

DrakeJones commented 3 years ago

4646 days of retention is about 12.7 years. How far back did you successfully backfill with the target date set to 40 years?
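
The days-to-years arithmetic in this thread can be sketched in PHP as follows. This is only an illustration: `daysToYears` and `retentionCutoff` are made-up helpers, not part of NNTmux, and the reference date (2021-07-27, the date on the error log posted in this issue) is an assumption standing in for "today".

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: convert days to years, using 365.25 days per year. */
function daysToYears(int $days): float
{
    return round($days / 365.25, 1);
}

/**
 * Hypothetical helper: the oldest date a provider with $retentionDays of
 * retention could still serve, counted back from $now.
 */
function retentionCutoff(int $retentionDays, DateTimeImmutable $now): DateTimeImmutable
{
    return $now->sub(new DateInterval('P' . $retentionDays . 'D'));
}

// Assumed reference date; swap in the real current date when checking a provider.
$now = new DateTimeImmutable('2021-07-27', new DateTimeZone('UTC'));

echo daysToYears(4646), " years of claimed retention\n";
echo daysToYears(15000), " years for a 15K-day backfill target\n";
echo 'Cutoff for 4646 days: ', retentionCutoff(4646, $now)->format('Y-m-d'), "\n";
```

A backfill target older than the cutoff asks the provider for articles it may no longer hold, so the remaining day count can overstate what is actually reachable.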

egyptianbman commented 3 years ago

According to the following query:

SELECT MIN(r.postdate)
FROM releases r
INNER JOIN releases_groups rg ON r.id = rg.releases_id
INNER JOIN usenet_groups ug ON rg.groups_id = ug.id
WHERE ug.name LIKE 'alt.binaries.multimedia'

The oldest release I have is dated 2008-08-14 22:07:51 which falls in line with the retention.

DrakeJones commented 3 years ago

FWIW, I get a similar MIN(r.postdate) result from that query for alt.binaries.multimedia: 2008-08-16, but I've only backfilled that group 9 months back. Running that query against alt.binaries.sounds.lossless.24bit yields 2012-04-08, and my oldest post for this fully-backfilled group is indeed 2012-04-08. But running the query against groups that I've never activated still gives dates; e.g., alt.binaries.uzenet yields a MIN(r.postdate) of 2008-08-31. So that query doesn't give you your oldest post; it gives you the provider's oldest post.

I don't know where groups' MIN(r.postdate) are generated/pulled from, but they do seem to reflect actual provider retention, as far as I can see.

egyptianbman commented 3 years ago

I just ran into this issue with alt.binaries.blu-ray and decided to dig deeper. What I found was that an error is being thrown by misc/update/multiprocessing/.do_not_run/switch.php but is not being reported due to the multiprocessing code hiding errors.

The error is:

[2021-07-27 03:06:25] local.ERROR: #0 /path/to/Blacklight/Binaries.php(898): Illuminate\Foundation\Bootstrap\HandleExceptions->handleError()
#1 /path/to/Blacklight/Binaries.php(686): Blacklight\Binaries->storeHeaders()
#2 /path/to/misc/update/multiprocessing/.do_not_run/switch.php(117): Blacklight\Binaries->scan()
#3 {main}
A non-numeric value encountered

I added a debug of:

if (!$this->header['Bytes']) {
    print_r($this->header);
}

above https://github.com/NNTmux/newznab-tmux/blob/faeea9c3bf73d1e9bc91ad3d2030d28129d6a14a/Blacklight/Binaries.php#L894 for debugging and got the following output:

...
Array
(
    [Number] => 2564884603
    [Subject] => "FlSVCT3kU9X4mTh.vol229+18.par2" yEnc (516/684)
    [From] => JBinUp <JBinUp@JBinUp.local>
    [Date] => Sun, 06 Jan 2019 09:40:07 -0600
    [Message-ID] => <TfchNDLanU3QuYWhRlju@JBinUp.local>
    [References] => 
    [Bytes] => 
    [Lines] => 3062
    [Xref] => news.usenetserver.com alt.binaries.blu-ray:2564884603
    [matches] => Array
        (
            [0] => "FlSVCT3kU9X4mTh.vol229+18.par2" yEnc (516/684)
            [1] => "FlSVCT3kU9X4mTh.vol229+18.par2" yEnc
            [2] => 516
            [3] => 684
        )

)

I was able to fix the problem by changing https://github.com/NNTmux/newznab-tmux/blob/faeea9c3bf73d1e9bc91ad3d2030d28129d6a14a/Blacklight/Binaries.php#L668 to:

if (! isset($header['Bytes']) || !$header['Bytes']) {
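
A minimal sketch of why the extra check helps, assuming the failing line does arithmetic on `$header['Bytes']` (the thread doesn't show that line). In the dump above, `Bytes` is present but empty, so an `isset()` check alone passes and the empty string reaches the arithmetic. `bytesOrZero` is a hypothetical helper for illustration, not NNTmux code.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not NNTmux code): return the header's byte count as
 * an int, falling back to 0 when 'Bytes' is missing, empty or non-numeric.
 */
function bytesOrZero(array $header): int
{
    if (! isset($header['Bytes']) || ! is_numeric($header['Bytes'])) {
        return 0;
    }

    return (int) $header['Bytes'];
}

// Shaped like the header dumped above: 'Bytes' exists but is an empty string.
$broken = ['Number' => 2564884603, 'Bytes' => '', 'Lines' => 3062];

var_dump(isset($broken['Bytes']));  // the key is set, so isset() alone does not catch it
var_dump(! $broken['Bytes']);       // the added "!$header['Bytes']" check treats '' as missing
var_dump(bytesOrZero($broken));
var_dump(bytesOrZero(['Bytes' => '123456']));
var_dump(bytesOrZero([]));
```

The `is_numeric()` variant is slightly stricter than the one-line fix: it also rejects values such as `'n/a'`, which may or may not ever come back from a provider.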

DrakeJones commented 3 years ago

That worked for me.

DariusIII commented 3 years ago

As soon as I have time I'll incorporate this fix. Thanks.

On Wed, Jul 28, 2021, 16:08 DrakeJones @.***> wrote:

That worked for me.
