Ultimate-Hosts-Blacklist / dev-center

The place to talk about our infrastructure or everything related to the Ultimate Hosts Blacklist project.

[Urgent] Issue with anudeepND's list #12

Closed funilrys closed 5 years ago

funilrys commented 5 years ago

This is just for tracking. cc @dnmTx

Since last night I have been getting a major failure report by email for that list.

I will take the time later today or this weekend to track down the issue with that list.

funilrys commented 5 years ago

Actually, the clean list was generated without any issue ...

It's only about the repetition of the "Update of info.json" commit, which is certainly not normal.

dnmTX commented 5 years ago

Thanks for taking the time @funilrys. At least the clean.list is fine, which is what I'm using on my end as a source.

dnmTX commented 5 years ago

@funilrys the problem is still present. For both anudeepND's and justdomains's lists, hosts/ACTIVE contains more entries than clean.list. If you don't mind, look into it when you have time. Thank you!

funilrys commented 5 years ago

@dnmTX just a little update: justdomains is no longer affected :smile_cat:

For anudeepND, it's still a mystery to me, but I'll let it run on my machine to find out where the issue is. I suspect it's on the script side, but it could also be related to the container itself :thinking: Will find out.

Have a nice day/night.

Cheers, Nissar.

dnmTX commented 5 years ago

@funilrys thanks for the update. Let's hope the issue with anudeepND's lists is not complicated, so I don't need to worry about it every time an update is due. It looks like the more lists you add, the more issues show up. Maybe it's a good idea to drop some of them (just a thought)?

funilrys commented 5 years ago

@dnmTX, it should be fixed now. I simplified the script to avoid the repetition of the Update info.json commit, which caused our issue overall!

Closing, will reopen if the issue is not fixed.

funilrys commented 5 years ago

By the way, we treat each list as a single input source, so it's really not because we add more input sources :wink:

dnmTX commented 5 years ago

> By the way, we treat each list as a single input source, so it's really not because we add more input sources 😉

Got it

> @dnmTX, it should be fixed now. I simplified the script to avoid the repetition of the Update info.json commit, which caused our issue overall!

BIG thank you! I'll still keep an eye on it, and if anything changes I'll let you know 👍

funilrys commented 5 years ago

Confirmed!

dnmTX commented 5 years ago

@funilrys it looks like anudeepND's lists got stuck again on filtering. Please check. Thank you.

EDIT: Well, it appears to have started moving again and even finished the filtering (after being stuck for 2 days), but the mentioned inconsistencies between clean.list and hosts/ACTIVE are present again.

funilrys commented 5 years ago

@dnmTX I can't explain what's going on ... I did nothing and it's fixed :sob:

dnmTX commented 5 years ago

@funilrys the INVALID folder needs to be refreshed when filtering is done as well. We are already WORKING on clearing out those invalid entries, and I personally go by what is in that INVALID folder. Many entries got removed (from the original lists) but still remain in that folder, which makes it much harder to weed out the rest.

anudeepND commented 5 years ago

Sorry everyone, I was unaware of this thread. I have some invalid entries in my hosts file that I was not aware of; I will fix them ASAP.

anudeepND commented 5 years ago

Update: I have removed about 200 domains from the list that were irrelevant or invalid entries. Please ping me if I missed something. Thank you :)

dnmTX commented 5 years ago

@funilrys the problem is back. There are 3000+ more domains in /ACTIVE/hosts compared to clean.list.

dnmTX commented 5 years ago

OK... I kind of figured out what happened, but not why it's happening (I'll leave that to you so you can trace the origin). All those 3000+ domains in ACTIVE/hosts are DUPLICATES, most likely dumped (merged) from another list (POTENTIALLY_ACTIVE or SUSPICIOUS or w/e). THIS IS A BUG. Somewhere, somehow, the script is merging two lists for no reason. I guess now at least you know where to start.

What I did: I downloaded both ACTIVE/hosts and clean.list and checked for duplicates between the two, which didn't produce any results. Then I checked for duplicates within ACTIVE/hosts and what do you know... 3000+ domains popped up. I double-checked several of the resulting domains against the original ACTIVE/hosts, and each one had two entries.

Just in case, I'm attaching the list with the duplicate entries (only) so you can compare it if you want: duplicates.txt
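The duplicate check described in this comment can be sketched in Python. This is only a minimal sketch under stated assumptions: the function names are hypothetical, and it assumes each line is either a bare domain or a hosts-style entry such as `0.0.0.0 example.com`, with the domain as the last whitespace-separated token. The sample data is made up for illustration.

```python
from collections import Counter


def parse_domains(lines):
    """Return the domain from each non-empty, non-comment line.

    Assumption: the domain is the last whitespace-separated token,
    which covers both bare domains and "0.0.0.0 example.com" lines.
    """
    domains = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if line:
            domains.append(line.split()[-1].lower())
    return domains


def find_duplicates(domains):
    """Return the sorted list of domains that occur more than once."""
    counts = Counter(domains)
    return sorted(d for d, n in counts.items() if n > 1)


def missing_from(reference, other):
    """Return domains present in `other` but absent from `reference`."""
    return sorted(set(other) - set(reference))


# Made-up sample standing in for ACTIVE/hosts and clean.list.
sample_active = ["0.0.0.0 a.example", "0.0.0.0 b.example", "0.0.0.0 a.example"]
sample_clean = ["a.example", "b.example"]

active = parse_domains(sample_active)
clean = parse_domains(sample_clean)

print(find_duplicates(active))         # ['a.example']
print(len(active) - len(set(active)))  # 1 extra entry caused by duplicates
print(missing_from(clean, active))     # []
```

If the extra entries in the hosts file are all repeats, the count of surplus lines should match the size gap between the two files, and `missing_from` should come back empty.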

funilrys commented 5 years ago

@dnmTX Hm. Actually, we write some extra information into the Analytic directory, so it's normal for elements to be listed in at least both SUSPICIOUS and ACTIVE (inside Analytic). In one case they are suspicious because the element was inactive and is now ACTIVE; in the other, the HTTP status code is 200 (for example) while the rest of the tests tell us something else.

So for both SUSPICIOUS and ACTIVE elements, we write to the Analytic directory and to the official output directory.

Also, clean.list has no duplicates because we strip them when formatting the list of active domains ... If you understand Python, here is the code that generates clean.list:

https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/administration.py#L50-L74

The line that removes all duplicates before writing clean.list is https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/administration.py#L70
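For readers who don't want to open the repository, the idea of stripping duplicates before writing the clean list can be illustrated with a short Python sketch. This is not the code from administration.py: the function names are hypothetical, and keeping the first occurrence of each domain in its original order is an assumption made for the example.

```python
def dedupe_preserving_order(domains):
    """Drop repeated entries while keeping the first occurrence of each."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(domains))


def format_clean_list(active_domains):
    """Build the text of a clean list: one unique domain per line."""
    stripped = (d.strip() for d in active_domains)
    unique = dedupe_preserving_order(d for d in stripped if d)
    return ("\n".join(unique) + "\n") if unique else ""


# Made-up input containing a repeated entry and a blank line.
print(format_clean_list(["a.example", "b.example", "a.example", " "]), end="")
```

Because the deduplication happens only at this formatting step, a hosts file written elsewhere can still contain repeats even when the clean list does not.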

dnmTX commented 5 years ago

@funilrys I'm sorry, but I have no knowledge of Python. So are you saying that it's normal for ACTIVE/hosts to have that many duplicates, or...? The reason I came to the conclusion that two lists were merged at some point is that all those duplicates were placed (appended) at the bottom of the list and weren't spread all over. Usually cat does that, always placing the appended content at the bottom.

funilrys commented 5 years ago

@dnmTX output/{domains, hosts, json}/* should not have duplicates. I have the feeling that it is because of the container, not PyFunceble.

I'm saying that because, after leaving the list running on my machine, I could not reproduce what we have in the repository, even when simulating a normal Travis CI launch.

I'll have to look at all the past Travis CI builds to find out whether it is really Travis CI, or PyFunceble in some rare cases.

Cheers, Nissar

dnmTX commented 5 years ago

@funilrys thanks for clearing this up (sorry, sometimes I feel like I'm "THE BAD NEWS GUY"). It looks like no duplicates have been generated after the latest CHANGES, but I'll keep monitoring. If you ask me, focus on anudeepND's lists, as they are for some reason the most troublesome.

dnmTX commented 5 years ago

As of Jan-13-2019: clean.list = 23,171 domains, /ACTIVE/hosts = 24,576 domains ... still happening :thinking:

dnmTX commented 5 years ago

@funilrys I checked on Travis CI and it says that it finished filtering 13 hours ago, but clean.list is not updated and info.json shows that it's still under test. Can you check, please? Thank you.

funilrys commented 5 years ago

@dnmTX I'm preparing maintenance for the coming days.

I have planned many hours in the coming weeks for a review of all input sources (one by one). I will add this to the things to check for that list.

dnmTX commented 5 years ago

@funilrys maintenance is needed indeed. It's the same behavior in other repos too: sometimes clean.list, info.json, etc. don't get updated until the next filtering cycle is done. Let's hope that you'll be able to trace that issue. In the past it was important to me that clean.list be updated on time; now I need info.json too. Due to recent discoveries I wasn't familiar with, I now have to additionally extract all those *.doubleclick.net domains, and my script first checks info.json to see whether the list is still under test (the test needs to be finished in order to fetch all the domains, otherwise some are still missing). You get the picture when it shows 1 all the time, right? "currently_under_test": "1"
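The kind of check described in this comment might look like the following Python sketch. It is an illustration only, not dnmTX's actual script: the function names are hypothetical, the `currently_under_test` key and its `"1"` value are taken from the comment, and treating any other value (or a missing key) as "finished" is an assumption.

```python
import json


def is_under_test(info_json_text):
    """Return True if info.json reports that the list is still being tested.

    Assumption: any value other than "1" (or a true boolean), or a missing
    key, means the test has finished.
    """
    info = json.loads(info_json_text)
    value = str(info.get("currently_under_test", "0")).strip()
    return value in {"1", "True", "true"}


def subdomains_of(domains, parent="doubleclick.net"):
    """Return entries that are `parent` itself or one of its subdomains."""
    suffix = "." + parent
    return [d for d in domains if d == parent or d.endswith(suffix)]


# Made-up stand-ins for the downloaded info.json and clean.list contents.
info_text = '{"currently_under_test": "1"}'
domains = ["ad.doubleclick.net", "notdoubleclick.net", "doubleclick.net"]

if is_under_test(info_text):
    print("Still under test: skipping this update cycle.")
else:
    print(subdomains_of(domains))
```

With a flag that stays at "1" after filtering has finished, a script like this would skip every cycle, which is the problem being reported.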

funilrys commented 5 years ago

Closing.