funilrys closed this issue 5 years ago
Actually, the clean list was generated without any issue ...
It's only about the repetition of Update of info.json, which is for sure not normal.
Thanks for taking the time @funilrys. At least the clean.list is fine, which is what I'm using on my end as a source.
@funilrys the problem is still present. For both anudeepND's and justdomains's lists, hosts/ACTIVE contains more entries than clean.list. If you don't mind, look into it when you have time. Thank you!
@dnmTX just a little update, justdomains is no longer affected :smile_cat:
For anudeepND, it's still a mystery for my mind and eyes but I'll let it run on my machine in order to find out where the issue is. I suspect that it's from the script side but it can also be related to the container itself :thinking: Will find out.
Have a nice day/night.
Cheers, Nissar.
@funilrys thanks for the update. Let's hope that the issue with anudeepND's lists is not complicated, so I don't need to worry about it every time an update is due. It looks like the more lists you add, the more issues start showing up. Maybe it's a good idea to drop some of them (just a thought)?
@dnmTX, should be now fixed. I simplified the script in order to avoid the repetition of the Update info.json commit which caused our issue overall!
Closing, will reopen if the issue is not fixed.
By the way, we do consider each list as a single input source, so it's really not because we add more input sources :wink:
Got it
BIG thank you! I'll still keep an eye on it, and if anything changes I'll let you know 👍
Confirmed!
@funilrys looks like anudeepND's lists got stuck again on filtering. Please check. Thank you.
EDIT: Well, it appears to have started moving again and even finished the filtering (after being stuck for 2 days), but the mentioned inconsistencies between clean.list and hosts/ACTIVE are present again.
@dnmTX I can't explain what's going on ... I did nothing and it's fixed :sob:
@funilrys the INVALID folder needs to be refreshed when filtering is done as well. We are already WORKING on clearing out those invalid entries. I'm personally going by what is in that INVALID folder, and many entries got removed from the original lists but still remain in that folder. This makes it much harder to weed out the rest.
Sorry everyone, I was unaware of this thread. I have some invalid entries in my host file which I was not aware of, I will fix ASAP.
Update: I have removed ~200 domains from the list which are irrelevant or invalid entries. Please ping me if I missed something. Thank you :)
@funilrys the problem is back. There are 3000+ more domains in /ACTIVE/hosts compared to clean.list
OK... I kind of figured out what happened, but not why it's happening (I'll leave that to you so you can trace the origin). All those 3000+ domains in ACTIVE/hosts are DUPLICATES, most likely dumped (merged) in from another list (POTENTIALLY_ACTIVE or SUSPICIOUS or w/e). THIS IS A BUG.
Somewhere, somehow the script is merging two lists for no reason. I guess now at least you know where to start.
What I did was download both ACTIVE/hosts and clean.list and check for duplicates between the two, which didn't produce any results. Then I checked for duplicates within ACTIVE/hosts and what do you know... 3000+ domains popped up. I double-checked several of the resulting domains against the original ACTIVE/hosts and each one had two entries.
Just in case, I'm attaching the list with the duplicate entries (only) so you can compare it if you want:
duplicates.txt
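The within-file duplicate check described above can be sketched in Python. This is only an illustration, not the tool that was actually used: the helper names are hypothetical, and it assumes ACTIVE/hosts lines look like `0.0.0.0 example.com` with `#` comments, which the thread does not confirm.

```python
from collections import Counter


def parse_domains(text):
    """Return the domain on each non-comment line.

    Assumption: the domain is the last whitespace-separated field,
    which covers both "0.0.0.0 example.com" and a bare "example.com".
    """
    domains = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if line:
            domains.append(line.split()[-1].lower())
    return domains


def find_duplicates(domains):
    """Map each domain that occurs more than once to its number of occurrences."""
    counts = Counter(domains)
    return {domain: n for domain, n in counts.items() if n > 1}


# Made-up sample data standing in for a downloaded ACTIVE/hosts file.
sample_hosts = """# header
0.0.0.0 example.com
0.0.0.0 example.org
0.0.0.0 example.com
"""

duplicates = find_duplicates(parse_domains(sample_hosts))
print(duplicates)
```

On a real file, the number of keys in the result is the count of domains that appear more than once.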
@dnmTX Hm, actually we write some extra information into the Analytic directory, so it's normal that they are listed at least in SUSPICIOUS and ACTIVE (inside Analytic): in one case the element is suspicious because it was inactive and is now ACTIVE, and in the other the HTTP status code is 200 (for example) while the rest of the tests tell us something else.
So for both SUSPICIOUS and ACTIVE elements we write to the Analytic directory and to the official output directory.
Also, clean.list is without duplicates because we format the list of active entries without the duplicates ... If you understand Python, the code that generates clean.list is in administration.py. The line which removes all duplicates before writing clean.list is https://github.com/Ultimate-Hosts-Blacklist/repository-structure/blob/master/administration.py#L70
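For readers without the linked file at hand, here is a minimal sketch of what removing duplicates before writing a list can look like. It is not the code from administration.py; the function name is hypothetical, and whether the real script keeps the original order or sorts the entries is not stated in this thread.

```python
def unique_in_order(entries):
    """Drop repeated entries, keeping the first occurrence of each.

    dict preserves insertion order (Python 3.7+), so the relative
    order of the surviving entries is unchanged.
    """
    return list(dict.fromkeys(entries))


# Made-up sample standing in for the list of active domains.
active = ["example.com", "example.org", "example.com"]
clean = unique_in_order(active)
print(clean)
```

This explains why a deduplicated output can be shorter than a raw output built from the same data.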
@funilrys I'm sorry but I have no knowledge of Python.
So you are saying that it's normal for ACTIVE/hosts to have that many duplicates or....?
The reason I came to the conclusion that two lists were merged at some point is that all those duplicates were placed (appended) at the bottom of the list and weren't spread all over. Usually cat does that, always placing the second file at the bottom.
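The "appended at the bottom" observation can be checked mechanically. The sketch below uses hypothetical helper names and takes an already parsed list of domains; it reports where repeated entries sit and whether they form one contiguous block at the end of the list, which is what concatenating a second file would produce.

```python
def repeat_positions(domains):
    """Return the indexes of entries that were already seen earlier in the list."""
    seen = set()
    positions = []
    for index, domain in enumerate(domains):
        if domain in seen:
            positions.append(index)
        else:
            seen.add(domain)
    return positions


def repeats_form_tail(domains):
    """True if every repeated entry sits in one contiguous block at the end."""
    positions = repeat_positions(domains)
    if not positions:
        return False
    tail_start = len(domains) - len(positions)
    return positions == list(range(tail_start, len(domains)))


# Made-up sample: the last two entries repeat earlier ones.
sample = ["a.example", "b.example", "c.example", "a.example", "b.example"]
print(repeat_positions(sample), repeats_form_tail(sample))
```

A `True` result is consistent with a merge or append step, but it does not identify which step caused it.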
@dnmTX output/{domains, hosts, json}/* should not have duplicates. I have the feeling that it is because of the container, not PyFunceble.
I'm saying that because, after leaving the list running on my machine, I could not reproduce what we have in the repository, even when simulating a normal launch of Travis CI.
I'll have to look at all the past Travis CI builds in order to find out whether it is really Travis CI or PyFunceble in some rare cases.
Cheers, Nissar
@funilrys thanks for clearing this up (sorry, sometimes I feel like I'm "THE BAD NEWS GUY"). It looks like no duplicates have been generated since the latest CHANGES, but I'll keep monitoring. If you ask me, focus on anudeepND's lists, as for some reason they are the most troublesome.
As of Jan-13-2019:
clean.list = 23,171 domains
/ACTIVE/hosts = 24,576 domains
...still happening :thinking:
@funilrys I checked on Travis CI and it says that it finished filtering 13 hours ago, but clean.list is not updated and info.json shows that it's still under test.
Can you check please? Thank you.
@dnmTX I'm preparing a maintenance for the coming days.
I have planned many hours in the coming weeks for a review of all input sources (one by one). I'll add this to the things to check for that list.
@funilrys maintenance is needed indeed. It's also the same behavior in other repos: sometimes clean.list, info.json etc. don't get updated until the next filtering cycle is done. Let's hope that you'll be able to trace that issue.
In the past it was important to me that clean.list be updated on time; now I need info.json as well. Due to recent discoveries that I wasn't familiar with, I now also have to extract all those *.doubleclick.net domains, and my script first checks info.json to see whether the list is still under test (the test needs to be finished in order to fetch all the domains, otherwise some are still missing). You get the picture when it shows 1 all the time, right?
"currently_under_test": "1"
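A consumer-side check like the one described above might look as follows. This is a sketch under assumptions, not dnmTX's actual script: the function name is hypothetical, only the `currently_under_test` key and its string value `"1"` come from this thread, and treating a missing key as "not under test" is a choice made here for illustration.

```python
import json


def is_under_test(info_text):
    """Return True if the info.json content says the list is still being tested."""
    info = json.loads(info_text)
    # The thread shows the flag as the string "1". Accepting an integer or
    # boolean as well is an assumption, as is treating a missing key as "0".
    flag = str(info.get("currently_under_test", "0")).strip().lower()
    return flag in {"1", "true"}


# Made-up sample standing in for a downloaded info.json.
sample_info = '{"currently_under_test": "1"}'

if is_under_test(sample_info):
    print("Test still running, skipping fetch for now.")
else:
    print("Test finished, safe to fetch the lists.")
```

If the flag stays at `"1"` after a run has finished, a script like this never fetches, which is the problem reported here.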
Closing.
This is just for tracking. cc @dnmTx
Since last night I have been getting major failure reports by email for that list.
Will take the time later today or this weekend to track the issue with that list.