DandelionSprout / adfilt

The place where I, DandelionSprout, store my web filter lists for countless topics, including my Nordic adblock list. As simple as that, really.
Other
1.3k stars 143 forks source link

Mixed newlines #819

Closed scripthunter7 closed 1 year ago

scripthunter7 commented 1 year ago

It seems that many filter lists have inconsistent (mixed) newlines, which may be a problem, as many editors don't handle this well. For example, VSCode will consolidate newlines on save (i.e. only one kind of newline will be used after save). If the contributor is not careful enough, his commit will contain significantly more changes than originally planned, since in this case the newlines are also replaced in lines where the contributor didn't make any changes originally.

If possible, please consider standardising the newlines.

Example:

https://github.com/DandelionSprout/adfilt/blob/master/NorwegianExperimentalList%20alternate%20versions/NordicFiltersABP-Inclusion.txt

image

DandelionSprout commented 1 year ago

There has been a more or less known problem for some years that following the list conversion instructions at https://github.com/DandelionSprout/adfilt/tree/master/NorwegianExperimentalList%20alternate%20versions#how-to-run-the-list-generation-script give dramatically different newline handling depending on whether the instructions were run with Cygwin or PowerShell, with Cygwin being recommended specifically due to the "more changes than originally planned" problem.

Since both PowerShell and VSCode are Microsoft products and stick to Microsoft's newlines, while most other tools seem to stick to Unix newlines (e.g. Cygwin, Sublime Text 4), I don't think this'll be nearly as easy to fix as it may seem like.

hagezi commented 1 year ago

You can add the following to .gitattributes:

https://github.com/hagezi/dns-blocklists/blob/main/.gitattributes

krystian3w commented 1 year ago

# Auto detect text files and perform LF normalization
text=auto eol=lf

This protect files before these malfromation after edit only 1 line in simple web CodeMirror editor:

https://github.com/FiltersHeroes/KAD/commit/0568c165bf9971b86f9987264f4c30223a65a59a https://github.com/FiltersHeroes/KAD/commit/0568c165bf9971b86f9987264f4c30223a65a59a.patch

?

I don't know if MS is already working on fixes to make editing huge files work properly (A few months ago, it was not achievable to edit when the file had about 25,000 lines) in this simple editor (they certainly fixed the speed and I can already safely use regexes for searching). So if that doesn't fix it MS globally or the declaration in gitattributes then I'm left using github.dev or a local editor and uploading changes to the server.

DandelionSprout commented 1 year ago

@scripthunter7 Can you check if the newlines work better now, after I added Hagezi's recommendations some days ago?

krystian3w commented 1 year ago

0 changes:

obraz

Maybe neededed is rebuild files via global github action.

DandelionSprout commented 1 year ago

Okay, I think I found the problem hiding in plain sight: My converter scripts had been using text += line + '\r\n' for years now.

I'll see if removing /r fixes things. Give me 5min.

kakaiba-talaga commented 1 year ago

@scripthunter7 Can you check if the newlines work better now, after I added Hagezi's recommendations some days ago?

That will only be applied to files that were recently updated and committed.

DandelionSprout commented 1 year ago

I think I've been able to fix the problem fully now. I'm awaiting confirmation of it.

krystian3w commented 1 year ago

updated and committed.

Updated 19 hours ago via upload: https://github.com/DandelionSprout/adfilt/commit/7789eb33684cf62662a3c8a41a5d771625d68d17

So needed is use git in command line?

DandelionSprout commented 1 year ago

For me it shows it was updated 4 minutes ago: https://github.com/DandelionSprout/adfilt/commit/55db7350df42f544f5107ae4f529fc26d8308759

krystian3w commented 1 year ago

Me too, trying enough how we define commited and updated.

As commited is different from via upload then MS should improve the gitattributes system to consider the option to drag and drop a file as "commited".

iam-py-test commented 1 year ago

Drag and drop/file upload just uses the line endings it was uploaded with. You need git commit or GH UI to do that.

scripthunter7 commented 1 year ago

It seems to me that some python scripts still use CRLF in some places:

Tbh, I didn't look deeper into what these scripts do, I just searched for the CRLF pattern.

.gitattributes seems fine to me, however, already existing files should be normalized.

Reference: https://docs.github.com/en/get-started/getting-started-with-git/configuring-git-to-handle-line-endings https://git-scm.com/docs/gitattributes#_end_of_line_conversion

scripthunter7 commented 1 year ago

Currently, these files contains mixed newlines (CRLF and LF at the same time):

Show file list - Other domains versions/Nitter De-Politificator.txt - Other domains versions/FanboyNotifications-LoadableInUBO.txt - Alternate versions Anti-Malware List/AntiMalwareAdGuardHome.txt - Alternate versions Anti-Malware List/AntiMalwareABP.txt - AdGuard Home Compilation List/AdGuardHomeCompilationListIPs.txt

And these files contains CRLF (fully CRLF or mixed):

Show file list - Other domains versions/WikiaPureBrowsingExperienceDomains.txt - Other domains versions/Nitter De-Politificator.txt - Other domains versions/FanboyNotifications-LoadableInUBO.txt - Other domains versions/BrowseWebsitesWithoutLoggingInDomains.txt - Other domains versions/BrowseWebsitesWithoutLoggingInAGH.txt - Other domains versions/AntiStevenUniverseListDomains.txt - Other domains versions/AntiKpopNitter.txt - Other domains versions/AntiHivemindCartoonTrashingListDomains.txt - Other domains versions/AntiF%D1%96%D0%9C%20ListDomains.txt - Other domains versions/AntiAmazonListForTwitchHosts.txt - Other domains versions/AntiAmazonListForTwitchDomains.txt - Alternate versions Anti-Malware List/AntiMalwareHosts.txt - Alternate versions Anti-Malware List/AntiMalwareDomains.txt - Alternate versions Anti-Malware List/AntiMalwareAdGuardHome.txt - Alternate versions Anti-Malware List/AntiMalwareAdGuard.txt - Alternate versions Anti-Malware List/AntiMalwareABP.txt - AdGuard Home Compilation List/AdGuardHomeCompilationListIPs.txt - AdGuard Home Compilation List/AdGuardHomeCompilationList.txt - AdGuard Home Compilation List/AdGuardHomeCompilationList-Notifications.txt
scripthunter7 commented 1 year ago

Well, I converted everything to LF, but it resulted in a PR of ~150k lines: https://github.com/DandelionSprout/adfilt/pull/833 (this not only normalizes mixed newlines, but also makes files in the repo consistently use LF). As far as I can see, the previous commit also had ~25k lines. But for my part, I don't want to leave such a big change in the repo history, so I closed this PR instead

DandelionSprout commented 1 year ago

It seems to me that some python scripts still use CRLF in some places:

That's mostly because I didn't edit those lists' conversion scripts just yet. I'll get it done this weekend.

krystian3w commented 1 year ago

Commit as ghost still exist: https://github.com/DandelionSprout/adfilt/commit/6dc3abac7689002945f357a73dab6c8d9d89d542 perhaps it is possible to generate with this the problems the Energy list had - MS told them to remove the repository history or they won't come back.

scripthunter7 commented 1 year ago

Anyway, I simply converted the files, but I think the change will be huge even if the build scripts are changed

DandelionSprout commented 1 year ago

With the filesize of https://github.com/DandelionSprout/adfilt/blob/master/AdGuard%20Home%20Compilation%20List/AdGuardHomeCompilationList.txt being inexplicably reduced from circa 945 KB to 900 KB almost solely by removing /r, my current hasardous guess is that it will pay off regardless.

the problems the Energy list had - MS told them to remove the repository history or they won't come back.

Energized also had lists with several million entries, and were previously (2018~21) far more popular than their poor maintenance record would've indicated, likely pushing GitHub's dataservers. So that didn't exactly make MS look at them any milder either.

DandelionSprout commented 1 year ago

In so far as I can currently tell, all affected lists (except MetricRecipe.py, whose generated files are never intended for publication) have now been converted to non-/r with no known long-lasting issues resulting form it, effectively fixing this issue report. Comments will remain open.