dgtlmoon / changedetection.io


CSV files and other TXT files charset encoding selection needed (`windows-1251` etc.) #2277

Open alfablend opened 5 months ago

alfablend commented 5 months ago

Version and OS: 0.45.16 on Windows 11 / Docker

Is your feature request related to a problem? Please describe. There is no way to select the charset encoding for CSV files (tables in plain-text format), so the content of these files may be unreadable. As far as I understand, changedetection.io uses the UTF-8 charset for these files. The CSV files that I need to monitor are in the windows-1251 charset.

Describe the solution you'd like I need the option to select the correct charset. My CSV files are encoded in the windows-1251 charset.

Describe the use-case and give concrete real-world examples


A lot of big data is published in CSV format. It is a plain-text format that represents data tables using commas or other delimiter symbols; see https://en.wikipedia.org/wiki/Comma-separated_values for details. As text files, CSVs may be encoded in something other than UTF-8, for example the windows-1251 or koi8-r charset. The CSV files that I try to use with the changedetection app are unreadable due to the absence of charset selection.
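
For illustration, a minimal Python sketch of the underlying problem, assuming a hypothetical file example.csv saved in windows-1251 (not the real data): decoding its bytes as UTF-8 fails, while decoding them with the correct charset yields readable text.

# Sketch only: "example.csv" is a hypothetical windows-1251 file, not the real data.
raw = "Код;Название;Цена\n1;Образец;100\n".encode("windows-1251")
with open("example.csv", "wb") as f:
    f.write(raw)

with open("example.csv", "rb") as f:
    data = f.read()

# Decoding as UTF-8 fails because the Cyrillic bytes are not valid UTF-8 sequences.
try:
    print(data.decode("utf-8"))
except UnicodeDecodeError as e:
    print("utf-8 decode failed:", e)

# Decoding with the correct charset recovers the original text.
print(data.decode("windows-1251"))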

dgtlmoon commented 5 months ago

Any chance you can copy+paste the request headers for the site you are trying? I need more exact info.

Load the URL in Chrome and open the Inspect > Network tab.


alfablend commented 5 months ago

Thanks for your answer!

When I try to open the link to the CSV file in Chrome, it automatically downloads a file with the .csv extension. The Chrome window stays blank.


So the Network tab, as far as I understand, is blank too.


I use the plain-text parser in changedetection.io to work with CSV files; the Chrome mode does not work with these files.

dgtlmoon commented 5 months ago

@alfablend use curl from the command line instead

$ curl --head https://changedetection.io/CHANGELOG.txt
HTTP/2 200 
server: nginx
date: Tue, 26 Mar 2024 15:13:13 GMT
content-type: text/plain
content-length: 86815
last-modified: Tue, 26 Mar 2024 15:01:02 GMT
vary: Accept-Encoding
etag: "6602e32e-1531f"
strict-transport-security: max-age=63072000
accept-ranges: bytes

try that

alfablend commented 5 months ago

Thanks, done (I changed the link in your command to my link first).

(screenshot of the curl --head output)

As I can see, this response reports a UTF-8 charset. But that does not match the encoding of the downloaded CSV file itself, which is windows-1251. Maybe there is a way to force the windows-1251 charset?
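
As an illustration only (this is not a changedetection.io feature discussed in the thread), a mis-reported charset can be overridden on the client side with the Python requests library; the URL below is hypothetical.

import requests

# Hypothetical URL of a CSV whose server reports the wrong charset.
resp = requests.get("https://example.com/data.csv")

print(resp.headers.get("content-type"))  # what the server declares
print(resp.encoding)                     # what requests will use by default
print(resp.apparent_encoding)            # charset guessed from the bytes themselves

# Force the real encoding before reading the decoded text.
resp.encoding = "windows-1251"
print(resp.text[:200])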

dgtlmoon commented 5 months ago

It seems the server is returning the wrong information; your CSV is reported as "text/html".

can you attach the CSV file?

alfablend commented 5 months ago

Thank you for the explanation! That's the file: urvi (1).csv

dgtlmoon commented 5 months ago

# Detect the actual encoding of the attached CSV with chardet
import chardet

def detect_encoding(file_path):
    with open(file_path, 'rb') as f:
        rawdata = f.read()
    result = chardet.detect(rawdata)
    return result

result = detect_encoding("urvi.1.csv")
print("The encoding of the file is:", result['encoding'])
print("Confidence level:", result['confidence'])

$ python3 ./test.py 
The encoding of the file is: windows-1251
Confidence level: 0.9414230748073508

So the file is windows-1251, but the web server is reporting the wrong encoding type.

I'm also not sure if windows-1251 is supported by any of our text difference handlers, more than likely not...

alfablend commented 5 months ago

Thank you! If I understand you right, there is a general problem with non-Unicode (non-Latin) content, and the solution may be a preprocessor (charset converter).

According to Wikipedia, the windows-1251 charset is still "the second most-used single-byte character encoding (or third most-used character encoding overall)". But, of course, that is still a small percentage at the scale of the global internet, and I understand it may not be a priority task.
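
A minimal sketch of the kind of preprocessor / charset converter suggested above, reusing the chardet library from the earlier comment; the function name to_utf8 is made up for illustration and this is not existing changedetection.io code.

import chardet

def to_utf8(raw, declared_encoding=None):
    """Decode raw bytes, preferring the declared charset but falling back
    to chardet detection when the declaration is missing or wrong."""
    if declared_encoding:
        try:
            return raw.decode(declared_encoding)
        except (UnicodeDecodeError, LookupError):
            pass
    detected = chardet.detect(raw)
    return raw.decode(detected["encoding"] or "utf-8", errors="replace")

# With the file from this thread: the server declares UTF-8, the bytes are
# windows-1251, and the fallback still yields readable text.
with open("urvi.1.csv", "rb") as f:
    print(to_utf8(f.read(), declared_encoding="utf-8")[:200])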

dgtlmoon commented 5 months ago

> Thank you! If I understand you right, there is a general problem with non-Unicode (non-Latin) content, and the solution may be a preprocessor (charset converter).

The software already has the chardet detection library installed :) so the first step is to write some tests and understand the relationship between the Windows encoding type and websites that return the wrong MIME type.
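
A rough sketch of what such a test could look like, assuming pytest plus the bundled chardet library; the payload and test name are made up for illustration.

import chardet
import pytest

# Sketch only: the bytes are windows-1251, but the (hypothetical) server
# header claims utf-8, mirroring the situation in this issue.
DECLARED_CHARSET = "utf-8"
PAYLOAD = (
    "дата;наименование;стоимость\n"
    "26.03.2024;образец записи для проверки;100,00\n"
).encode("windows-1251")

def test_declared_charset_fails_but_detection_still_works():
    # Trusting the wrong header raises an error.
    with pytest.raises(UnicodeDecodeError):
        PAYLOAD.decode(DECLARED_CHARSET)

    # chardet still proposes a usable encoding for the same bytes; a real
    # project test on a full-size sample (e.g. urvi.1.csv) would assert
    # that it equals "windows-1251" exactly.
    detected = chardet.detect(PAYLOAD)
    assert detected["encoding"] is not None
    assert PAYLOAD.decode(detected["encoding"])  # decodes without raising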