Webdevdata / webdevdata.org


Normalize character encodings #24

Open ernesto-jimenez opened 10 years ago

ernesto-jimenez commented 10 years ago

Currently the files are in different character encodings, which makes it tricky to do some stats.

e.g.: I'm currently generating CSVs by parsing the documents, and the resulting CSVs have a mix of character encodings, which some tools such as csvkit don't accept.

It would be nice to normalize the output to UTF-8 so that consumers of the data don't need to do it themselves.

yoavweiss commented 10 years ago

I don't think we should, since we would lose some of the data in the process (e.g. charsets that are not explicitly declared would not be found when running "how many pages have this charset" queries).

ernesto-jimenez commented 10 years ago

If we don't normalise, I would at least add a quick shortcut to check what the charset is, so that consumers don't need to parse headers and tags to get the encoding.

something like having example.com_hash.charset alongside example.com_hash.html.txt

That way consumers would be able to normalize the data themselves by just using: iconv -f "$(cat example.com_hash.charset)" -t UTF-8 example.com_hash.html.txt.

Otherwise each consumer is going to have to implement the encoding-normalization logic themselves, when we could do it once and let consumers leverage it.
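
For illustration, a minimal sketch of what a consumer-side pass over a whole bundle could look like, assuming the proposed .charset sidecar sits next to each .html.txt file (the exact file-naming scheme here is hypothetical, not something the project has settled on):

    # Sketch only: batch-normalise every page to UTF-8 using its sidecar hint.
    # Assumes one foo.charset file alongside each foo.html.txt file.
    for f in *.html.txt; do
      charset=$(cat "${f%.html.txt}.charset")
      iconv -f "$charset" -t UTF-8 "$f" > "${f%.txt}.utf8.txt"
    done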

yoavweiss commented 10 years ago

I'm cool with adding post-processing tools that do that, but it shouldn't be part of the fetch process IMO.

With that said, the bundle's format may shift soon, so it might be better to wait with such work.

ernesto-jimenez commented 10 years ago

Sure, I meant doing it on post-processing, not doing it during the fetch process.

To recap, the two options I see are:

  1. bundle the HTML normalised to UTF-8, with a hint on what the original encoding was (original proposal + hint): processing would then be simpler, and with the hint on the original encoding you could still run stats on websites with a given encoding.
  2. bundle the HTML in its original encoding, with a hint on what that encoding is (my second comment): when processing, consumers would need to do the encoding normalisation themselves with iconv based on the provided hint.

I would rather go for option 1, since running stats based on the encoding is the 20% case. That way we would simplify the other 80% of the processing effort.

Another middle-ground option would be to deliver option 2 plus a script in webdevdatatools.org to normalize the data.
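
Roughly, such a post-processing script could look like the sketch below. The detection step here uses file -b --mime-encoding purely as an assumed heuristic; it is not whatever mechanism webdevdata would actually use, and it can misdetect pages (or emit names like "unknown-8bit" that iconv rejects):

    # Sketch of a post-processing pass (option 1): detect each file's
    # encoding, keep the hint for stats, and rewrite the file as UTF-8.
    # `file -b --mime-encoding` is only a heuristic and can misdetect.
    for f in *.html.txt; do
      enc=$(file -b --mime-encoding "$f")
      echo "$enc" > "${f%.html.txt}.charset"   # preserve the original-encoding hint
      iconv -f "$enc" -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
    done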

In any case, I'm OK with waiting for the new bundle format.