FlineDev / CSVImporter

Import CSV files line by line with ease
MIT License

Automatically determine the encoding of the file #29

Open dkalinai opened 6 years ago

dkalinai commented 6 years ago

Hi there, again thanks for making this since it saves tons of time.

Could you point me to the place in the code, or explain, how the importer determines what encoding a file is in when importing? I need to extract this information somehow and am not sure how to do that. Maybe you can give me a hint on where to look. This is not a bug, more a request for information. And is there actually automatic encoding detection, or am I misinterpreting things?

```swift
guard let csv = CSVImporter<[String: String]>(url: fileURL) else { return }

csv.startImportingRecords(structure: { (headerValues) -> Void in
    print(headerValues)
}) { $0 }.onFinish { importedRecord in
    print(importedRecord)
}
```

Jeehut commented 6 years ago

I think what you're looking for is this line. This means that when creating a CSVImporter object, you can pass a parameter named encoding with your file's encoding. By default it's set to .utf8.

dkalinai commented 6 years ago

I see, so there's no way to determine the encoding automatically then. Tough :( There is a method on NSString that computes it from an NSData object. I will try to look into that on my own then. Thanks for having a look.
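
For readers following along, the NSString method mentioned here is most likely `stringEncoding(for:encodingOptions:convertedString:usedLossyConversion:)`. The sketch below wraps it in a hypothetical helper, `guessEncoding(of:)`, which is not part of CSVImporter. It assumes an Apple platform, the result is a heuristic guess rather than a guarantee, and the whole file content has to be in memory.

```swift
import Foundation

/// Hypothetical helper (not part of CSVImporter): asks Foundation to guess
/// the encoding of `data` and returns the guess plus the decoded text.
func guessEncoding(of data: Data) -> (encoding: String.Encoding, text: String)? {
    var converted: NSString?
    var usedLossyConversion: ObjCBool = false

    // Returns the raw NSStringEncoding value, or 0 if no encoding could be determined.
    let rawValue = NSString.stringEncoding(
        for: data,
        encodingOptions: nil,
        convertedString: &converted,
        usedLossyConversion: &usedLossyConversion
    )

    guard rawValue != 0, let text = converted else { return nil }
    return (String.Encoding(rawValue: rawValue), text as String)
}
```

A caller would typically load the file with `Data(contentsOf:)` and pass the result in, which is exactly why this approach reads the entire file at once.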

Jeehut commented 6 years ago

As you can see here we already have logic in place which will automatically determine the type of line ending of the file when .unknown is specified by the user. I'm not against using the exact same strategy for encoding as well. This could be done e.g. by making encoding an optional parameter on the init method, and if it is nil we could use the FileSource to determine the encoding.

Would you be up for adding this feature yourself and sending a PR with tests and docs updated? If there's a method on NSString/NSData which can handle that, then it should be pretty straightforward to implement, since the same logic is already in place for line endings. That would be really awesome. I'm reopening this issue and renaming it to describe this feature.
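
As a rough illustration of the "optional encoding, detect when nil" idea, here is a minimal sketch that only inspects the first four bytes of the file for a byte order mark (BOM) and otherwise falls back to UTF-8. The names `sniffEncoding(prefix:fallback:)` and `resolveEncoding(_:fileAt:)` are hypothetical, this is not how CSVImporter or its FileSource is implemented, and BOM sniffing is a simpler technique than the heuristic NSString method: files without a BOM are not detected at all.

```swift
import Foundation

/// Hypothetical helper: maps a byte order mark at the start of `bytes`
/// to an encoding, or returns `fallback` when no BOM is present.
func sniffEncoding(prefix bytes: [UInt8], fallback: String.Encoding = .utf8) -> String.Encoding {
    // Check the 4-byte BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
    if bytes.starts(with: [0x00, 0x00, 0xFE, 0xFF]) { return .utf32BigEndian }
    if bytes.starts(with: [0xFF, 0xFE, 0x00, 0x00]) { return .utf32LittleEndian }
    if bytes.starts(with: [0xEF, 0xBB, 0xBF]) { return .utf8 }
    if bytes.starts(with: [0xFE, 0xFF]) { return .utf16BigEndian }
    if bytes.starts(with: [0xFF, 0xFE]) { return .utf16LittleEndian }
    return fallback
}

/// Hypothetical helper: an explicitly passed encoding always wins;
/// `nil` means "detect it from the first bytes of the file".
func resolveEncoding(_ explicit: String.Encoding?, fileAt url: URL) throws -> String.Encoding {
    if let explicit = explicit { return explicit }

    let handle = try FileHandle(forReadingFrom: url)
    defer { handle.closeFile() }

    // Only the first four bytes are read, so large files stay cheap to inspect.
    let head = handle.readData(ofLength: 4)
    return sniffEncoding(prefix: [UInt8](head))
}
```

Keeping the explicit parameter as an override means existing callers that pass an encoding would see no change in behavior.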

dkalinai commented 6 years ago

I can make a PR, probably in the next few days. By the way, I have already found a solution to this and made a simple String extension that returns the encoding to me in String.Encoding format.

The only other issue, and a bit off topic here, is the delimiters (it can sometimes be ; as well) and whether one can process a string from memory as a CSV file. The NSString method I am referring to not only guesses the encoding but also returns the decoded string, which would potentially need to be handled by the importer on the fly rather than read from a file.
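
To make the delimiter question concrete, one simple approach is to count how often each candidate delimiter appears in the header line, ignoring anything inside double quotes. The helper below, `guessDelimiter(inHeaderLine:candidates:)`, is a hypothetical sketch and not something CSVImporter provides; it assumes the header line has already been decoded into a String.

```swift
import Foundation

/// Hypothetical helper (not part of CSVImporter): returns the candidate
/// that occurs most often outside of double quotes, or nil if none occurs.
func guessDelimiter(inHeaderLine line: String,
                    candidates: [Character] = [",", ";", "\t"]) -> Character? {
    var counts: [Character: Int] = [:]
    var insideQuotes = false

    for character in line {
        if character == "\"" {
            // An escaped quote ("") toggles twice, so the state stays correct.
            insideQuotes.toggle()
        } else if !insideQuotes, candidates.contains(character) {
            counts[character, default: 0] += 1
        }
    }

    // Walk the candidates in order so that ties resolve to the earlier one.
    var best: Character?
    var bestCount = 0
    for candidate in candidates {
        let count = counts[candidate] ?? 0
        if count > bestCount {
            best = candidate
            bestCount = count
        }
    }
    return best
}
```

The guessed character could then be converted to a String and passed on as the importer's delimiter.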

Jeehut commented 6 years ago

Sounds good. Note that one of the advantages of CSVImporter is that it's able to read big files faster and more safely since it doesn't read the entire file at once, which your solution probably does. So that's another plus for implementing this in CSVImporter.

I don't really understand your other problems though. You probably would need to post some code so I can understand. Note though, that if it's a different problem than this one, it's probably better you open another issue for each problem.

gaming-hacker commented 6 years ago

How does CSVImporter handle garbage? I have a specific data structure, but it can be corrupted, or fields can be missing or added, so I need to add some regex.

Jeehut commented 6 years ago

CSVImporter generally expects a valid CSV file according to RFC 4180 which specifies:

> Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.

When a line for example doesn't have the same number of fields, then – at the moment – the entire line is simply ignored. That's not required by the RFC though (that's why it's a "should" not a "must"), so we could implement multiple different fallback strategies and let the user choose between them.
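
A minimal sketch of what such user-selectable fallback strategies could look like is shown below. The enum `FieldCountMismatchStrategy` and the function `reconcile(_:expectedCount:strategy:)` are hypothetical names for illustration only and do not exist in CSVImporter; `.skipLine` mirrors the current behavior described above, while `.padOrTruncate` is one possible opt-in alternative.

```swift
import Foundation

/// Hypothetical opt-in strategies for lines whose field count
/// differs from the header's field count.
enum FieldCountMismatchStrategy {
    case skipLine       // drop the line entirely (the current behavior)
    case padOrTruncate  // fill missing fields with "", cut off extra fields
}

/// Hypothetical helper: returns the fields to import for one line,
/// or nil if the line should be ignored.
func reconcile(_ fields: [String],
               expectedCount: Int,
               strategy: FieldCountMismatchStrategy) -> [String]? {
    // Well-formed lines pass through untouched regardless of strategy.
    guard fields.count != expectedCount else { return fields }

    switch strategy {
    case .skipLine:
        return nil
    case .padOrTruncate:
        if fields.count > expectedCount {
            return Array(fields.prefix(expectedCount))
        }
        let missing = expectedCount - fields.count
        return fields + Array(repeating: "", count: missing)
    }
}
```

Making `.skipLine` the default would keep the importer's existing behavior for anyone who does not opt in.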

Can you give examples of lines and how they are "corrupted"? Depending on the case, I'm perfectly okay with a little more accommodating behavior, so long as it doesn't conflict with the RFC.

Feel free to post a PR with the changes you need and I'll have a look. As long as it is an opt-in feature, is documented (in the README) and is covered by tests (your corrupted file), I'm happy to merge it!