SciRuby / daru-io

daru-io is a plugin gem for the existing daru gem that adds support for importing DataFrames from, and exporting DataFrames to, multiple formats.
http://www.rubydoc.info/github/athityakumar/daru-io/master/
MIT License

Old text format importer #62

Open zverok opened 6 years ago

zverok commented 6 years ago

I am not sure what this format is properly called (investigate?), but it is pretty common for scientific and international standardization data. Example (the official Unicode tables and the official timezone tables are also published in this format):

# Note: characters with PROSGEGRAMMENI are actually titlecase, not uppercase!

1F80; 1F80; 1F88; 1F08 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
1F81; 1F81; 1F89; 1F09 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
1F82; 1F82; 1F8A; 1F0A 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
1F83; 1F83; 1F8B; 1F0B 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
1F84; 1F84; 1F8C; 1F0C 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
1F85; 1F85; 1F8D; 1F0D 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI

I.e., it is a bit like CSV with a ; separator, but:

It would be a nice showcase to have such "standard" data parsed out of the box.

athityakumar commented 6 years ago

I initially thought that simply using the CSV Importer with the col_sep: '; ' option would work. But that Importer won't be able to ignore empty lines. After looking at one of the Unicode tables, I think we'd also need this Importer to support something like :start_row and :end_row (rather than :skiprows) to crop the data more precisely.

zverok commented 6 years ago
  1. This is NOT a job for the :csv importer, because this format is not valid CSV.
  2. It does NOT need a "skiprows" option; it needs to ignore comments (comments can appear between data lines, not only at the beginning of the file, and also at the end of a data line).

I believe the :plaintext importer was initially meant to be the handler for this format, but was never finished. So let's probably enhance it?
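
To make the comment-handling rules above concrete, here is a minimal plain-Ruby sketch. It is not daru-io code: `parse_semicolon_table` and its `separator:` and `comment:` keywords are hypothetical names chosen for illustration. It assumes that `#` always starts a comment (so a field containing `#` would be truncated) and that fields are separated by `;` with optional surrounding whitespace.

```ruby
# Hypothetical helper, NOT part of daru-io: turns semicolon-separated text
# with '#' comments into an array of rows (each row an array of strings).
def parse_semicolon_table(text, separator: ';', comment: '#')
  text.each_line.filter_map do |line|
    # Drop everything from the comment marker onwards; this covers both
    # whole-line comments and comments trailing a data line.
    data = line.split(comment, 2).first.to_s.strip

    # Blank lines and comment-only lines are now empty, so skip them.
    next if data.empty?

    # String#split drops trailing empty fields by default, so the
    # terminating ';' does not produce an extra empty column.
    data.split(separator).map(&:strip)
  end
end

sample = <<~TEXT
  # Note: characters with PROSGEGRAMMENI are actually titlecase, not uppercase!

  1F80; 1F80; 1F88; 1F08 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
  1F81; 1F81; 1F89; 1F09 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
TEXT

rows = parse_semicolon_table(sample)
rows.first # => ["1F80", "1F80", "1F88", "1F08 0399"]
```

The resulting array of rows could then presumably be turned into a DataFrame (for example via `Daru::DataFrame.rows`); how this would fit into the :plaintext importer's options is left open here.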