BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, csv validation, automatic column and datatype detecting, fixed width datasets, change datetime format, decimal separator, sort data, count unique values, convert to xml, json, sql etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0

Autodetection of columns not working for a more-than-one-row header #46

Closed Friedi closed 1 year ago

Friedi commented 1 year ago

Hi,

I miss a setting where the header can be more complex than just one row. Many data logs have a header with additional information spread over more than one line, followed by the actual header row with the column names. I have three ideas for how to implement it:

  1. Sometimes the header lines can be recognised by their first character (example: ~).
  2. If the header cannot be determined automatically, there could be a setting that specifies the row containing the column names (in this example line 13, with the data starting on line 14).
  3. Another way could be to start the autodetection not at the beginning of the file but from the last few non-empty lines at the end of the file.

Example file (I hope the data is not altered by the editor; the data here is TAB delimited):

~Resultfile from Basytec Battery Test System
~Date and Time of Data Converting: 10.11.2022 12:59:26
~
~Name of Test: Test Battery xyz
~Battery: LI-123_yx
~Testplan: LI-123_yx-Test.pln
~Testchannel: 1054 CH11 CTS
~Start of Test: 10.11.2022 10:38:59
~End of Test: 10.11.2022 12:52:38
~Operator (Test): justme
~Operator (Data converting): justme
~
~Time[h] DataSet t-Step[h] t-Set[h] Line Command U[V] I[A] Ah[Ah] Ah-Charge Ah-Discharge Ah-Step Wh[Wh] T1[°C] R-AC R-DC Climate-T Cyc-Count Count State
0 1 0 0 2 Pause 4.14033592686097 0 0 0 0 0 0 42.96556 0 0 43 1 1 3
5.5E-7 2 5.5E-7 5.56111111111111E-7 2 Pause 4.14033592686097 0 0 0 0 0 0 42.96556 0 0 43 1 1 0
0.0002777775 3 0.0002777775 0.0002777775 2 Pause 4.14033592686097 0 0 0 0 0 0 42.96907 0 0 43 1 1 2

BdR76 commented 1 year ago

Thanks for posting the issue with the detailed description. I've personally never come across a CSV file with comments at the start, but this is an interesting request.

As far as comments in CSV files go, the consensus seems to be to use a # character to indicate comment lines. But I guess the specific character could be set in the plug-in settings. Btw, the line starting with ~Time[h] should probably not include the ~ comment character, as the header line is usually also part of the data.

When the comment character is known (through settings or otherwise), the plugin could detect comment lines, skip the first X lines and also set some kind of skip-lines setting in the schema.ini metadata for that CSV file. So in your case, for example, SkipLines=12. SkipLines is not formally part of the schema.ini standard, but I guess the plugin is free to deviate from it a bit.
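The detection step described above could look roughly like this. It is only a Python sketch to illustrate the idea, not the plugin's code; the function name `count_leading_comment_lines` and its parameters are made up for this example.

```python
from typing import Iterable


def count_leading_comment_lines(lines: Iterable[str], comment_char: str = "#") -> int:
    """Count the consecutive lines at the start of a file that begin with comment_char.

    Hypothetical helper: the result is a candidate value for a skip-lines setting.
    """
    count = 0
    for line in lines:
        # Stop at the first line that does not start with the comment character.
        if not line.startswith(comment_char):
            break
        count += 1
    return count
```

For the example file in this issue, calling it with `comment_char="~"` would count 13 lines, because the column-name row also starts with ~. That is one more than the SkipLines=12 mentioned above, which is why the leading ~ on the header row is a problem.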

I'll look into this when I have the time.

Friedi commented 1 year ago

> As far as comments in CSV files go, the consensus seems to be to use a # character to indicate comment lines. But I guess the specific character could be set in the plug-in settings. Btw, the line starting with ~Time[h] should probably not include the ~ comment character, as the header line is usually also part of the data.

Yes, there are different chars for different data loggers (I come from the engineering side, where they seem to be quite flexible with standards ;-) ). Some do not even have any comment identifiers, like the Keysight (formerly Agilent) data logger. You are right about ~Time[h]. I also would have expected the row with the column names to come without ~, but this is how the data log was exported. This makes it more difficult, but it is equivalent to the case with no leading comment char (which is quite common, at least for my logfiles).

You could start counting the columns (for example by counting the column separators) from the end of the file, skipping empty lines. You can then find the row with the column names by looking for the same number of separators, and from that you know where the data starts. (There is one flaw with that: it fails if the columns are not all filled AND the empty "cells" do not come with a separator.)
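A rough Python sketch of this idea, just to make it concrete. The name `find_header_row` is made up for this example, and it assumes the last non-empty line is a regular data row and that every data row has the same number of separators.

```python
from typing import List, Optional


def find_header_row(lines: List[str], sep: str = "\t") -> Optional[int]:
    """Return the 0-based index of the presumed column-name row, or None.

    Hypothetical helper: works backwards from the end of the file.
    """
    # Indices of non-empty lines, so blank lines are skipped.
    non_empty = [i for i, line in enumerate(lines) if line.strip()]
    if not non_empty:
        return None

    # Assumption: the last non-empty line is a regular data row.
    expected = lines[non_empty[-1]].count(sep)
    if expected == 0:
        return None  # a single column gives nothing to count

    # Walk upwards while the separator count stays the same.
    header = non_empty[-1]
    for i in reversed(non_empty):
        if lines[i].count(sep) != expected:
            break
        header = i
    return header
```

On my example file (with real tabs) the data rows and the ~Time[h] row each contain 19 separators and the lines above contain none, so this should return index 12, i.e. line 13, with the data starting on line 14. It stops too early if a data row is missing separators, which is the flaw mentioned above.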

Thanks for taking a look at it, when you have time.

Btw: the feature to reformat the CSV into a fixed-width column representation is gold. Together with the coloring it is awesome.

BdR76 commented 1 year ago

I've added support for comment character and skip lines. This will be available in the next release.

I've built the new DLLs as a sort of beta version v0.4.6.3β, see here, so you can preview this feature. Let me know if this works for your data files with comments.

BdR76 commented 1 year ago

I've also posted this as a suggestion on the Microsoft website, for anyone interested please upvote 👍

Text driver csv, add a SkipLines feature to schema.ini to skip comment lines

Friedi commented 1 year ago

> I've added support for comment character and skip lines. This will be available in the next release.

> I've built the new DLLs as a sort of beta version v0.4.6.3β, see here, so you can preview this feature. Let me know if this works for your data files with comments.

Happy New Year! Thank you for this present. I have tested v0.4.6.3β and it works with the data I tested. #45 also no longer produces an error. Reformatting works fine. 🥇

BdR76 commented 1 year ago

This is fixed in the latest release of the plugin v0.4.6.3, and it will be automatically available in the next update of Notepad++.

Currently the plug-in only supports comment lines at the start of the file. Using an explicit comment character to distinguish comment lines throughout the data is addressed in issue #48.