biolab / orange3

🍊 Orange: Interactive data analysis
https://orangedatamining.com

CSV file autoformat error #6052

Closed hydrastarmaster closed 2 years ago

hydrastarmaster commented 2 years ago

In the "CSV File Import" widget, automatic type detection seems to truncate numbers: the largest exponent shown is E+16. In this example it is also strange that the column "Feat2" is shown as a descriptive (meta) column after the "Group by" step. Inserting a "Select Columns" widget in between does not help.

The files to reproduce the error are attached: Orange - Check 6052.zip

janezd commented 2 years ago

Which feature and which row? As far as I checked, the data is read correctly.

"Group by" puts the grouping variables among meta attributes to distinguish them from computed statistics (also: see documentation). You can use Select Columns if you for some reason wish to put them back.

hydrastarmaster commented 2 years ago

Windows 10 current release, Orange 3.32.0. Here are some screenshots to show you how I see data (following the files I've attached in the first post).

1) RE: Autoformat

Original data (CSV as viewed in Excel; in this case with "," as the decimal separator): image

Apply this: image

Inside the widget (and in the following data table) there is: image

As you can see (e.g. row 4, col 3), the exponent is truncated: the largest exponent shown in the output is E+16. This happens with automatic type assignment, whereas it works well when the numeric type is assigned manually.

2) RE: "Group by" meta anomaly

Having this: image

Input table: image

Grouped this way: image

Output table: image

"Feat2" should not (also) be a meta feature (that is the 2nd column; the 3rd column is the legitimate one).

RE: https://orange3.readthedocs.io/projects/orange-visual-programming/en/latest/widgets/data/groupby.html

The example shows an "Iris" feature with no aggregations applied (it is the grouping field, obviously). But in the "Group by" widget, unchecking "Concatenate" makes "Feat1" disappear. I suppose the difference is that "Iris" is defined as Class, while "Feat1" is automatically recognized as a text feature (pretend it contains license plate numbers, so more distinct values than the maximum for automatic recognition as a class), and the widget treats them differently (maybe in an unexpected way). A "Select Columns" widget applied before "Group by" does not change anything.

janezd commented 2 years ago

This is the content of your file:

Screen Shot 2022-07-05 at 23 06 16

The value in question is 7.24646E+14 (not ...E+19), and Orange reads it correctly.
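
For illustration only, here is a minimal Python sketch of how the same text can turn into a different exponent in another viewer. It assumes a hypothetical reader that treats "," as the decimal separator and silently drops "." as a thousands separator; the thread does not confirm that Excel (or anything else) behaved exactly this way, and `parse_number` is a made-up helper, not part of Orange.

```python
def parse_number(text, decimal_sep="."):
    """Parse a number, treating the other separator as a thousands mark.

    Hypothetical helper for illustration; not Orange's or Excel's parser.
    """
    thousands_sep = "," if decimal_sep == "." else "."
    # Drop thousands marks, then normalize the decimal mark to "."
    cleaned = text.replace(thousands_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# The value as written in the file, read with "." as the decimal separator.
as_dot = parse_number("7.24646E+14", decimal_sep=".")

# The same text read under the assumed "," decimal convention:
# the "." is discarded, so the digits become 724646E+14.
as_comma = parse_number("7.24646E+14", decimal_sep=",")

print(as_dot)
print(as_comma)
```

Under this assumption the two readings differ by a factor of 10^5, which is the kind of discrepancy a decimal-separator mismatch can produce.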

In the Group By widget, you select which aggregations to compute for each attribute. You can choose multiple functions for a single attribute, or none. If you select multiple (e.g. first value, last value and random value), the attribute will appear multiple times. If you select none, it won't appear.
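
The rule described above can be sketched in plain Python. This is not Orange's implementation, only a model of the behavior: `group_by`, `AGGREGATIONS`, the output column naming, the column "Feat3" and all data values are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical aggregation functions, named for illustration only.
AGGREGATIONS = {
    "first": lambda values: values[0],
    "last": lambda values: values[-1],
    "mean": lambda values: sum(values) / len(values),
}


def group_by(rows, key, selected):
    """Group rows (a list of dicts) by `key`.

    `selected` maps each column to the list of aggregation names chosen
    for it. A column yields one output column per chosen aggregation,
    so an empty list means the column does not appear at all.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    result = []
    for group_value, members in groups.items():
        out = {key: group_value}
        for column, names in selected.items():
            values = [member[column] for member in members]
            for name in names:
                out[f"{column} - {name}"] = AGGREGATIONS[name](values)
        result.append(out)
    return result


# Made-up data: two aggregations for Feat2, none for Feat3.
rows = [
    {"Feat1": "a", "Feat2": 1.0, "Feat3": 10.0},
    {"Feat1": "a", "Feat2": 3.0, "Feat3": 20.0},
    {"Feat1": "b", "Feat2": 5.0, "Feat3": 30.0},
]
selected = {"Feat2": ["first", "last"], "Feat3": []}
grouped = group_by(rows, "Feat1", selected)
print(grouped)
```

Here "Feat2" shows up twice in each output row (once per selected aggregation), while "Feat3" is absent because nothing was selected for it.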

hydrastarmaster commented 2 years ago

1) You're right, sorry. So the problem is with Excel and the decimal separator; I saw a mirage earlier. Thanks.

2) I've changed only the type of "Feat1", from Text to Class: image

Then in "Group by" the order has changed: "Feat1" appears as the first row and works like in the Iris example, without any aggregation ticked: image

And the output comes out right: image

So it has something to do with the Class/Text difference. In any case, it seems to be manageable by the user.