biolab / orange3

🍊 📊 💡 Orange: Interactive data analysis
https://orangedatamining.com

File widget: add option to skip a range of rows not holding headers or data #6675

Closed: wvdvegte closed this issue 11 months ago

wvdvegte commented 11 months ago

What's your use case? Many datasets available online as open data contain rows that are neither column headers nor data records. Often these rows hold descriptions of the features or other general information such as licensing terms, and they can appear above or below the actual data. Some datasets also have column headers spanning more than one row. When such datasets are imported into Orange, these rows confuse the mechanisms that recognize variables and variable types.

What's your proposed solution? It would be nice to be able to specify a range of rows to be disregarded when importing a file, e.g. rows 3-5.

Are there any alternative solutions? Using a spreadsheet to do it manually.

markotoplak commented 11 months ago

Did you try the "CSV File Import" widget? That one allows skipping rows. Right-click the row index and you'll see the option to skip it.

Please report if you managed to do anything useful with it. Thanks!

wvdvegte commented 11 months ago

Thanks, @markotoplak, I wasn't aware of it. It works, although I still have a suggestion to improve this functionality: allow shift-clicking to select multiple rows to skip. For instance, the weather data files from the Dutch meteorological institute start with 58 rows that have to be skipped, and row 60 has to be skipped as well. Skipping them one by one is cumbersome.
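For readers who want a scripted workaround in the meantime, the "58 rows plus row 60" case could be pre-processed outside Orange along these lines. This is only a minimal sketch using Python's standard `csv` module, not anything Orange provides: the helper names `parse_row_spec` and `read_skipping_rows`, the `"1-58,60"` spec format, and the sample data are all assumptions made up for illustration.

```python
import csv
import io


def parse_row_spec(spec):
    """Turn a spec such as "1-58,60" into a set of 1-based row numbers.

    Hypothetical helper: the spec syntax is an assumption, not an Orange feature.
    """
    rows = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(p) for p in part.split("-", 1))
            rows.update(range(start, end + 1))  # ranges are inclusive
        else:
            rows.add(int(part))
    return rows


def read_skipping_rows(text_stream, skip_spec, **csv_kwargs):
    """Return the CSV rows whose 1-based position is not listed in skip_spec."""
    skip = parse_row_spec(skip_spec)
    reader = csv.reader(text_stream, **csv_kwargs)
    return [row for number, row in enumerate(reader, start=1) if number not in skip]


# Tiny stand-in for a file with two preamble rows, a header, a units row, and data.
sample = io.StringIO(
    "# licence text\n"
    "# feature descriptions\n"
    "station,date,temp\n"
    "-,-,0.1 C\n"
    "260,20240101,75\n"
    "260,20240102,68\n"
)
rows = read_skipping_rows(sample, "1-2,4")
```

For the weather files described above, the call would presumably be `read_skipping_rows(f, "1-58,60")` on an opened file. Note that `csv.reader` counts parsed records, so a quoted field containing a line break would shift the numbering relative to physical lines.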

janezd commented 11 months ago

Notes from discussion: we could improve the csv reader by

wvdvegte commented 11 months ago

Sounds good to me. Just one question: why is this possible with CSV files only? If I want to skip rows in an xlsx file, I'd have to convert it to CSV first...

janezd commented 11 months ago

The File widget and the CSV widget have different sets of features. We would like to have a single widget that would support everything, but merging them into one is such a challenge that nobody has volunteered so far. :(

Another problem is that the CSV widget is written really well and nobody would like to spoil it, while the File widget is rather bad and nobody would like to touch it.