BLE-LTER / BLEdatatools

R package to download and collate BLE LTER data from EDI repository

To include or omit flagged data? #9

Open alinaCO2spera opened 2 years ago

alinaCO2spera commented 2 years ago

Should the product include cells with flagged data? Can this be an argument in download_data()? How would this work for collated data?

twhiteaker commented 2 years ago

I typically see flagged data included, along with the flags. I can see how that could be tricky with collated data. What if two datasets used the same flag but with different meanings? What if a dataset included multiple fields with different types of flags? But if the scope is to just work with BLE data, maybe that won't be an issue.

atn38 commented 2 years ago

Exactly @twhiteaker, thanks for visiting us. BLE data already has both of your scenarios.

If we offer an Excel output option, the flag and its definition can be lifted from the EML metadata and encoded into a comment on the particular cell. This will be a challenge to implement! But I think it's amazing that this is even possible, thanks to the power of high quality metadata :) Not sure what other cell-level annotation options there are.

twhiteaker commented 2 years ago

I recommend avoiding non-vanilla features such as comments. I think it'd be better to have the flag definition in the table or on a separate sheet.
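A minimal sketch of the "flag definition in the table" idea, in base R. Everything here is hypothetical: the `dataset_id`, `flag`, and `flag_description` column names, the flag code "Q", and both definitions are made up for illustration, and the lookup table is assumed to have been assembled from each dataset's EML metadata. Joining on dataset and flag code together lets the same code carry a different meaning in each dataset.

```r
# Hypothetical collated data: the same flag code "Q" appears in two datasets.
collated <- data.frame(
  dataset_id = c("ds1", "ds1", "ds2"),
  value      = c(1.2, 3.4, 5.6),
  flag       = c(NA, "Q", "Q"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table, e.g. assembled from each dataset's EML metadata.
flag_lookup <- data.frame(
  dataset_id       = c("ds1", "ds2"),
  flag             = c("Q", "Q"),
  flag_description = c("questionable value", "below quantification limit"),
  stringsAsFactors = FALSE
)

# Left join on dataset AND flag code, so each description becomes an
# ordinary cell value that users can filter on. Unflagged rows get NA.
with_desc <- merge(collated, flag_lookup,
                   by = c("dataset_id", "flag"),
                   all.x = TRUE, sort = FALSE)

# Example filter by description rather than by code.
questionable <- with_desc[which(with_desc$flag_description == "questionable value"), ]
```

The same `flag_lookup` table could also be written out on its own as the "separate sheet" option.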

atn38 commented 2 years ago

True. Although I'd say Excel is already less vanilla than CSV, which we will also have. Philosophically, why use Excel if we aren't gonna take advantage of Excel features? IF we offer Excel output, it will be because people like things in one place, so I see possibly adding comments as an extension of "everything in one place". Practically, we probably won't have time/manpower.

twhiteaker commented 2 years ago

Excel is a common tool for scientists, so that's why it's a target for us. I'm not opposed to taking advantage of Excel features. If you want to put descriptions in comments, that's fine. But I could see people wanting to filter by flags and/or flag descriptions, which is why I think it's important to have them in cell values.

Brainstorming (and not using GitHub issues appropriately!), taking advantage of pivot tables in Excel to summarize data, or including some charts by default, would be nice features to take advantage of.

atn38 commented 2 years ago

Good point about filtering by flag values @twhiteaker. That def tips the scales toward having them in cells then.

atn38 commented 2 years ago

Working on collating data now: one issue I'm running into is that our flag-type column names are not consistent. Some data tables have "flag_N" or more descriptive names, and some just say "flag". This makes it difficult to tell which flags apply to which data in a collated setting.
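One way to surface the inconsistent names before collating, as a rough base R sketch. The column names below are invented, and the rule that a flag column is named "flag" or starts with "flag_" is an assumption, not a documented BLE convention.

```r
# Hypothetical column names pooled from several data tables.
cols <- c("station", "date_time", "temperature", "flag_temperature",
          "chlorophyll", "flag", "flag_1")

# Assumed convention: a flag column is named "flag" or starts with "flag_".
is_flag   <- grepl("^flag(_|$)", cols, ignore.case = TRUE)
flag_cols <- cols[is_flag]

# Strip the prefix; a flag column is ambiguous when what remains
# does not name a data column in the same table.
suffix    <- sub("^flag_?", "", flag_cols, ignore.case = TRUE)
ambiguous <- flag_cols[!(suffix %in% cols)]

print(ambiguous)
```

Here `flag_temperature` maps cleanly to `temperature`, while `flag` and `flag_1` are reported as needing a manual mapping.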

alinaCO2spera commented 2 years ago

@atn38 and I talked about this issue today and decided we are moving forward with including all flag columns in the collated data product, so users can filter by and/or remove flagged data as they wish. For the most part, flag column names are not a problem. The exception is the sediment pigment flags, which do not have descriptive names due to the original data formatting.
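With the flag columns kept in the collated product, removing flagged rows on the user side could look something like the base R sketch below. `drop_flagged()` is a hypothetical helper, not a function in BLEdatatools, and it assumes that a cell is flagged when it is neither NA nor an empty string and that flag column names start with "flag".

```r
# Hypothetical helper: drop rows where any flag column holds a flag.
drop_flagged <- function(df, flag_cols = grep("^flag", names(df), value = TRUE)) {
  if (length(flag_cols) == 0) return(df)
  flags <- as.matrix(df[, flag_cols, drop = FALSE])
  # Assumption: a cell is flagged when it is neither NA nor "".
  flagged <- !is.na(flags) & flags != ""
  df[rowSums(flagged) == 0, , drop = FALSE]
}

# Hypothetical collated data with one flagged row (station B).
collated <- data.frame(
  station          = c("A", "B", "C"),
  temperature      = c(1.5, 2.5, 3.5),
  flag_temperature = c(NA, "Q", ""),
  stringsAsFactors = FALSE
)

clean <- drop_flagged(collated)
```

Passing `flag_cols` explicitly lets a user drop rows based on only the flags they care about.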

@atn38: do you think fixing the sediment pigment flag column(s) should happen inside collate_data()?

atn38 commented 2 years ago

@alinaCO2spera yea, it can be direct lines of code inside collate_data, or a helper function that then gets called in collate_data. I'm actually sorta doing something about the sediment pigment flags -- I'm gonna append the pigment name after "flag" and make the table wider, e.g. there are a bunch of columns named "flag_Alloxanthin", "flag_Fucoxanthin", etc. in the collated product now.
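A rough sketch of that widening step using base R's `reshape()`. It assumes the pigment data start out in a long layout with one generic `flag` column; the actual source layout may differ. The `sample_id` and `concentration` column names, the values, and the "BDL" flag code are all made up for illustration.

```r
# Hypothetical long-format pigment data with a single generic flag column.
pigments_long <- data.frame(
  sample_id     = c("s1", "s1", "s2", "s2"),
  pigment       = c("Alloxanthin", "Fucoxanthin", "Alloxanthin", "Fucoxanthin"),
  concentration = c(0.12, 0.40, 0.08, 0.55),
  flag          = c(NA, "BDL", NA, NA),
  stringsAsFactors = FALSE
)

# Widen so each pigment gets its own value column and its own flag column;
# sep = "_" yields names such as "flag_Alloxanthin" and "flag_Fucoxanthin".
pigments_wide <- reshape(
  pigments_long,
  idvar     = "sample_id",
  timevar   = "pigment",
  v.names   = c("concentration", "flag"),
  direction = "wide",
  sep       = "_"
)

print(names(pigments_wide))
```

After widening, every flag column name says which pigment it applies to, which is what makes the flags usable in the collated product.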